Great! So it is in Crunch. Does it support the hadoop-job jar format, or only pure Java jars?

On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
> On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I think i need functionality to add more jars (or external hadoop-jar)
>> to drive that from an R package. Just setting job jar by class is not
>> enough. I can push overall job-jar as an additional jar to R package;
>> however, i cannot really run hadoop command line on it, i need to set
>> up classpath thru RJava.
>>
>> Traditional single hadoop job jar will unlikely work here since we
>> cannot hardcode pipelines in java code but rather have to construct
>> them on the fly. (well, we could serialize pipeline definitions from R
>> and then replay them in a driver -- but that's too cumbersome and more
>> work than it has to be.) There's no reason why i shouldn't be able to
>> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
>> off a pipeline.
>>
>
> o.a.c.util.DistCache.addJarToDistributedCache?
>
>>
>> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> > Ok, sounds very promising...
>> >
>> > i'll try to start digging on the driver part this week then (Pipeline
>> > wrapper in R5).
>> >
>> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
>> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >>> Ok, cool.
>> >>>
>> >>> So what state is Crunch in? I take it is in a fairly advanced state.
>> >>> So every api mentioned in the FlumeJava paper is working, right? Or
>> >>> there's something that is not working specifically?
>> >>
>> >> I think the only thing in the paper that we don't have in a working
>> >> state is MSCR fusion. It's mostly just a question of prioritizing it
>> >> and getting the work done.
>> >>
>> >>>
>> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
>> >>>> Hey Dmitriy,
>> >>>>
>> >>>> Got a fork going and looking forward to playing with crunchR this weekend--
>> >>>> thanks!
>> >>>>
>> >>>> J
>> >>>>
>> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >>>>
>> >>>>> Project template https://github.com/dlyubimov/crunchR
>> >>>>>
>> >>>>> Default profile does not compile R artifact. R profile compiles R
>> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
>> >>>>> command line, e.g.
>> >>>>>
>> >>>>> mvn install -DR
>> >>>>>
>> >>>>> there's also a helper that installs the snapshot version of the
>> >>>>> package in the crunchR module.
>> >>>>>
>> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
>> >>>>> in public maven repos; so it is installed into my github maven repo so
>> >>>>> far. Should compile for 3rd party.
>> >>>>>
>> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
>> >>>>> compilation requires roxygen2 (i think).
>> >>>>>
>> >>>>> For some reason RProtoBuf fails to import into another package, got a
>> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
>> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
>> >>>>> problem though...
>> >>>>>
>> >>>>> other than the template, not much else has been done so far... finding
>> >>>>> hadoop libraries and adding it to the package path on initialization
>> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
>> >>>>> transitives to the crunchR's java part...
>> >>>>>
>> >>>>> No legal stuff...
>> >>>>>
>> >>>>> No readmes... complete stealth at this point.
>> >>>>>
>> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >>>>> > Ok, cool. I will try to roll project template by some time next week.
>> >>>>> > we can start with prototyping and benchmarking something really
>> >>>>> > simple, such as parallelDo().
>> >>>>> >
>> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
>> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
>> >>>>> > name it has to be) in a comparable time (performance) but with much
>> >>>>> > fewer lines of code. (say one of factorization or clustering things)
>> >>>>> >
>> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
>> >>>>> >> I am not much of R user but I am interested to see how well we can integrate
>> >>>>> >> the two. I would be happy to help.
>> >>>>> >>
>> >>>>> >> regards,
>> >>>>> >> Rahul
>> >>>>> >>
>> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
>> >>>>> >>>
>> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >>>>> >>>>
>> >>>>> >>>> Yep, ok.
>> >>>>> >>>>
>> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
>> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
>> >>>>> >>>> have a template to look at, it would be useful i guess too.
>> >>>>> >>>
>> >>>>> >>> No, please go right ahead.
>> >>>>> >>>
>> >>>>> >>>>
>> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
>> >>>>> >>>>>
>> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
>> >>>>> >>>>> repo?
>> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
>> >>>>> >>>>>
>> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
>> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
>> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
>> >>>>> >>>>>>
>> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
>> >>>>> >>>>>> it completely separate.
>> >>>>> >>>>>>
>> >>>>> >>>>>> -d
>> >>>>> >>>>>>
>> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
>> >>>>> >>>>>>> Murray Stokely?
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Hello,
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
>> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
>> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R
>> >>>>> >>>>>>>> mapping, but i am not sure if just a direct mapping from R to Crunch
>> >>>>> >>>>>>>> would be sufficient (and, for most part, efficient). RJava/JRI and jni
>> >>>>> >>>>>>>> seem to be a pretty terrible performer to do that directly.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
>> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
>> >>>>> >>>>>>>> good synergy.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
>> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
>> >>>>> >>>>>>>> like a natural place to poke.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> Thanks.
>> >>>>> >>>>>>>>
>> >>>>> >>>>>>>> -Dmitriy
>> >>>>> >>>>>>>
>> >>>>> >>>>>>> --
>> >>>>> >>>>>>> Director of Data Science
>> >>>>> >>>>>>> Cloudera
>> >>>>> >>>>>>> Twitter: @josh_wills
>> >>>>> >>>
>> >>>>> >>
>> >>>>>
>> >>>>
>> >>>> --
>> >>>> Director of Data Science
>> >>>> Cloudera <http://www.cloudera.com>
>> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
