On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Great! so it is in Crunch.
>
> does it support hadoop-job jar format or only pure java jars?
>

I think just pure jars-- you're referring to hadoop-job format as having all the dependencies in a lib/ directory within the jar?

>
> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> I think i need functionality to add more jars (or external hadoop-jar)
> >> to drive that from an R package. Just setting job jar by class is not
> >> enough. I can push overall job-jar as an additional jar to R package;
> >> however, i cannot really run hadoop command line on it, i need to set
> >> up classpath thru RJava.
> >>
> >> Traditional single hadoop job jar will unlikely work here since we
> >> cannot hardcode pipelines in java code but rather have to construct
> >> them on the fly. (well, we could serialize pipeline definitions from R
> >> and then replay them in a driver -- but that's too cumbersome and more
> >> work than it has to be.) There's no reason why i shouldn't be able to
> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
> >> off a pipeline.
> >>
> >
> > o.a.c.util.DistCache.addJarToDistributedCache?
> >
> >>
> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >> > Ok, sounds very promising...
> >> >
> >> > i'll try to start digging on the driver part this week then (Pipeline
> >> > wrapper in R5).
> >> >
> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >> >>> Ok, cool.
> >> >>>
> >> >>> So what state is Crunch in? I take it is in a fairly advanced state.
> >> >>> So every api mentioned in the FlumeJava paper is working, right? Or
> >> >>> there's something that is not working specifically?
> >> >>
> >> >> I think the only thing in the paper that we don't have in a working
> >> >> state is MSCR fusion. It's mostly just a question of prioritizing it
> >> >> and getting the work done.
> >> >>
> >> >>>
> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
> >> >>>> Hey Dmitriy,
> >> >>>>
> >> >>>> Got a fork going and looking forward to playing with crunchR this weekend--
> >> >>>> thanks!
> >> >>>>
> >> >>>> J
> >> >>>>
> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >> >>>>
> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> >> >>>>>
> >> >>>>> Default profile does not compile R artifact. R profile compiles R
> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to mvn
> >> >>>>> command line, e.g.
> >> >>>>>
> >> >>>>> mvn install -DR
> >> >>>>>
> >> >>>>> there's also a helper that installs the snapshot version of the
> >> >>>>> package in the crunchR module.
> >> >>>>>
> >> >>>>> There's RJava and JRI java dependencies which i did not find anywhere
> >> >>>>> in public maven repos; so it is installed into my github maven repo so
> >> >>>>> far. Should compile for 3rd party.
> >> >>>>>
> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R Doc
> >> >>>>> compilation requires roxygen2 (i think).
> >> >>>>>
> >> >>>>> For some reason RProtoBuf fails to import into another package, got a
> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may be a
> >> >>>>> problem though...
> >> >>>>>
> >> >>>>> other than the template, not much else has been done so far... finding
> >> >>>>> hadoop libraries and adding it to the package path on initialization
> >> >>>>> via "hadoop classpath"... adding Crunch jars and its non-"provided"
> >> >>>>> transitives to the crunchR's java part...
> >> >>>>>
> >> >>>>> No legal stuff...
> >> >>>>>
> >> >>>>> No readmes... complete stealth at this point.
> >> >>>>>
> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >> >>>>> > Ok, cool. I will try to roll project template by some time next week.
> >> >>>>> > we can start with prototyping and benchmarking something really
> >> >>>>> > simple, such as parallelDo().
> >> >>>>> >
> >> >>>>> > My interim goal is to perhaps take some more or less simple algorithm
> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or whatever
> >> >>>>> > name it has to be) in a comparable time (performance) but with much
> >> >>>>> > fewer lines of code. (say one of factorization or clustering things)
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
> >> >>>>> >> I am not much of R user but I am interested to see how well we can integrate
> >> >>>>> >> the two. I would be happy to help.
> >> >>>>> >>
> >> >>>>> >> regards,
> >> >>>>> >> Rahul
> >> >>>>> >>
> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >> >>>>> >>>
> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >> >>>>> >>>>
> >> >>>>> >>>> Yep, ok.
> >> >>>>> >>>>
> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven project
> >> >>>>> >>>> with java/R code tree (I have been doing that a lot lately). Or if you
> >> >>>>> >>>> have a template to look at, it would be useful i guess too.
> >> >>>>> >>>
> >> >>>>> >>> No, please go right ahead.
> >> >>>>> >>>
> >> >>>>> >>>>
> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
> >> >>>>> >>>>>
> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to help. Github
> >> >>>>> >>>>> repo?
> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
> >> >>>>> >>>>>
> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype on top of
> >> >>>>> >>>>>> Crunch for something simple. This should both save time and prove or
> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework or we can keep
> >> >>>>> >>>>>> it completely separate.
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> -d
> >> >>>>> >>>>>>
> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the talk? Was it
> >> >>>>> >>>>>>> Murray Stokely?
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> Hello,
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience of R mapping of
> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of applications
> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped using flume R.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> I did not quite get the details of Google implementation of R mapping,
> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to Crunch would be
> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI and jni seem to
> >> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could have a contributed
> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would be just a very
> >> >>>>> >>>>>>>> good synergy.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising for open source
> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest, Crunch list seems
> >> >>>>> >>>>>>>> like a natural place to poke.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> Thanks.
> >> >>>>> >>>>>>>>
> >> >>>>> >>>>>>>> -Dmitriy
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>>
> >> >>>>> >>>>>>> --
> >> >>>>> >>>>>>> Director of Data Science
> >> >>>>> >>>>>>> Cloudera
> >> >>>>> >>>>>>> Twitter: @josh_wills
> >> >>>>> >>>
> >> >>>>> >>
> >> >>>>>
> >> >>>>
> >> >>>> --
> >> >>>> Director of Data Science
> >> >>>> Cloudera <http://www.cloudera.com>
> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
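
For reference, on the pure-jar versus hadoop-job-jar question at the top of this thread: a minimal Java sketch of how one might tell the two apart, assuming "hadoop-job format" means dependency jars nested under lib/ inside the jar, as described above. The names JobJarInspector, nestedLibJars, isHadoopJobJar and writeJar are hypothetical and are not part of Crunch or Hadoop; the sketch uses only java.util.jar from the JDK.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical helper, for illustration only (not a Crunch or Hadoop class).
public class JobJarInspector {

    /** Lists entries that look like dependency jars nested under lib/. */
    public static List<String> nestedLibJars(File jar) throws IOException {
        List<String> nested = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar)) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.startsWith("lib/") && name.endsWith(".jar")) {
                    nested.add(name);
                }
            }
        }
        return nested;
    }

    /** Assumption: a jar counts as "hadoop-job" style if it bundles any lib/*.jar. */
    public static boolean isHadoopJobJar(File jar) throws IOException {
        return !nestedLibJars(jar).isEmpty();
    }

    /** Writes a throwaway jar with empty entries so the check can be tried out. */
    public static File writeJar(String... entryNames) throws IOException {
        File jar = File.createTempFile("demo", ".jar");
        jar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            for (String name : entryNames) {
                out.putNextEntry(new JarEntry(name));
                out.closeEntry();
            }
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        File pure = writeJar("com/example/Driver.class");
        File job = writeJar("com/example/Driver.class", "lib/dep-a.jar", "lib/dep-b.jar");
        System.out.println("pure jar is job jar? " + isHadoopJobJar(pure));
        System.out.println("nested jars in job jar: " + nestedLibJars(job));
    }
}
```

If only pure jars are supported, a driver could use a check like this to decide whether the nested lib/ jars need to be unpacked and registered one by one (for example through the DistCache call suggested above) rather than shipping the job jar as a single unit.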
