For Hadoop nodes, I guess yet another option is to soft-link the .so into Hadoop's native lib folder.
On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:

I actually want to defer this to the hadoop admins; we just need to create a procedure for setting up nodes, ideally as simple as possible. Something like:

1) set up R
2) install.packages("rJava", "RProtoBuf", "crunchR")
3) R CMD javareconf
4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to either the mapred command lines or LD_LIBRARY_PATH...

But it will depend on their versions of hadoop, the jre, etc. I hoped Crunch might have something to hide a lot of that complexity (since it is about hiding complexities, for the most part :) ). Besides, hadoop has a way to ship .so's to the backend, so if Crunch had an api to do something similar, it is conceivable that the driver might yank and ship it too, to hide that complexity as well. But then there's a host of issues around how to handle potentially different rJava versions installed on different nodes... So it increasingly looks like something we might want to defer to sysops, with an approximate set of requirements.

On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> wrote:

On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <[email protected]> wrote:

So java tasks need to be able to load libjri.so from whatever system.file("jri", package="rJava") says.

Traditionally, these issues were handled with -Djava.library.path. Apparently there's nothing a java task can do to enable the loadLibrary() call to see the damn library once it has started. But -Djava.library.path requires the nodes to configure the jvm command line and lock it against modification by the client, which is fine.

I also discovered that LD_LIBRARY_PATH actually works with jre 1.6 (again).

But... any other suggestions about best practice for configuring crunch to run users' .so's?

Thanks.

Josh Wills:

Not off the top of my head. I suspect that whatever you come up with will become the "best practice." :)
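A minimal sketch of the task-side loading logic discussed above, assuming a hypothetical crunchr.jri.path job property that a driver could set to the result of system.file("jri", package="rJava") on the node; System.load() bypasses java.library.path entirely, while System.loadLibrary() falls back to whatever -Djava.library.path / LD_LIBRARY_PATH the task JVM was started with:

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: "crunchr.jri.path" is a hypothetical property, not an existing
    // Crunch or crunchR configuration key.
    public final class JriLoader {

      private JriLoader() {}

      public static void loadJri(Configuration conf) {
        String jriDir = conf.get("crunchr.jri.path");
        if (jriDir != null) {
          // System.load() takes an absolute path and ignores java.library.path.
          System.load(jriDir + "/libjri.so");
        } else {
          // System.loadLibrary() resolves "jri" -> libjri.so against java.library.path,
          // i.e. whatever the node's task JVM command line or LD_LIBRARY_PATH provides.
          System.loadLibrary("jri");
        }
      }
    }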
On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:

I believe that is a safe assumption, at least right now.

On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:

Question.

So in the Crunch api, initialize() doesn't get an emitter, and process() gets an emitter every time.

However, my guess is that any single reincarnation of a DoFn object in the backend will always be getting the same emitter through its lifecycle. Is that an admissible assumption, or is there currently a counterexample to it?

The problem is that as I implement the two-way pipeline of input and emitter data between R and Java, I am bulking these calls together for performance reasons. Each individual datum in these chunks of data will not have emitter function information attached to it in any way. (Well, it could, but that would be a performance killer, and I bet the emitter never changes.)

So, thoughts? Can I assume the emitter never changes between the first and last call to a DoFn instance?

Thanks.
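A sketch of the buffering DoFn this bulking strategy implies, written against the Crunch DoFn/Emitter API and assuming the emitter handed to process() really is the same object for the lifetime of the instance; processChunk() is a hypothetical stand-in for the bulked round trip to the R side:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public abstract class BufferingDoFn<S, T> extends DoFn<S, T> {

      private static final int CHUNK_SIZE = 1000; // illustrative chunk size

      private final List<S> buffer = new ArrayList<S>();
      private transient Emitter<T> cachedEmitter;

      @Override
      public void process(S input, Emitter<T> emitter) {
        cachedEmitter = emitter; // assumed not to change between calls
        buffer.add(input);
        if (buffer.size() >= CHUNK_SIZE) {
          flush();
        }
      }

      @Override
      public void cleanup(Emitter<T> emitter) {
        cachedEmitter = emitter;
        flush(); // drain whatever is still buffered at the end of the task
      }

      private void flush() {
        if (buffer.isEmpty()) {
          return;
        }
        // Hand the whole chunk to the R side at once, then emit the results.
        for (T result : processChunk(buffer)) {
          cachedEmitter.emit(result);
        }
        buffer.clear();
      }

      // Hypothetical hook for the bulked R call.
      protected abstract Iterable<T> processChunk(List<S> chunk);
    }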
On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yes...

I think it worked for me before, although just adding all the jars from the R package distribution would be a somewhat more appropriate approach -- but it creates a problem with jars in dependent R packages. I think it would be much easier to just compile a hadoop-job file and stick it in, rather than cherry-picking individual jars from who knows how many locations.

I think I used the hadoop job format with the distributed cache before and it worked... at least with Pig's "register jar" functionality.

Ok, I guess I will just try it and see if it works.

On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:

On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:

Great! so it is in Crunch.

Does it support the hadoop-job jar format or only pure java jars?

Josh Wills:

I think just pure jars -- you're referring to the hadoop-job format as having all the dependencies in a lib/ directory within the jar?

On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:

On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

I think I need functionality to add more jars (or an external hadoop-jar) and to drive that from an R package. Just setting the job jar by class is not enough. I can push the overall job-jar as an additional jar to the R package; however, I cannot really run the hadoop command line on it, I need to set up the classpath through RJava.

A traditional single hadoop job jar is unlikely to work here, since we cannot hardcode pipelines in java code but rather have to construct them on the fly. (Well, we could serialize pipeline definitions from R and then replay them in a driver -- but that's too cumbersome and more work than it has to be.) There's no reason why I shouldn't be able to do a pig-like "register jar" or "setJobJar" (mahout-like) when kicking off a pipeline.

Josh Wills:

o.a.c.util.DistCache.addJarToDistributedCache?
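A sketch of the "register jar"-style driver step this suggests, using the DistCache helper mentioned above; the jar paths, the String overload, and the surrounding driver class are assumptions for illustration:

    import java.io.IOException;

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.util.DistCache;
    import org.apache.hadoop.conf.Configuration;

    public class RegisterJarsDriver {

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Ship the extra jars (e.g. the ones bundled with the crunchR package) to
        // the task nodes before the pipeline is planned; paths are illustrative.
        for (String jar : new String[] {"/opt/crunchR/java/crunchR.jar",
                                        "/opt/crunchR/java/JRI.jar"}) {
          DistCache.addJarToDistributedCache(conf, jar);
        }

        Pipeline pipeline = new MRPipeline(RegisterJarsDriver.class, conf);
        // ... build the pipeline here, then run it ...
        pipeline.done();
      }
    }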
On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

Ok, sounds very promising...

I'll try to start digging on the driver part this week then (Pipeline wrapper in R5).

On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:

On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:

Ok, cool.

So what state is Crunch in? I take it that it is in a fairly advanced state. So every api mentioned in the FlumeJava paper is working, right? Or is there something specific that is not working?

Josh Wills:

I think the only thing in the paper that we don't have in a working state is MSCR fusion. It's mostly just a question of prioritizing it and getting the work done.

On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:

Hey Dmitriy,

Got a fork going and looking forward to playing with crunchR this weekend -- thanks!

J

On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:

Project template: https://github.com/dlyubimov/crunchR

The default profile does not compile the R artifact; the R profile does. For convenience, it is enabled by supplying -DR on the mvn command line, e.g.

    mvn install -DR

There's also a helper that installs the snapshot version of the package in the crunchR module.

There are RJava and JRI java dependencies which I did not find anywhere in public maven repos, so they are installed into my github maven repo for now. It should compile for third parties.

-DR compilation requires R, RJava and, optionally, RProtoBuf. R doc compilation requires roxygen2 (I think).

For some reason RProtoBuf fails to import into another package -- I got a weird exception when I put @import RProtoBuf into crunchR, so RProtoBuf is now in the "Suggests" category. Down the road that may be a problem though...

Other than the template, not much else has been done so far... finding the hadoop libraries and adding them to the package path on initialization via "hadoop classpath"... adding the Crunch jars and their non-"provided" transitives to crunchR's java part...

No legal stuff...

No readmes... complete stealth at this point.

On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:

Ok, cool. I will try to roll a project template by some time next week. We can start with prototyping and benchmarking something really simple, such as parallelDo().

My interim goal is to perhaps take some more or less simple algorithm from Mahout and demonstrate that it can be solved with Rcrunch (or whatever name it has to be) in comparable time (performance) but with much fewer lines of code. (Say, one of the factorization or clustering things.)
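For reference, a minimal Java Crunch parallelDo() pipeline of the kind such an R wrapper would have to generate; the input and output paths are illustrative:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class UpperCaseLines {

      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(UpperCaseLines.class);

        PCollection<String> lines = pipeline.readTextFile("/tmp/crunchR-input");

        // The piece an R-level parallelDo() would map onto: a DoFn plus a PType.
        PCollection<String> upper = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            emitter.emit(line.toUpperCase());
          }
        }, Writables.strings());

        pipeline.writeTextFile(upper, "/tmp/crunchR-output");
        pipeline.done();
      }
    }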
On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:

I am not much of an R user, but I am interested to see how well we can integrate the two. I would be happy to help.

regards,
Rahul

On 18-10-2012 04:04, Josh Wills wrote:

On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yep, ok.

I imagine it has to be an R module, so I can set up a maven project with a java/R code tree (I have been doing that a lot lately). Or if you have a template to look at, that would be useful too, I guess.

Josh Wills:

No, please go right ahead.

On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:

I'd like it to be separate at first, but I am happy to help. Github repo?

On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:

Ok, maybe there's a benefit to trying a JRI/RJava prototype on top of Crunch for something simple. This should both save time and prove or disprove whether Crunch-via-RJava integration is viable.

On my part I can try to do it within the Crunch framework, or we can keep it completely separate.

-d

On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:

I am an avid R user and would be into it -- who gave the talk? Was it Murray Stokely?
On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:

Hello,

I was pretty excited to learn of Google's experience with an R mapping of flume java at one of the recent BARUGs. I think a lot of applications similar to what we do in Mahout could be prototyped using flume R.

I did not quite get the details of Google's implementation of the R mapping, but I am not sure that just a direct mapping from R to Crunch would be sufficient (and, for the most part, efficient). RJava/JRI and jni seem to be pretty terrible performers for doing that directly.

On top of that, I am thinking that if this project could have a contributed adapter to Mahout's distributed matrices, that would be a very good synergy.

Is there anyone interested in contributing to, or advising on, an open source version of flume R support? Just gauging interest; the Crunch list seems like a natural place to poke.

Thanks.

-Dmitriy
