Re: Flume R -- any interest?

Dmitriy Lyubimov Mon, 12 Nov 2012 17:37:31 -0800

I actually want to defer this to the Hadoop admins; we just need to write up
a procedure for setting up nodes, ideally as simple as possible. Something
like:


1) set up R
2) install.packages(c("rJava","RProtoBuf","crunchR"))
3) R CMD javareconf
4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to
either the mapred command lines (-Djava.library.path) or LD_LIBRARY_PATH...

But it will depend on their versions of Hadoop, the JRE, etc. I had hoped Crunch
might have something to hide a lot of that complexity (since it is about
hiding complexity, for the most part :) ). Besides, Hadoop has a way to
ship .so's to the backend, so if Crunch had an API to do something similar,
it is conceivable that the driver could grab libjri.so and ship it too, hiding
that complexity as well. But then there is a host of issues around handling
potentially different rJava versions installed on different nodes... So it
increasingly looks like something we should defer to sysops, giving them an
approximate set of requirements.
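
For the last step of the procedure, a rough per-node sketch might look like the
following. It is only an illustration under assumptions: the helper names
prepend_path and find_jri_dir are made up here (they are not part of crunchR,
rJava or Hadoop), and where the resulting value actually has to be configured
depends on the Hadoop and JRE versions in use.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of crunchR or Hadoop.

# Prepend a directory to a colon-separated search path, avoiding a
# dangling ':' when the existing path is empty.
prepend_path() {
  local dir="$1" current="$2"
  if [ -z "$current" ]; then
    printf '%s\n' "$dir"
  else
    printf '%s:%s\n' "$dir" "$current"
  fi
}

# Ask R where rJava keeps the JRI native library (libjri.so).
# Prints nothing if Rscript is missing or rJava is not installed.
find_jri_dir() {
  command -v Rscript >/dev/null 2>&1 || return 0
  Rscript --vanilla -e 'cat(system.file("jri", package="rJava"))' 2>/dev/null
}

jri_dir="$(find_jri_dir)"
if [ -n "$jri_dir" ]; then
  # Option A: environment variable that the task JVMs need to inherit.
  LD_LIBRARY_PATH="$(prepend_path "$jri_dir" "${LD_LIBRARY_PATH:-}")"
  export LD_LIBRARY_PATH
  # Option B: flag to add to the task JVM command line instead.
  echo "-Djava.library.path=$jri_dir"
else
  echo "rJava/JRI not found on this node; install rJava first" >&2
fi
```

Either option only helps if it is applied where the task JVMs are launched, which
is why it probably belongs in the admins' node setup rather than in client code.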


On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> wrote:

> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > so java tasks need to be able to load libjri.so from
> > whatever system.file("jri", package="rJava") says.
> >
> > Traditionally, these issues were handled with -Djava.library.path.
> > Apparently there's nothing java task can do to enable loadLibrary()
> command
> > to see the damn library once started. But -Djava.library.path requires
> for
> > nodes to configure and lock jvm command line from modifications of the
> > client.  which is fine.
> >
> > I also discovered that LD_LIBRARY_PATH actually works with jre 1.6
> (again).
> >
> > but... any other suggestions about best practice configuring crunch to
> run
> > user's .so's?
> >
>
> Not off the top of my head. I suspect that whatever you come up with will
> become the "best practice." :)
>
> >
> > thanks.
> >
> >
> >
> >
> >
> >
> > On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]>
> wrote:
> >
> > > I believe that is a safe assumption, at least right now.
> > >
> > >
> > > On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]>
> > > wrote:
> > >
> > > > Question.
> > > >
> > > > So in Crunch api, initialize() doesn't get an emitter. and the
> process
> > > gets
> > > > emitter every time.
> > > >
> > > > > However, my guess is that any single reincarnation of a DoFn object in the
> > > backend
> > > > will always be getting the same emitter thru its lifecycle. Is it an
> > > > admissible assumption or there's currently a counter example to that?
> > > >
> > > > The problem is that as i implement the two way pipeline of input and
> > > > emitter data between R and Java, I am bulking these calls together
> for
> > > > performance reasons. Each individual datum in these chunks of data
> will
> > > not
> > > > have attached emitter function information to them in any way. (well
> it
> > > > could but it would be a performance killer and i bet emitter never
> > > > changes).
> > > >
> > > > So, thoughts? can i assume emitter never changes between first and
> last
> > > > call to DoFn instance?
> > > >
> > > > thanks.
> > > >
> > > >
> > > > On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]
> >
> > > > wrote:
> > > >
> > > > > yes...
> > > > >
> > > > > i think it worked for me before, although just adding all jars
> from R
> > > > > package distribution would be a little bit more appropriate
> approach
> > > > > -- but it creates a problem with jars in dependent R packages. I
> > think
> > > > > it would be much easier to just compile a hadoop-job file and stick
> > it
> > > > > in rather than doing cherry-picking of individual jars from who
> knows
> > > > > how many locations.
> > > > >
> > > > > i think i used the hadoop job format with distributed cache before
> > and
> > > > > it worked... at least with Pig "register jar" functionality.
> > > > >
> > > > > ok i guess i will just try if it works.
> > > > >
> > > > > On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]>
> > > wrote:
> > > > > > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <
> > [email protected]
> > > >
> > > > > wrote:
> > > > > >
> > > > > >> Great! so it is in Crunch.
> > > > > >>
> > > > > >> does it support hadoop-job jar format or only pure java jars?
> > > > > >>
> > > > > >
> > > > > > I think just pure jars-- you're referring to hadoop-job format as
> > > > having
> > > > > > all the dependencies in a lib/ directory within the jar?
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <
> [email protected]>
> > > > > wrote:
> > > > > >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <
> > > > [email protected]>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> >> I think i need functionality to add more jars (or external
> > > > > hadoop-jar)
> > > > > >> >> to drive that from an R package. Just setting job jar by
> class
> > is
> > > > not
> > > > > > >> >> enough. I can push overall job-jar as an additional jar to R
> > > > package;
> > > > > >> >> however, i cannot really run hadoop command line on it, i
> need
> > to
> > > > set
> > > > > >> >> up classpath thru RJava.
> > > > > >> >>
> > > > > >> >> Traditional single hadoop job jar will unlikely work here
> since
> > > we
> > > > > >> >> cannot hardcode pipelines in java code but rather have to
> > > construct
> > > > > >> >> them on the fly. (well, we could serialize pipeline
> definitions
> > > > from
> > > > > R
> > > > > >> >> and then replay them in a driver -- but that's too cumbersome
> > and
> > > > > more
> > > > > >> >> work than it has to be.) There's no reason why i shouldn't be
> > > able
> > > > to
> > > > > >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when
> > > > kicking
> > > > > >> >> off a pipeline.
> > > > > >> >>
> > > > > >> >
> > > > > >> > o.a.c.util.DistCache.addJarToDistributedCache?
> > > > > >> >
> > > > > >> >
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> > > > > [email protected]>
> > > > > >> >> wrote:
> > > > > >> >> > Ok, sounds very promising...
> > > > > >> >> >
> > > > > >> >> > i'll try to start digging on the driver part this week then
> > > > > (Pipeline
> > > > > >> >> > wrapper in R5).
> > > > > >> >> >
> > > > > >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <
> > > > [email protected]
> > > > > >
> > > > > >> >> wrote:
> > > > > >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> > > > > [email protected]
> > > > > >> >
> > > > > >> >> wrote:
> > > > > >> >> >>> Ok, cool.
> > > > > >> >> >>>
> > > > > >> >> >>> So what state is Crunch in? I take it is in a fairly
> > advanced
> > > > > state.
> > > > > >> >> >>> So every api mentioned in the  FlumeJava paper is
> working ,
> > > > > right?
> > > > > >> Or
> > > > > >> >> >>> there's something that is not working specifically?
> > > > > >> >> >>
> > > > > >> >> >> I think the only thing in the paper that we don't have in
> a
> > > > > working
> > > > > >> >> >> state is MSCR fusion. It's mostly just a question of
> > > > prioritizing
> > > > > it
> > > > > >> >> >> and getting the work done.
> > > > > >> >> >>
> > > > > >> >> >>>
> > > > > >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <
> > > > [email protected]
> > > > > >
> > > > > >> >> wrote:
> > > > > >> >> >>>> Hey Dmitriy,
> > > > > >> >> >>>>
> > > > > >> >> >>>> Got a fork going and looking forward to playing with
> > crunchR
> > > > > this
> > > > > >> >> weekend--
> > > > > >> >> >>>> thanks!
> > > > > >> >> >>>>
> > > > > >> >> >>>> J
> > > > > >> >> >>>>
> > > > > >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> > > > > >> [email protected]>
> > > > > >> >> wrote:
> > > > > >> >> >>>>
> > > > > >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> Default profile does not compile R artifact . R profile
> > > > > compiles R
> > > > > >> >> >>>>> artifact. for convenience, it is enabled by supplying
> -DR
> > > to
> > > > > mvn
> > > > > >> >> >>>>> command line, e.g.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> mvn install -DR
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> there's also a helper that installs the snapshot
> version
> > of
> > > > the
> > > > > >> >> >>>>> package in the crunchR module.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> There's RJava and JRI java dependencies which i did not
> > > find
> > > > > >> anywhere
> > > > > >> >> >>>>> in public maven repos; so it is installed into my
> github
> > > > maven
> > > > > >> repo
> > > > > >> >> so
> > > > > >> >> >>>>> far. Should compile for 3rd party.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> -DR compilation requires R, RJava and optionally,
> > > RProtoBuf.
> > > > R
> > > > > Doc
> > > > > >> >> >>>>> compilation requires roxygen2 (i think).
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> For some reason RProtoBuf fails to import into another
> > > > package,
> > > > > >> got a
> > > > > >> >> >>>>> weird exception when i put @import RProtoBuf into
> > crunchR,
> > > so
> > > > > >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road
> > that
> > > > may
> > > > > >> be a
> > > > > >> >> >>>>> problem though...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> other than the template, not much else has been done so
> > > > far...
> > > > > >> >> finding
> > > > > >> >> >>>>> hadoop libraries and adding it to the package path on
> > > > > >> initialization
> > > > > >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
> > > > > >> non-"provided"
> > > > > >> >> >>>>> transitives to the crunchR's java part...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> No legal stuff...
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> No readmes... complete stealth at this point.
> > > > > >> >> >>>>>
> > > > > >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> > > > > >> >> [email protected]>
> > > > > >> >> >>>>> wrote:
> > > > > >> >> >>>>> > Ok, cool. I will try to roll project template by some
> > > time
> > > > > next
> > > > > >> >> week.
> > > > > >> >> >>>>> > we can start with prototyping and benchmarking
> > something
> > > > > really
> > > > > >> >> >>>>> > simple, such as parallelDo().
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> > My interim goal is to perhaps take some more or less
> > > simple
> > > > > >> >> algorithm
> > > > > >> >> >>>>> > from Mahout and demonstrate it can be solved with
> > Rcrunch
> > > > (or
> > > > > >> >> whatever
> > > > > >> >> >>>>> > name it has to be) in a comparable time (performance)
> > but
> > > > > with
> > > > > >> much
> > > > > >> >> >>>>> > fewer lines of code. (say one of factorization or
> > > > clustering
> > > > > >> >> things)
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> >
> > > > > >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <
> > > [email protected]
> > > > >
> > > > > >> wrote:
> > > > > >> >> >>>>> >> I am not much of R user but I am interested to see
> how
> > > > well
> > > > > we
> > > > > >> can
> > > > > >> >> >>>>> integrate
> > > > > >> >> >>>>> >> the two. I would be happy to help.
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>> >> regards,
> > > > > >> >> >>>>> >> Rahul
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> > > > > >> >> [email protected]>
> > > > > >> >> >>>>> >>> wrote:
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> Yep, ok.
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> I imagine it has to be an R module so I can set
> up a
> > > > maven
> > > > > >> >> project
> > > > > >> >> >>>>> >>>> with java/R code tree (I have been doing that a
> lot
> > > > > lately).
> > > > > >> Or
> > > > > >> >> if you
> > > > > >> >> >>>>> >>>> have a template to look at, it would be useful i
> > guess
> > > > > too.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>> No, please go right ahead.
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>>
> > > > > >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> > > > > >> >> [email protected]>
> > > > > >> >> >>>>> wrote:
> > > > > >> >> >>>>> >>>>>
> > > > > >> >> >>>>> >>>>> I'd like it to be separate at first, but I am
> happy
> > > to
> > > > > help.
> > > > > >> >> Github
> > > > > >> >> >>>>> >>>>> repo?
> > > > > >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
> > > > > >> [email protected]
> > > > > >> >> >
> > > > > >> >> >>>>> wrote:
> > > > > >> >> >>>>> >>>>>
> > > > > >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava
> > > > prototype
> > > > > on
> > > > > >> >> top of
> > > > > >> >> >>>>> >>>>>> Crunch for something simple. This should both
> save
> > > > time
> > > > > and
> > > > > >> >> prove or
> > > > > >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is
> > viable.
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> On my part i can try to do it within Crunch
> > > framework
> > > > > or we
> > > > > >> >> can keep
> > > > > >> >> >>>>> >>>>>> it completely separate.
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> -d
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
> > > > > >> >> [email protected]>
> > > > > >> >> >>>>> >>>>>> wrote:
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who
> > gave
> > > > the
> > > > > >> >> talk? Was
> > > > > >> >> >>>>> it
> > > > > >> >> >>>>> >>>>>>> Murray Stokely?
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy
> > Lyubimov <
> > > > > >> >> >>>>> [email protected]>
> > > > > >> >> >>>>> >>>>>>
> > > > > >> >> >>>>> >>>>>> wrote:
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> Hello,
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's
> > > experience
> > > > > of R
> > > > > >> >> mapping
> > > > > >> >> >>>>> of
> > > > > >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a
> > lot
> > > of
> > > > > >> >> applications
> > > > > >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be
> > > prototyped
> > > > > using
> > > > > >> >> flume R.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> I did not quite get the details of Google
> > > > > implementation
> > > > > >> of
> > > > > >> >> R
> > > > > >> >> >>>>> >>>>>>>> mapping,
> > > > > >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping
> from
> > R
> > > to
> > > > > >> Crunch
> > > > > >> >> would
> > > > > >> >> >>>>> be
> > > > > >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient).
> > > > RJava/JRI
> > > > > and
> > > > > >> >> jni
> > > > > >> >> >>>>> seem to
> > > > > >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that
> > > directly.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>>
> > > > > > >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project
> > could
> > > > > have a
> > > > > >> >> >>>>> contributed
> > > > > >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that
> > > would
> > > > > be
> > > > > >> >> just a
> > > > > >> >> >>>>> very
> > > > > >> >> >>>>> >>>>>>>> good synergy.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> Is there anyone interested in
> > > contributing/advising
> > > > > for
> > > > > >> open
> > > > > >> >> >>>>> source
> > > > > >> >> >>>>> >>>>>>>> version of flume R support? Just gauging
> > interest,
> > > > > Crunch
> > > > > >> >> list
> > > > > >> >> >>>>> seems
> > > > > >> >> >>>>> >>>>>>>> like a natural place to poke.
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> Thanks .
> > > > > >> >> >>>>> >>>>>>>>
> > > > > >> >> >>>>> >>>>>>>> -Dmitriy
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>>
> > > > > >> >> >>>>> >>>>>>> --
> > > > > >> >> >>>>> >>>>>>> Director of Data Science
> > > > > >> >> >>>>> >>>>>>> Cloudera
> > > > > >> >> >>>>> >>>>>>> Twitter: @josh_wills
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>>
> > > > > >> >> >>>>> >>
> > > > > >> >> >>>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>>
> > > > > >> >> >>>> --
> > > > > >> >> >>>> Director of Data Science
> > > > > >> >> >>>> Cloudera <http://www.cloudera.com>
> > > > > >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Director of Data Science
> > > > > >> > Cloudera <http://www.cloudera.com>
> > > > > >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Director of Data Science
> > > > > > Cloudera <http://www.cloudera.com>
> > > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
