For Hadoop nodes, I guess yet another option is to soft-link the .so into Hadoop's native lib folder.
On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:

I actually want to defer this to the hadoop admins; we just need to create a procedure for setting up nodes, ideally as simple as possible. Something like:

1) set up R
2) install.packages("rJava", "RProtoBuf", "crunchR")
3) R CMD javareconf
4) add the result of R --vanilla <<< 'system.file("jri", package="rJava")' to either the mapred command lines or LD_LIBRARY_PATH...

But it will depend on their versions of hadoop, the jre, etc. I hoped Crunch might have something to hide a lot of that complexity (since it is about hiding complexities, for the most part :) ). Besides, hadoop has a way to ship .so's to the backend, so if Crunch had an api to do something similar, it is conceivable that the driver might yank and ship it too, to hide that complexity as well. But then there's a host of issues around how to handle potentially different rJava versions installed on different nodes... So it increasingly looks like something we might want to defer to sysops, with an approximate set of requirements.

On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> wrote:

On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov <[email protected]> wrote:

So java tasks need to be able to load libjri.so from whatever system.file("jri", package="rJava") says.

Traditionally, these issues were handled with -Djava.library.path. Apparently there's nothing a java task can do to enable the loadLibrary() call to see the damn library once it has started. But -Djava.library.path requires the nodes to configure the jvm command line and lock it against modification by the client, which is fine.

I also discovered that LD_LIBRARY_PATH actually works with jre 1.6 (again).

But... any other suggestions about best practice for configuring crunch to run users' .so's?

Thanks.

Josh Wills:

Not off the top of my head. I suspect that whatever you come up with will become the "best practice." :)
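A minimal sketch of the task-side loading logic discussed above, assuming a hypothetical crunchr.jri.path job property that a driver could set to the result of system.file("jri", package="rJava") on the node; System.load() bypasses java.library.path entirely, while System.loadLibrary() falls back to whatever -Djava.library.path / LD_LIBRARY_PATH the task JVM was started with:

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: "crunchr.jri.path" is a hypothetical property, not an existing
    // Crunch or crunchR configuration key.
    public final class JriLoader {

      private JriLoader() {}

      public static void loadJri(Configuration conf) {
        String jriDir = conf.get("crunchr.jri.path");
        if (jriDir != null) {
          // System.load() takes an absolute path and ignores java.library.path.
          System.load(jriDir + "/libjri.so");
        } else {
          // System.loadLibrary() resolves "jri" -> libjri.so against java.library.path,
          // i.e. whatever the node's task JVM command line or LD_LIBRARY_PATH provides.
          System.loadLibrary("jri");
        }
      }
    }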
On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:

I believe that is a safe assumption, at least right now.

On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:

Question.

So in the Crunch api, initialize() doesn't get an emitter, and process() gets an emitter every time.

However, my guess is that any single reincarnation of a DoFn object in the backend will always be getting the same emitter through its lifecycle. Is that an admissible assumption, or is there currently a counterexample to it?

The problem is that as I implement the two-way pipeline of input and emitter data between R and Java, I am bulking these calls together for performance reasons. Each individual datum in these chunks of data will not have emitter function information attached to it in any way. (Well, it could, but that would be a performance killer, and I bet the emitter never changes.)

So, thoughts? Can I assume the emitter never changes between the first and last call to a DoFn instance?

Thanks.
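A sketch of the buffering DoFn this bulking strategy implies, written against the Crunch DoFn/Emitter API and assuming the emitter handed to process() really is the same object for the lifetime of the instance; processChunk() is a hypothetical stand-in for the bulked round trip to the R side:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public abstract class BufferingDoFn<S, T> extends DoFn<S, T> {

      private static final int CHUNK_SIZE = 1000; // illustrative chunk size

      private final List<S> buffer = new ArrayList<S>();
      private transient Emitter<T> cachedEmitter;

      @Override
      public void process(S input, Emitter<T> emitter) {
        cachedEmitter = emitter; // assumed not to change between calls
        buffer.add(input);
        if (buffer.size() >= CHUNK_SIZE) {
          flush();
        }
      }

      @Override
      public void cleanup(Emitter<T> emitter) {
        cachedEmitter = emitter;
        flush(); // drain whatever is still buffered at the end of the task
      }

      private void flush() {
        if (buffer.isEmpty()) {
          return;
        }
        // Hand the whole chunk to the R side at once, then emit the results.
        for (T result : processChunk(buffer)) {
          cachedEmitter.emit(result);
        }
        buffer.clear();
      }

      // Hypothetical hook for the bulked R call.
      protected abstract Iterable<T> processChunk(List<S> chunk);
    }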
On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yes...

I think it worked for me before, although just adding all the jars from the R package distribution would be a somewhat more appropriate approach -- but it creates a problem with jars in dependent R packages. I think it would be much easier to just compile a hadoop-job file and stick it in, rather than cherry-picking individual jars from who knows how many locations.

I think I used the hadoop job format with the distributed cache before and it worked... at least with Pig's "register jar" functionality.

Ok, I guess I will just try it and see if it works.

On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:

On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:

Great! so it is in Crunch.

Does it support the hadoop-job jar format or only pure java jars?

Josh Wills:

I think just pure jars -- you're referring to the hadoop-job format as having all the dependencies in a lib/ directory within the jar?

On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:

On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

I think I need functionality to add more jars (or an external hadoop-jar) and to drive that from an R package. Just setting the job jar by class is not enough. I can push the overall job-jar as an additional jar to the R package; however, I cannot really run the hadoop command line on it, I need to set up the classpath through RJava.

A traditional single hadoop job jar is unlikely to work here, since we cannot hardcode pipelines in java code but rather have to construct them on the fly. (Well, we could serialize pipeline definitions from R and then replay them in a driver -- but that's too cumbersome and more work than it has to be.) There's no reason why I shouldn't be able to do a pig-like "register jar" or "setJobJar" (mahout-like) when kicking off a pipeline.

Josh Wills:

o.a.c.util.DistCache.addJarToDistributedCache?
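A sketch of the "register jar"-style driver step this suggests, using the DistCache helper mentioned above; the jar paths, the String overload, and the surrounding driver class are assumptions for illustration:

    import java.io.IOException;

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.util.DistCache;
    import org.apache.hadoop.conf.Configuration;

    public class RegisterJarsDriver {

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Ship the extra jars (e.g. the ones bundled with the crunchR package) to
        // the task nodes before the pipeline is planned; paths are illustrative.
        for (String jar : new String[] {"/opt/crunchR/java/crunchR.jar",
                                        "/opt/crunchR/java/JRI.jar"}) {
          DistCache.addJarToDistributedCache(conf, jar);
        }

        Pipeline pipeline = new MRPipeline(RegisterJarsDriver.class, conf);
        // ... build the pipeline here, then run it ...
        pipeline.done();
      }
    }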
On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:

Ok, sounds very promising...

I'll try to start digging on the driver part this week then (Pipeline wrapper in R5).

On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:

On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:

Ok, cool.

So what state is Crunch in? I take it that it is in a fairly advanced state. So every api mentioned in the FlumeJava paper is working, right? Or is there something specific that is not working?

Josh Wills:

I think the only thing in the paper that we don't have in a working state is MSCR fusion. It's mostly just a question of prioritizing it and getting the work done.

On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:

Hey Dmitriy,

Got a fork going and looking forward to playing with crunchR this weekend -- thanks!

J

On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:

Project template: https://github.com/dlyubimov/crunchR

The default profile does not compile the R artifact; the R profile does. For convenience, it is enabled by supplying -DR on the mvn command line, e.g.

    mvn install -DR

There's also a helper that installs the snapshot version of the package in the crunchR module.

There are RJava and JRI java dependencies which I did not find anywhere in public maven repos, so they are installed into my github maven repo for now. It should compile for third parties.

-DR compilation requires R, RJava and, optionally, RProtoBuf. R doc compilation requires roxygen2 (I think).

For some reason RProtoBuf fails to import into another package -- I got a weird exception when I put @import RProtoBuf into crunchR, so RProtoBuf is now in the "Suggests" category. Down the road that may be a problem though...

Other than the template, not much else has been done so far... finding the hadoop libraries and adding them to the package path on initialization via "hadoop classpath"... adding the Crunch jars and their non-"provided" transitives to crunchR's java part...

No legal stuff...

No readmes... complete stealth at this point.

On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:

Ok, cool. I will try to roll a project template by some time next week. We can start with prototyping and benchmarking something really simple, such as parallelDo().

My interim goal is to perhaps take some more or less simple algorithm from Mahout and demonstrate that it can be solved with Rcrunch (or whatever name it has to be) in comparable time (performance) but with much fewer lines of code. (Say, one of the factorization or clustering things.)
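For reference, a minimal Java Crunch parallelDo() pipeline of the kind such an R wrapper would have to generate; the input and output paths are illustrative:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class UpperCaseLines {

      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(UpperCaseLines.class);

        PCollection<String> lines = pipeline.readTextFile("/tmp/crunchR-input");

        // The piece an R-level parallelDo() would map onto: a DoFn plus a PType.
        PCollection<String> upper = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            emitter.emit(line.toUpperCase());
          }
        }, Writables.strings());

        pipeline.writeTextFile(upper, "/tmp/crunchR-output");
        pipeline.done();
      }
    }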
On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:

I am not much of an R user, but I am interested to see how well we can integrate the two. I would be happy to help.

regards,
Rahul

On 18-10-2012 04:04, Josh Wills wrote:

On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:

Yep, ok.

I imagine it has to be an R module, so I can set up a maven project with a java/R code tree (I have been doing that a lot lately). Or if you have a template to look at, that would be useful too, I guess.

Josh Wills:

No, please go right ahead.

On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:

I'd like it to be separate at first, but I am happy to help. Github repo?

On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:

Ok, maybe there's a benefit to trying a JRI/RJava prototype on top of Crunch for something simple. This should both save time and prove or disprove whether Crunch-via-RJava integration is viable.

On my part I can try to do it within the Crunch framework, or we can keep it completely separate.

-d

On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:

I am an avid R user and would be into it -- who gave the talk? Was it Murray Stokely?
On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:

Hello,

I was pretty excited to learn of Google's experience with an R mapping of flume java at one of the recent BARUGs. I think a lot of applications similar to what we do in Mahout could be prototyped using flume R.

I did not quite get the details of Google's implementation of the R mapping, but I am not sure that just a direct mapping from R to Crunch would be sufficient (and, for the most part, efficient). RJava/JRI and jni seem to be pretty terrible performers for doing that directly.

On top of that, I am thinking that if this project could have a contributed adapter to Mahout's distributed matrices, that would be a very good synergy.

Is there anyone interested in contributing to, or advising on, an open source version of flume R support? Just gauging interest; the Crunch list seems like a natural place to poke.

Thanks.

-Dmitriy
