Re: Flume R -- any interest?

Dmitriy Lyubimov Sun, 11 Nov 2012 13:39:16 -0800

Question.

So in the Crunch API, initialize() doesn't get an emitter, and process()
gets an emitter on every call.


However, my guess is that any single incarnation of a DoFn object in the
backend will always get the same emitter throughout its lifecycle. Is that
an admissible assumption, or is there currently a counterexample to it?

The problem is that as I implement the two-way pipeline of input and
emitter data between R and Java, I am bulking these calls together for
performance reasons. The individual data items in these chunks will not
have emitter information attached to them in any way. (Well, they could,
but it would be a performance killer, and I bet the emitter never
changes.)

So, thoughts? Can I assume the emitter never changes between the first and
last call to a DoFn instance?
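
To make the question concrete, here is a minimal sketch of the batching
idea, written under stated assumptions rather than against the real Crunch
classes. The Emitter interface, ListEmitter and BatchingFn below are
hypothetical stand-ins invented for illustration, and the upper-casing step
is only a placeholder for the bulk round trip to R. The function remembers
the emitter from the first process() call and fails fast if a later call
passes a different one, so a counterexample would show up as an exception
instead of silently misrouted output.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingFnSketch {

    /** Hypothetical stand-in for an emitter; not the real Crunch interface. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Collects emitted values in a list so the sketch can run locally. */
    static class ListEmitter<T> implements Emitter<T> {
        final List<T> items = new ArrayList<>();

        @Override
        public void emit(T value) {
            items.add(value);
        }
    }

    /**
     * DoFn-like class (hypothetical) that buffers inputs and processes them
     * in chunks. It captures the emitter on the first call and checks that
     * every later call passes the same instance.
     */
    static class BatchingFn {
        private final int chunkSize;
        private final List<String> buffer = new ArrayList<>();
        private Emitter<String> emitter; // captured on first use

        BatchingFn(int chunkSize) {
            this.chunkSize = chunkSize;
        }

        void process(String input, Emitter<String> e) {
            rememberOrCheck(e);
            buffer.add(input);
            if (buffer.size() >= chunkSize) {
                flushChunk();
            }
        }

        void cleanup(Emitter<String> e) {
            rememberOrCheck(e);
            flushChunk(); // emit whatever is left in the last partial chunk
        }

        private void rememberOrCheck(Emitter<String> e) {
            if (emitter == null) {
                emitter = e;
            } else if (emitter != e) {
                // The assumption under discussion did not hold.
                throw new IllegalStateException("emitter changed between calls");
            }
        }

        private void flushChunk() {
            // Placeholder for the bulk call into R: upper-case each datum.
            for (String s : buffer) {
                emitter.emit(s.toUpperCase());
            }
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        ListEmitter<String> out = new ListEmitter<>();
        BatchingFn fn = new BatchingFn(2);
        fn.process("a", out);
        fn.process("b", out); // chunk of two is flushed here
        fn.process("c", out);
        fn.cleanup(out);      // flushes the remaining "c"
        System.out.println(out.items);
    }
}
```

The identity check costs one reference comparison per call, which should
be negligible next to the R round trip, and it avoids tagging every datum
with emitter information.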

thanks.


On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:

> yes...
>
> i think it worked for me before, although just adding all jars from R
> package distribution would be a little bit more appropriate approach
> -- but it creates a problem with jars in dependent R packages. I think
> it would be much easier to just compile a hadoop-job file and stick it
> in rather than doing cherry-picking of individual jars from who knows
> how many locations.
>
> i think i used the hadoop job format with distributed cache before and
> it worked... at least with Pig "register jar" functionality.
>
> ok i guess i will just try if it works.
>
> On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:
> > On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
> >
> >> Great! so it is in Crunch.
> >>
> >> does it support hadoop-job jar format or only pure java jars?
> >>
> >
> > I think just pure jars-- you're referring to hadoop-job format as having
> > all the dependencies in a lib/ directory within the jar?
> >
> >
> >>
> >> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]>
> wrote:
> >> > On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]>
> >> wrote:
> >> >
> >> >> I think i need functionality to add more jars (or external
> hadoop-jar)
> >> >> to drive that from an R package. Just setting job jar by class is not
> >> >> enough. I can push overall job-jar as an additional jar to R package;
> >> >> however, i cannot really run hadoop command line on it, i need to set
> >> >> up classpath thru RJava.
> >> >>
> >> >> Traditional single hadoop job jar will unlikely work here since we
> >> >> cannot hardcode pipelines in java code but rather have to construct
> >> >> them on the fly. (well, we could serialize pipeline definitions from
> R
> >> >> and then replay them in a driver -- but that's too cumbersome and
> more
> >> >> work than it has to be.) There's no reason why i shouldn't be able to
> >> >> do pig-like "register jar" or "setJobJar" (mahout-like) when kicking
> >> >> off a pipeline.
> >> >>
> >> >
> >> > o.a.c.util.DistCache.addJarToDistributedCache?
> >> >
> >> >
> >> >>
> >> >>
> >> >> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <
> [email protected]>
> >> >> wrote:
> >> >> > Ok, sounds very promising...
> >> >> >
> >> >> > i'll try to start digging on the driver part this week then
> (Pipeline
> >> >> > wrapper in R5).
> >> >> >
> >> >> > On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]
> >
> >> >> wrote:
> >> >> >> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <
> [email protected]
> >> >
> >> >> wrote:
> >> >> >>> Ok, cool.
> >> >> >>>
> >> >> >>> So what state is Crunch in? I take it is in a fairly advanced
> state.
> >> >> >>> So every api mentioned in the  FlumeJava paper is working ,
> right?
> >> Or
> >> >> >>> there's something that is not working specifically?
> >> >> >>
> >> >> >> I think the only thing in the paper that we don't have in a
> working
> >> >> >> state is MSCR fusion. It's mostly just a question of prioritizing
> it
> >> >> >> and getting the work done.
> >> >> >>
> >> >> >>>
> >> >> >>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]
> >
> >> >> wrote:
> >> >> >>>> Hey Dmitriy,
> >> >> >>>>
> >> >> >>>> Got a fork going and looking forward to playing with crunchR
> this
> >> >> weekend--
> >> >> >>>> thanks!
> >> >> >>>>
> >> >> >>>> J
> >> >> >>>>
> >> >> >>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <
> >> [email protected]>
> >> >> wrote:
> >> >> >>>>
> >> >> >>>>> Project template https://github.com/dlyubimov/crunchR
> >> >> >>>>>
> >> >> >>>>> Default profile does not compile R artifact . R profile
> compiles R
> >> >> >>>>> artifact. for convenience, it is enabled by supplying -DR to
> mvn
> >> >> >>>>> command line, e.g.
> >> >> >>>>>
> >> >> >>>>> mvn install -DR
> >> >> >>>>>
> >> >> >>>>> there's also a helper that installs the snapshot version of the
> >> >> >>>>> package in the crunchR module.
> >> >> >>>>>
> >> >> >>>>> There's RJava and JRI java dependencies which i did not find
> >> anywhere
> >> >> >>>>> in public maven repos; so it is installed into my github maven
> >> repo
> >> >> so
> >> >> >>>>> far. Should compile for 3rd party.
> >> >> >>>>>
> >> >> >>>>> -DR compilation requires R, RJava and optionally, RProtoBuf. R
> Doc
> >> >> >>>>> compilation requires roxygen2 (i think).
> >> >> >>>>>
> >> >> >>>>> For some reason RProtoBuf fails to import into another package,
> >> got a
> >> >> >>>>> weird exception when i put @import RProtoBuf into crunchR, so
> >> >> >>>>> RProtoBuf is now in "Suggests" category. Down the road that may
> >> be a
> >> >> >>>>> problem though...
> >> >> >>>>>
> >> >> >>>>> other than the template, not much else has been done so far...
> >> >> finding
> >> >> >>>>> hadoop libraries and adding it to the package path on
> >> initialization
> >> >> >>>>> via "hadoop classpath"... adding Crunch jars and its
> >> non-"provided"
> >> >> >>>>> transitives to the crunchR's java part...
> >> >> >>>>>
> >> >> >>>>> No legal stuff...
> >> >> >>>>>
> >> >> >>>>> No readmes... complete stealth at this point.
> >> >> >>>>>
> >> >> >>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <
> >> >> [email protected]>
> >> >> >>>>> wrote:
> >> >> >>>>> > Ok, cool. I will try to roll project template by some time
> next
> >> >> week.
> >> >> >>>>> > we can start with prototyping and benchmarking something
> really
> >> >> >>>>> > simple, such as parallelDo().
> >> >> >>>>> >
> >> >> >>>>> > My interim goal is to perhaps take some more or less simple
> >> >> algorithm
> >> >> >>>>> > from Mahout and demonstrate it can be solved with Rcrunch (or
> >> >> whatever
> >> >> >>>>> > name it has to be) in a comparable time (performance) but
> with
> >> much
> >> >> >>>>> > fewer lines of code. (say one of factorization or clustering
> >> >> things)
> >> >> >>>>> >
> >> >> >>>>> >
> >> >> >>>>> > On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]>
> >> wrote:
> >> >> >>>>> >> I am not much of R user but I am interested to see how well
> we
> >> can
> >> >> >>>>> integrate
> >> >> >>>>> >> the two. I would be happy to help.
> >> >> >>>>> >>
> >> >> >>>>> >> regards,
> >> >> >>>>> >> Rahul
> >> >> >>>>> >>
> >> >> >>>>> >> On 18-10-2012 04:04, Josh Wills wrote:
> >> >> >>>>> >>>
> >> >> >>>>> >>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <
> >> >> [email protected]>
> >> >> >>>>> >>> wrote:
> >> >> >>>>> >>>>
> >> >> >>>>> >>>> Yep, ok.
> >> >> >>>>> >>>>
> >> >> >>>>> >>>> I imagine it has to be an R module so I can set up a maven
> >> >> project
> >> >> >>>>> >>>> with java/R code tree (I have been doing that a lot
> lately).
> >> Or
> >> >> if you
> >> >> >>>>> >>>> have a template to look at, it would be useful i guess
> too.
> >> >> >>>>> >>>
> >> >> >>>>> >>> No, please go right ahead.
> >> >> >>>>> >>>
> >> >> >>>>> >>>>
> >> >> >>>>> >>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <
> >> >> [email protected]>
> >> >> >>>>> wrote:
> >> >> >>>>> >>>>>
> >> >> >>>>> >>>>> I'd like it to be separate at first, but I am happy to
> help.
> >> >> Github
> >> >> >>>>> >>>>> repo?
> >> >> >>>>> >>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <
> >> [email protected]
> >> >> >
> >> >> >>>>> wrote:
> >> >> >>>>> >>>>>
> >> >> >>>>> >>>>>> Ok maybe there's a benefit to try a JRI/RJava prototype
> on
> >> >> top of
> >> >> >>>>> >>>>>> Crunch for something simple. This should both save time
> and
> >> >> prove or
> >> >> >>>>> >>>>>> disprove if Crunch via RJava integration is viable.
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> On my part i can try to do it within Crunch framework
> or we
> >> >> can keep
> >> >> >>>>> >>>>>> it completely separate.
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> -d
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <
> >> >> [email protected]>
> >> >> >>>>> >>>>>> wrote:
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>> I am an avid R user and would be into it-- who gave the
> >> >> talk? Was
> >> >> >>>>> it
> >> >> >>>>> >>>>>>> Murray Stokely?
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <
> >> >> >>>>> [email protected]>
> >> >> >>>>> >>>>>>
> >> >> >>>>> >>>>>> wrote:
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> Hello,
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> I was pretty excited to learn of Google's experience
> of R
> >> >> mapping
> >> >> >>>>> of
> >> >> >>>>> >>>>>>>> flume java on one of recent BARUGs. I think a lot of
> >> >> applications
> >> >> >>>>> >>>>>>>> similar to what we do in Mahout could be prototyped
> using
> >> >> flume R.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> I did not quite get the details of Google
> implementation
> >> of
> >> >> R
> >> >> >>>>> >>>>>>>> mapping,
> >> >> >>>>> >>>>>>>> but i am not sure if just a direct mapping from R to
> >> Crunch
> >> >> would
> >> >> >>>>> be
> >> >> >>>>> >>>>>>>> sufficient (and, for most part, efficient). RJava/JRI
> and
> >> >> jni
> >> >> >>>>> seem to
> >> >> >>>>> >>>>>>>> be a pretty terrible performer to do that directly.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> on top of it, I am thinking if this project could
> have a
> >> >> >>>>> contributed
> >> >> >>>>> >>>>>>>> adapter to Mahout's distributed matrices, that would
> be
> >> >> just a
> >> >> >>>>> very
> >> >> >>>>> >>>>>>>> good synergy.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> Is there anyone interested in contributing/advising
> for
> >> open
> >> >> >>>>> source
> >> >> >>>>> >>>>>>>> version of flume R support? Just gauging interest,
> Crunch
> >> >> list
> >> >> >>>>> seems
> >> >> >>>>> >>>>>>>> like a natural place to poke.
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> Thanks .
> >> >> >>>>> >>>>>>>>
> >> >> >>>>> >>>>>>>> -Dmitriy
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>>
> >> >> >>>>> >>>>>>> --
> >> >> >>>>> >>>>>>> Director of Data Science
> >> >> >>>>> >>>>>>> Cloudera
> >> >> >>>>> >>>>>>> Twitter: @josh_wills
> >> >> >>>>> >>>
> >> >> >>>>> >>>
> >> >> >>>>> >>>
> >> >> >>>>> >>
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> Director of Data Science
> >> >> >>>> Cloudera <http://www.cloudera.com>
> >> >> >>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Director of Data Science
> >> > Cloudera <http://www.cloudera.com>
> >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>
