Or RTNode? I guess I am not sure what the difference is. Bottom line: I need to run
some task startup routines (e.g. establish the exchange queues between the task and R)
and also a final cleanup just before the MR task exits and _before all outputs are
closed_ (a kind of "flush all" step).
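[Editor's note: the startup/flush-all lifecycle described above can be sketched without any Crunch types at all. Everything below -- `Emitter`, `BufferedRDoFn`, the method names -- is a hypothetical stand-in for illustration, not actual Crunch API; it only shows why the final cleanup has to run before outputs close when records are batched.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Crunch API: the point is the lifecycle shape.
interface Emitter<T> {
    void emit(T value);
}

class BufferedRDoFn {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private boolean initialized = false;

    BufferedRDoFn(int batchSize) { this.batchSize = batchSize; }

    // Task startup: e.g. establish the exchange queues between task and R.
    void initialize() { initialized = true; }

    // Buffer each datum; ship whole batches at once for performance.
    void process(String input, Emitter<String> emitter) {
        if (!initialized) initialize();
        buffer.add(input);
        if (buffer.size() >= batchSize) flush(emitter);
    }

    // "Flush all": must run before the task exits and outputs are closed,
    // otherwise the last partial batch is silently dropped.
    void cleanup(Emitter<String> emitter) { flush(emitter); }

    private void flush(Emitter<String> emitter) {
        for (String s : buffer) emitter.emit(s);
        buffer.clear();
    }
}

public class FlushAllDemo {
    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> em = out::add;
        BufferedRDoFn fn = new BufferedRDoFn(2);
        fn.process("a", em);
        fn.process("b", em);  // batch of 2 flushes here
        fn.process("c", em);  // straggler stays buffered
        fn.cleanup(em);       // final flush catches it
        System.out.println(out);  // [a, b, c]
    }
}
```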
Thanks.
-d

On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

> How do I hook into CrunchTaskContext to do a task cleanup (as opposed to a
> DoFn etc.)?
>
> On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> No, it is fully distributed testing.
>>
>> It is OK, StatET handles log4j logging for me so I see the logs. I was
>> wondering if any end-to-end diagnostics is already embedded in Crunch, but
>> reporting backend errors to the front end is notoriously hard (and sometimes
>> impossible) with Hadoop, so I assume it doesn't make sense to report
>> client-only stuff through an exception while the other stuff still requires
>> checking isSucceeded().
>>
>> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <[email protected]> wrote:
>>
>>> Are you running this using LocalJobRunner? Does calling
>>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
>>> settle a debate I'm having w/Matthias. ;-)
>>>
>>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> I see the error in the logs, but Pipeline.run() has never thrown
>>>> anything; isSucceeded() subsequently returns false. Is there any way to
>>>> extract the client-side problem rather than just being able to state that
>>>> the job failed? Or is that OK and the only diagnostics by design?
>>>>
>>>> ============
>>>> 68124 [Thread-8] INFO org.apache.crunch.impl.mr.exec.CrunchJob -
>>>> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
>>>> does not exist: hdfs://localhost:11010/crunchr-example/input
>>>>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
>>>>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
>>>>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
>>>>   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
>>>>   at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>>>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>>>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>>>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
>>>>   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
>>>>   at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
>>>>   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
>>>>   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
>>>>   at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> For Hadoop nodes, I guess yet another option is to soft-link the .so into
>>>>> Hadoop's native lib folder.
>>>>>
>>>>> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov
<[email protected] >>> >wrote: >>> >> >>> >>> I actually want to defer this to hadoop admins, we just need to >>> create a >>> >>> procedure for setting up nodes. Ideally as simple as possible. >>> something >>> >>> like >>> >>> >>> >>> 1) setup R >>> >>> 2) install.packages("rJava","RProtoBuf","crunchR") >>> >>> 3) R CMD javareconf >>> >>> 3) add result of R --vanilla <<< 'system.file("jri", >>> package="rJava") to >>> >>> either mapred command lines or LD_LIBRARY_PATH... >>> >>> >>> >>> but it will depend on their versions of hadoop, jre etc. I hoped >>> crunch >>> >>> might have something to hide a lot of that complexity (since it is >>> about >>> >>> hiding complexities, for the most part :) ) besides hadoop has a >>> way to >>> >>> ship .so's to the backend so if crunch had an api to do something >>> similar >>> >>> it is conceivable that driver might yank and ship it too to hide that >>> >>> complexity as well. But then there's a host of issues how to handle >>> >>> potentially different rJava versions installed on different nodes... >>> So, it >>> >>> increasingly looks like something we might want to defer to sysops >>> to do >>> >>> with approximate set of requirements . >>> >>> >>> >>> >>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> >>> wrote: >>> >>> >>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov < >>> [email protected]> >>> >>>> wrote: >>> >>>> >>> >>>> > so java tasks need to be able to load libjri.so from >>> >>>> > whatever system.file("jri", package="rJava") says. >>> >>>> > >>> >>>> > Traditionally, these issues were handled with -Djava.library.path. >>> >>>> > Apparently there's nothing java task can do to enable >>> loadLibrary() >>> >>>> command >>> >>>> > to see the damn library once started. But -Djava.library.path >>> requires >>> >>>> for >>> >>>> > nodes to configure and lock jvm command line from modifications >>> of the >>> >>>> > client. which is fine. 
>>>>>>>> I also discovered that LD_LIBRARY_PATH actually works with JRE 1.6
>>>>>>>> (again).
>>>>>>>>
>>>>>>>> But... any other suggestions about best practice for configuring Crunch
>>>>>>>> to run users' .so's?
>>>>>>>
>>>>>>> Not off the top of my head. I suspect that whatever you come up with will
>>>>>>> become the "best practice." :)
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I believe that is a safe assumption, at least right now.
>>>>>>>>>
>>>>>>>>> On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Question.
>>>>>>>>>>
>>>>>>>>>> So in the Crunch API, initialize() doesn't get an emitter, and
>>>>>>>>>> process() gets an emitter every time.
>>>>>>>>>>
>>>>>>>>>> However, my guess is that any single reincarnation of a DoFn object in
>>>>>>>>>> the backend will always be getting the same emitter through its
>>>>>>>>>> lifecycle. Is that an admissible assumption, or is there currently a
>>>>>>>>>> counterexample to it?
>>>>>>>>>>
>>>>>>>>>> The problem is that as I implement the two-way pipeline of input and
>>>>>>>>>> emitter data between R and Java, I am bulking these calls together for
>>>>>>>>>> performance reasons. Each individual datum in these chunks of data
>>>>>>>>>> will not have emitter function information attached to it in any way.
>>>>>>>>>> (Well, it could, but it would be a performance killer, and I bet the
>>>>>>>>>> emitter never changes.)
>>>>>>>>>>
>>>>>>>>>> So, thoughts? Can I assume the emitter never changes between the first
>>>>>>>>>> and last call to a DoFn instance?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes...
>>>>>>>>>>>
>>>>>>>>>>> I think it worked for me before, although just adding all jars from
>>>>>>>>>>> the R package distribution would be a slightly more appropriate
>>>>>>>>>>> approach -- but it creates a problem with jars in dependent R
>>>>>>>>>>> packages. I think it would be much easier to just compile a
>>>>>>>>>>> hadoop-job file and stick it in rather than cherry-picking individual
>>>>>>>>>>> jars from who knows how many locations.
>>>>>>>>>>>
>>>>>>>>>>> I think I used the hadoop-job format with the distributed cache
>>>>>>>>>>> before and it worked... at least with Pig's "register jar"
>>>>>>>>>>> functionality.
>>>>>>>>>>>
>>>>>>>>>>> OK, I guess I will just try whether it works.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Great! So it is in Crunch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does it support the hadoop-job jar format or only pure Java jars?
>>>>>>>>>>>>
>>>>>>>>>>>> I think just pure jars -- you're referring to the hadoop-job format
>>>>>>>>>>>> as having all the dependencies in a lib/ directory within the jar?
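[Editor's note: backing up to the emitter-stability question above -- if a batching DoFn must rely on "same emitter from first to last call," it can at least verify that assumption defensively. This is a self-contained sketch; `Emitter` and `RBatchingDoFn` are hypothetical stand-ins, not Crunch types.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Crunch API. The point is the defensive check:
// cache the emitter seen on the first process() call and fail fast if a
// later call ever hands us a different one, since the batched R<->Java
// exchange doesn't tag individual records with their emitter.
interface Emitter<T> {
    void emit(T value);
}

class RBatchingDoFn {
    private Emitter<String> cachedEmitter;

    void process(String input, Emitter<String> emitter) {
        if (cachedEmitter == null) {
            cachedEmitter = emitter;            // first call: remember it
        } else if (cachedEmitter != emitter) {  // assumption violated
            throw new IllegalStateException("emitter changed mid-lifecycle");
        }
        // In the real flow the record would be queued for a bulk round-trip
        // to R; here we just transform and emit directly.
        cachedEmitter.emit(input.toUpperCase());
    }
}

public class EmitterCheckDemo {
    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> em = out::add;
        RBatchingDoFn fn = new RBatchingDoFn();
        fn.process("a", em);
        fn.process("b", em);  // same emitter instance: fine
        System.out.println(out);  // [A, B]
    }
}
```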
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think I need functionality to add more jars (or an external
>>>>>>>>>>>>>>> hadoop-jar) to drive that from an R package. Just setting the job
>>>>>>>>>>>>>>> jar by class is not enough. I can push the overall job-jar as an
>>>>>>>>>>>>>>> additional jar to the R package; however, I cannot really run the
>>>>>>>>>>>>>>> hadoop command line on it -- I need to set up the classpath
>>>>>>>>>>>>>>> through rJava.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A traditional single hadoop job jar will unlikely work here,
>>>>>>>>>>>>>>> since we cannot hardcode pipelines in Java code but rather have
>>>>>>>>>>>>>>> to construct them on the fly. (Well, we could serialize pipeline
>>>>>>>>>>>>>>> definitions from R and then replay them in a driver -- but that's
>>>>>>>>>>>>>>> too cumbersome and more work than it has to be.) There's no
>>>>>>>>>>>>>>> reason why I shouldn't be able to do a Pig-like "register jar" or
>>>>>>>>>>>>>>> a setJobJar (Mahout-like) when kicking off a pipeline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> o.a.c.util.DistCache.addJarToDistributedCache?
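[Editor's note: the "hadoop-job" jar layout discussed above -- dependencies nested under lib/ inside the outer jar -- can be poked at with plain java.util.jar. This self-contained demo builds a toy job jar with two empty nested jars and lists them, which is the enumeration step any tool shipping those dependencies (e.g. to the distributed cache) would need; file names are made up for the demo.]

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Demo of the "hadoop-job" jar layout: dependency jars live under lib/
// inside the outer jar.
public class JobJarDemo {

    static List<String> listNestedJars(File jobJar) throws Exception {
        List<String> nested = new ArrayList<>();
        try (JarFile jar = new JarFile(jobJar)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.startsWith("lib/") && name.endsWith(".jar")) {
                    nested.add(name);
                }
            }
        }
        return nested;
    }

    public static void main(String[] args) throws Exception {
        // Build a toy job jar containing two (empty) nested dependency jars.
        File jobJar = File.createTempFile("job", ".jar");
        jobJar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jobJar))) {
            for (String dep : new String[] {"lib/dep-a.jar", "lib/dep-b.jar"}) {
                out.putNextEntry(new JarEntry(dep));
                out.closeEntry();  // empty payload is enough for the demo
            }
        }
        System.out.println(listNestedJars(jobJar));  // both lib/ entries
    }
}
```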
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OK, sounds very promising...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll try to start digging on the driver part this week then
>>>>>>>>>>>>>>>> (the Pipeline wrapper in R5).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK, cool.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So what state is Crunch in? I take it it is in a fairly
>>>>>>>>>>>>>>>>>> advanced state. So every API mentioned in the FlumeJava paper
>>>>>>>>>>>>>>>>>> is working, right? Or is there something specific that is not
>>>>>>>>>>>>>>>>>> working?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think the only thing in the paper that we don't have in a
>>>>>>>>>>>>>>>>> working state is MSCR fusion. It's mostly just a question of
>>>>>>>>>>>>>>>>> prioritizing it and getting the work done.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey Dmitriy,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Got a fork going and looking forward to playing with crunchR
>>>>>>>>>>>>>>>>>>> this weekend -- thanks!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Project template: https://github.com/dlyubimov/crunchR
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The default profile does not compile the R artifact; the R
>>>>>>>>>>>>>>>>>>>> profile does. For convenience, it is enabled by supplying
>>>>>>>>>>>>>>>>>>>> -DR to the mvn command line, e.g.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> mvn install -DR
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There's also a helper that installs the snapshot version of
>>>>>>>>>>>>>>>>>>>> the package in the crunchR module.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There are rJava and JRI Java dependencies which I did not
>>>>>>>>>>>>>>>>>>>> find anywhere in public Maven repos, so they are installed
>>>>>>>>>>>>>>>>>>>> into my GitHub Maven repo so far. Should compile for third
>>>>>>>>>>>>>>>>>>>> parties.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -DR compilation requires R, rJava and, optionally,
>>>>>>>>>>>>>>>>>>>> RProtoBuf. R doc compilation requires roxygen2 (I think).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For some reason RProtoBuf fails to import into another
>>>>>>>>>>>>>>>>>>>> package -- I got a weird exception when I put @import
>>>>>>>>>>>>>>>>>>>> RProtoBuf into crunchR, so RProtoBuf is now in the
>>>>>>>>>>>>>>>>>>>> "Suggests" category. Down the road that may be a problem,
>>>>>>>>>>>>>>>>>>>> though...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Other than the template, not much else has been done so
>>>>>>>>>>>>>>>>>>>> far... finding the Hadoop libraries and adding them to the
>>>>>>>>>>>>>>>>>>>> package path on initialization via "hadoop classpath"...
>>>>>>>>>>>>>>>>>>>> adding the Crunch jars and their non-"provided" transitives
>>>>>>>>>>>>>>>>>>>> to crunchR's Java part...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No legal stuff...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No readmes... complete stealth at this point.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> OK, cool. I will try to roll a project template by some
>>>>>>>>>>>>>>>>>>>>> time next week. We can start with prototyping and
>>>>>>>>>>>>>>>>>>>>> benchmarking something really simple, such as parallelDo().
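[Editor's note: as a reference point for the "prototype parallelDo() first" plan above, here is a toy, in-memory model of parallelDo() semantics: apply a DoFn-like function to every element and collect whatever it emits (zero, one, or many outputs per input). None of these types are Crunch's -- it's just the shape of the API, run sequentially.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, not Crunch API.
interface Emitter<T> {
    void emit(T value);
}

interface DoFn<S, T> {
    void process(S input, Emitter<T> emitter);
}

public class ParallelDoSketch {

    // Sequential here; the real thing distributes this over MR tasks.
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S s : input) {
            fn.process(s, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        // Split lines into words: one input can emit many outputs.
        List<String> lines = Arrays.asList("a b", "c");
        List<String> words = parallelDo(lines, (line, em) -> {
            for (String w : line.split(" ")) em.emit(w);
        });
        System.out.println(words);  // [a, b, c]
    }
}
```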
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> My interim goal is to perhaps take some more or less
>>>>>>>>>>>>>>>>>>>>> simple algorithm from Mahout and demonstrate that it can
>>>>>>>>>>>>>>>>>>>>> be solved with Rcrunch (or whatever name it has to be) in
>>>>>>>>>>>>>>>>>>>>> comparable time (performance) but with much fewer lines of
>>>>>>>>>>>>>>>>>>>>> code (say, one of the factorization or clustering things).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am not much of an R user, but I am interested to see
>>>>>>>>>>>>>>>>>>>>>> how well we can integrate the two. I would be happy to
>>>>>>>>>>>>>>>>>>>>>> help.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 18-10-2012 04:04, Josh Wills wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yep, OK.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I imagine it has to be an R module, so I can set up a
>>>>>>>>>>>>>>>>>>>>>>>> Maven project with a java/R code tree (I have been
>>>>>>>>>>>>>>>>>>>>>>>> doing that a lot lately). Or if you have a template to
>>>>>>>>>>>>>>>>>>>>>>>> look at, it would be useful too, I guess.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> No, please go right ahead.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'd like it to be separate at first, but I am happy to
>>>>>>>>>>>>>>>>>>>>>>>>> help. Github repo?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> OK, maybe there's a benefit to trying a JRI/RJava
>>>>>>>>>>>>>>>>>>>>>>>>>> prototype on top of Crunch for something simple. This
>>>>>>>>>>>>>>>>>>>>>>>>>> should both save time and prove or disprove whether
>>>>>>>>>>>>>>>>>>>>>>>>>> Crunch-via-rJava integration is viable.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On my part I can try to do it within the Crunch
>>>>>>>>>>>>>>>>>>>>>>>>>> framework, or we can keep it completely separate.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> -d
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am an avid R user and would be into it -- who gave
>>>>>>>>>>>>>>>>>>>>>>>>>>> the talk? Was it Murray Stokely?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I was pretty excited to learn of Google's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> experience with an R mapping of FlumeJava at one of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the recent BARUGs. I think a lot of applications
>>>>>>>>>>>>>>>>>>>>>>>>>>>> similar to what we do in Mahout could be prototyped
>>>>>>>>>>>>>>>>>>>>>>>>>>>> using flume R.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I did not quite get the details of Google's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation of the R mapping, but I am not sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a direct mapping from R to Crunch would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sufficient (and, for the most part, efficient).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> RJava/JRI and JNI seem to be pretty terrible
>>>>>>>>>>>>>>>>>>>>>>>>>>>> performers for doing that directly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On top of it, I am thinking that if this project
>>>>>>>>>>>>>>>>>>>>>>>>>>>> could have a contributed adapter to Mahout's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> distributed matrices, that would be just a very
>>>>>>>>>>>>>>>>>>>>>>>>>>>> good synergy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there anyone interested in contributing/advising
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for an open-source version of flume R support? Just
>>>>>>>>>>>>>>>>>>>>>>>>>>>> gauging interest; the Crunch list seems like a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> natural place to poke.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Dmitriy
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>> Director of Data Science
>>>>>>>>>>>>>>>>>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
