Or RTNode? I guess I am not sure what the difference is. Bottom line: I need to run
some task startup routines (e.g. establish the exchange queues between the task and R)
and also a final cleanup just before the MR task exits and _before all outputs are
closed_ (a kind of "flush all" step).
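[Editor's note: the startup/flush-all lifecycle described above can be sketched without any Crunch types at all. Everything below -- `Emitter`, `BufferedRDoFn`, the method names -- is a hypothetical stand-in for illustration, not actual Crunch API; it only shows why the final cleanup has to run before outputs close when records are batched.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Crunch API: the point is the lifecycle shape.
interface Emitter<T> {
    void emit(T value);
}

class BufferedRDoFn {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private boolean initialized = false;

    BufferedRDoFn(int batchSize) { this.batchSize = batchSize; }

    // Task startup: e.g. establish the exchange queues between task and R.
    void initialize() { initialized = true; }

    // Buffer each datum; ship whole batches at once for performance.
    void process(String input, Emitter<String> emitter) {
        if (!initialized) initialize();
        buffer.add(input);
        if (buffer.size() >= batchSize) flush(emitter);
    }

    // "Flush all": must run before the task exits and outputs are closed,
    // otherwise the last partial batch is silently dropped.
    void cleanup(Emitter<String> emitter) { flush(emitter); }

    private void flush(Emitter<String> emitter) {
        for (String s : buffer) emitter.emit(s);
        buffer.clear();
    }
}

public class FlushAllDemo {
    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> em = out::add;
        BufferedRDoFn fn = new BufferedRDoFn(2);
        fn.process("a", em);
        fn.process("b", em);  // batch of 2 flushes here
        fn.process("c", em);  // straggler stays buffered
        fn.cleanup(em);       // final flush catches it
        System.out.println(out);  // [a, b, c]
    }
}
```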
Thanks.
-d

On Fri, Nov 16, 2012 at 3:04 PM, Dmitriy Lyubimov <[email protected]> wrote:

> How do I hook into CrunchTaskContext to do a task cleanup (as opposed to a
> DoFn etc.)?
>
> On Fri, Nov 16, 2012 at 2:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> No, it is fully distributed testing.
>>
>> It is OK, StatET handles log4j logging for me so I see the logs. I was
>> wondering if any end-to-end diagnostics is already embedded in Crunch, but
>> reporting backend errors to the front end is notoriously hard (and sometimes
>> impossible) with Hadoop, so I assume it doesn't make sense to report
>> client-only stuff through an exception while the other stuff still requires
>> checking isSucceeded().
>>
>> On Fri, Nov 16, 2012 at 11:07 AM, Josh Wills <[email protected]> wrote:
>>
>>> Are you running this using LocalJobRunner? Does calling
>>> Pipeline.enableDebug() before run() help? If it doesn't, it'll help
>>> settle a debate I'm having w/Matthias. ;-)
>>>
>>> On Fri, Nov 16, 2012 at 10:22 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> I see the error in the logs, but Pipeline.run() has never thrown
>>>> anything; isSucceeded() subsequently returns false. Is there any way to
>>>> extract the client-side problem rather than just being able to state that
>>>> the job failed? Or is that OK and the only diagnostics by design?
>>>>
>>>> ============
>>>> 68124 [Thread-8] INFO org.apache.crunch.impl.mr.exec.CrunchJob -
>>>> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
>>>> does not exist: hdfs://localhost:11010/crunchr-example/input
>>>>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
>>>>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
>>>>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
>>>>   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
>>>>   at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
>>>>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>>>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>>>>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
>>>>   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:331)
>>>>   at org.apache.crunch.impl.mr.exec.CrunchJob.submit(CrunchJob.java:135)
>>>>   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:251)
>>>>   at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.run(CrunchJobControl.java:279)
>>>>   at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> On Mon, Nov 12, 2012 at 5:41 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> For Hadoop nodes, I guess yet another option is to soft-link the .so into
>>>>> Hadoop's native lib folder.
>>>>>
>>>>> On Mon, Nov 12, 2012 at 5:37 PM, Dmitriy Lyubimov
<[email protected] >>> >wrote: >>> >> >>> >>> I actually want to defer this to hadoop admins, we just need to >>> create a >>> >>> procedure for setting up nodes. Ideally as simple as possible. >>> something >>> >>> like >>> >>> >>> >>> 1) setup R >>> >>> 2) install.packages("rJava","RProtoBuf","crunchR") >>> >>> 3) R CMD javareconf >>> >>> 3) add result of R --vanilla <<< 'system.file("jri", >>> package="rJava") to >>> >>> either mapred command lines or LD_LIBRARY_PATH... >>> >>> >>> >>> but it will depend on their versions of hadoop, jre etc. I hoped >>> crunch >>> >>> might have something to hide a lot of that complexity (since it is >>> about >>> >>> hiding complexities, for the most part :) ) besides hadoop has a >>> way to >>> >>> ship .so's to the backend so if crunch had an api to do something >>> similar >>> >>> it is conceivable that driver might yank and ship it too to hide that >>> >>> complexity as well. But then there's a host of issues how to handle >>> >>> potentially different rJava versions installed on different nodes... >>> So, it >>> >>> increasingly looks like something we might want to defer to sysops >>> to do >>> >>> with approximate set of requirements . >>> >>> >>> >>> >>> >>> On Mon, Nov 12, 2012 at 5:29 PM, Josh Wills <[email protected]> >>> wrote: >>> >>> >>> >>>> On Mon, Nov 12, 2012 at 5:17 PM, Dmitriy Lyubimov < >>> [email protected]> >>> >>>> wrote: >>> >>>> >>> >>>> > so java tasks need to be able to load libjri.so from >>> >>>> > whatever system.file("jri", package="rJava") says. >>> >>>> > >>> >>>> > Traditionally, these issues were handled with -Djava.library.path. >>> >>>> > Apparently there's nothing java task can do to enable >>> loadLibrary() >>> >>>> command >>> >>>> > to see the damn library once started. But -Djava.library.path >>> requires >>> >>>> for >>> >>>> > nodes to configure and lock jvm command line from modifications >>> of the >>> >>>> > client. which is fine. 
>>>>>>>> I also discovered that LD_LIBRARY_PATH actually works with JRE 1.6
>>>>>>>> (again).
>>>>>>>>
>>>>>>>> But... any other suggestions about best practice for configuring Crunch
>>>>>>>> to run users' .so's?
>>>>>>>
>>>>>>> Not off the top of my head. I suspect that whatever you come up with will
>>>>>>> become the "best practice." :)
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Sun, Nov 11, 2012 at 1:41 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I believe that is a safe assumption, at least right now.
>>>>>>>>>
>>>>>>>>> On Sun, Nov 11, 2012 at 1:38 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Question.
>>>>>>>>>>
>>>>>>>>>> So in the Crunch API, initialize() doesn't get an emitter, and
>>>>>>>>>> process() gets an emitter every time.
>>>>>>>>>>
>>>>>>>>>> However, my guess is that any single reincarnation of a DoFn object in
>>>>>>>>>> the backend will always be getting the same emitter through its
>>>>>>>>>> lifecycle. Is that an admissible assumption, or is there currently a
>>>>>>>>>> counterexample to it?
>>>>>>>>>>
>>>>>>>>>> The problem is that as I implement the two-way pipeline of input and
>>>>>>>>>> emitter data between R and Java, I am bulking these calls together for
>>>>>>>>>> performance reasons. Each individual datum in these chunks of data
>>>>>>>>>> will not have emitter function information attached to it in any way.
>>>>>>>>>> (Well, it could, but it would be a performance killer, and I bet the
>>>>>>>>>> emitter never changes.)
>>>>>>>>>>
>>>>>>>>>> So, thoughts? Can I assume the emitter never changes between the first
>>>>>>>>>> and last call to a DoFn instance?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 29, 2012 at 6:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes...
>>>>>>>>>>>
>>>>>>>>>>> I think it worked for me before, although just adding all jars from
>>>>>>>>>>> the R package distribution would be a slightly more appropriate
>>>>>>>>>>> approach -- but it creates a problem with jars in dependent R
>>>>>>>>>>> packages. I think it would be much easier to just compile a
>>>>>>>>>>> hadoop-job file and stick it in rather than cherry-picking individual
>>>>>>>>>>> jars from who knows how many locations.
>>>>>>>>>>>
>>>>>>>>>>> I think I used the hadoop-job format with the distributed cache
>>>>>>>>>>> before and it worked... at least with Pig's "register jar"
>>>>>>>>>>> functionality.
>>>>>>>>>>>
>>>>>>>>>>> OK, I guess I will just try whether it works.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 29, 2012 at 6:24 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 29, 2012 at 5:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Great! So it is in Crunch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does it support the hadoop-job jar format or only pure Java jars?
>>>>>>>>>>>>
>>>>>>>>>>>> I think just pure jars -- you're referring to the hadoop-job format
>>>>>>>>>>>> as having all the dependencies in a lib/ directory within the jar?
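[Editor's note: backing up to the emitter-stability question above -- if a batching DoFn must rely on "same emitter from first to last call," it can at least verify that assumption defensively. This is a self-contained sketch; `Emitter` and `RBatchingDoFn` are hypothetical stand-ins, not Crunch types.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Crunch API. The point is the defensive check:
// cache the emitter seen on the first process() call and fail fast if a
// later call ever hands us a different one, since the batched R<->Java
// exchange doesn't tag individual records with their emitter.
interface Emitter<T> {
    void emit(T value);
}

class RBatchingDoFn {
    private Emitter<String> cachedEmitter;

    void process(String input, Emitter<String> emitter) {
        if (cachedEmitter == null) {
            cachedEmitter = emitter;            // first call: remember it
        } else if (cachedEmitter != emitter) {  // assumption violated
            throw new IllegalStateException("emitter changed mid-lifecycle");
        }
        // In the real flow the record would be queued for a bulk round-trip
        // to R; here we just transform and emit directly.
        cachedEmitter.emit(input.toUpperCase());
    }
}

public class EmitterCheckDemo {
    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        Emitter<String> em = out::add;
        RBatchingDoFn fn = new RBatchingDoFn();
        fn.process("a", em);
        fn.process("b", em);  // same emitter instance: fine
        System.out.println(out);  // [A, B]
    }
}
```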
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 29, 2012 at 5:10 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 29, 2012 at 5:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think I need functionality to add more jars (or an external
>>>>>>>>>>>>>>> hadoop-jar) to drive that from an R package. Just setting the job
>>>>>>>>>>>>>>> jar by class is not enough. I can push the overall job-jar as an
>>>>>>>>>>>>>>> additional jar to the R package; however, I cannot really run the
>>>>>>>>>>>>>>> hadoop command line on it -- I need to set up the classpath
>>>>>>>>>>>>>>> through rJava.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A traditional single hadoop job jar will unlikely work here,
>>>>>>>>>>>>>>> since we cannot hardcode pipelines in Java code but rather have
>>>>>>>>>>>>>>> to construct them on the fly. (Well, we could serialize pipeline
>>>>>>>>>>>>>>> definitions from R and then replay them in a driver -- but that's
>>>>>>>>>>>>>>> too cumbersome and more work than it has to be.) There's no
>>>>>>>>>>>>>>> reason why I shouldn't be able to do a Pig-like "register jar" or
>>>>>>>>>>>>>>> a setJobJar (Mahout-like) when kicking off a pipeline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> o.a.c.util.DistCache.addJarToDistributedCache?
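[Editor's note: the "hadoop-job" jar layout discussed above -- dependencies nested under lib/ inside the outer jar -- can be poked at with plain java.util.jar. This self-contained demo builds a toy job jar with two empty nested jars and lists them, which is the enumeration step any tool shipping those dependencies (e.g. to the distributed cache) would need; file names are made up for the demo.]

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Demo of the "hadoop-job" jar layout: dependency jars live under lib/
// inside the outer jar.
public class JobJarDemo {

    static List<String> listNestedJars(File jobJar) throws Exception {
        List<String> nested = new ArrayList<>();
        try (JarFile jar = new JarFile(jobJar)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.startsWith("lib/") && name.endsWith(".jar")) {
                    nested.add(name);
                }
            }
        }
        return nested;
    }

    public static void main(String[] args) throws Exception {
        // Build a toy job jar containing two (empty) nested dependency jars.
        File jobJar = File.createTempFile("job", ".jar");
        jobJar.deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jobJar))) {
            for (String dep : new String[] {"lib/dep-a.jar", "lib/dep-b.jar"}) {
                out.putNextEntry(new JarEntry(dep));
                out.closeEntry();  // empty payload is enough for the demo
            }
        }
        System.out.println(listNestedJars(jobJar));  // both lib/ entries
    }
}
```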
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 29, 2012 at 10:17 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OK, sounds very promising...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll try to start digging on the driver part this week then
>>>>>>>>>>>>>>>> (the Pipeline wrapper in R5).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 28, 2012 at 11:56 AM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 26, 2012 at 2:40 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK, cool.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So what state is Crunch in? I take it it is in a fairly
>>>>>>>>>>>>>>>>>> advanced state. So every API mentioned in the FlumeJava paper
>>>>>>>>>>>>>>>>>> is working, right? Or is there something specific that is not
>>>>>>>>>>>>>>>>>> working?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think the only thing in the paper that we don't have in a
>>>>>>>>>>>>>>>>> working state is MSCR fusion. It's mostly just a question of
>>>>>>>>>>>>>>>>> prioritizing it and getting the work done.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 26, 2012 at 2:31 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey Dmitriy,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Got a fork going and looking forward to playing with crunchR
>>>>>>>>>>>>>>>>>>> this weekend -- thanks!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Oct 24, 2012 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Project template: https://github.com/dlyubimov/crunchR
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The default profile does not compile the R artifact; the R
>>>>>>>>>>>>>>>>>>>> profile does. For convenience, it is enabled by supplying
>>>>>>>>>>>>>>>>>>>> -DR to the mvn command line, e.g.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> mvn install -DR
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There's also a helper that installs the snapshot version of
>>>>>>>>>>>>>>>>>>>> the package in the crunchR module.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There are rJava and JRI Java dependencies which I did not
>>>>>>>>>>>>>>>>>>>> find anywhere in public Maven repos, so they are installed
>>>>>>>>>>>>>>>>>>>> into my GitHub Maven repo so far. Should compile for third
>>>>>>>>>>>>>>>>>>>> parties.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -DR compilation requires R, rJava and, optionally,
>>>>>>>>>>>>>>>>>>>> RProtoBuf. R doc compilation requires roxygen2 (I think).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For some reason RProtoBuf fails to import into another
>>>>>>>>>>>>>>>>>>>> package -- I got a weird exception when I put @import
>>>>>>>>>>>>>>>>>>>> RProtoBuf into crunchR, so RProtoBuf is now in the
>>>>>>>>>>>>>>>>>>>> "Suggests" category. Down the road that may be a problem,
>>>>>>>>>>>>>>>>>>>> though...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Other than the template, not much else has been done so
>>>>>>>>>>>>>>>>>>>> far... finding the Hadoop libraries and adding them to the
>>>>>>>>>>>>>>>>>>>> package path on initialization via "hadoop classpath"...
>>>>>>>>>>>>>>>>>>>> adding the Crunch jars and their non-"provided" transitives
>>>>>>>>>>>>>>>>>>>> to crunchR's Java part...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No legal stuff...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No readmes... complete stealth at this point.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Oct 18, 2012 at 12:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> OK, cool. I will try to roll a project template by some
>>>>>>>>>>>>>>>>>>>>> time next week. We can start with prototyping and
>>>>>>>>>>>>>>>>>>>>> benchmarking something really simple, such as parallelDo().
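[Editor's note: as a reference point for the "prototype parallelDo() first" plan above, here is a toy, in-memory model of parallelDo() semantics: apply a DoFn-like function to every element and collect whatever it emits (zero, one, or many outputs per input). None of these types are Crunch's -- it's just the shape of the API, run sequentially.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, not Crunch API.
interface Emitter<T> {
    void emit(T value);
}

interface DoFn<S, T> {
    void process(S input, Emitter<T> emitter);
}

public class ParallelDoSketch {

    // Sequential here; the real thing distributes this over MR tasks.
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S s : input) {
            fn.process(s, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        // Split lines into words: one input can emit many outputs.
        List<String> lines = Arrays.asList("a b", "c");
        List<String> words = parallelDo(lines, (line, em) -> {
            for (String w : line.split(" ")) em.emit(w);
        });
        System.out.println(words);  // [a, b, c]
    }
}
```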
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> My interim goal is to perhaps take some more or less
>>>>>>>>>>>>>>>>>>>>> simple algorithm from Mahout and demonstrate that it can
>>>>>>>>>>>>>>>>>>>>> be solved with Rcrunch (or whatever name it has to be) in
>>>>>>>>>>>>>>>>>>>>> comparable time (performance) but with much fewer lines of
>>>>>>>>>>>>>>>>>>>>> code (say, one of the factorization or clustering things).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 10:24 PM, Rahul <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am not much of an R user, but I am interested to see
>>>>>>>>>>>>>>>>>>>>>> how well we can integrate the two. I would be happy to
>>>>>>>>>>>>>>>>>>>>>> help.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 18-10-2012 04:04, Josh Wills wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 3:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yep, OK.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I imagine it has to be an R module, so I can set up a
>>>>>>>>>>>>>>>>>>>>>>>> Maven project with a java/R code tree (I have been
>>>>>>>>>>>>>>>>>>>>>>>> doing that a lot lately). Or if you have a template to
>>>>>>>>>>>>>>>>>>>>>>>> look at, it would be useful too, I guess.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> No, please go right ahead.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 3:02 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'd like it to be separate at first, but I am happy to
>>>>>>>>>>>>>>>>>>>>>>>>> help. Github repo?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Oct 17, 2012 2:57 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> OK, maybe there's a benefit to trying a JRI/RJava
>>>>>>>>>>>>>>>>>>>>>>>>>> prototype on top of Crunch for something simple. This
>>>>>>>>>>>>>>>>>>>>>>>>>> should both save time and prove or disprove whether
>>>>>>>>>>>>>>>>>>>>>>>>>> Crunch-via-rJava integration is viable.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On my part I can try to do it within the Crunch
>>>>>>>>>>>>>>>>>>>>>>>>>> framework, or we can keep it completely separate.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> -d
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 2:08 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I am an avid R user and would be into it -- who gave
>>>>>>>>>>>>>>>>>>>>>>>>>>> the talk? Was it Murray Stokely?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Oct 17, 2012 at 2:05 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I was pretty excited to learn of Google's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> experience with an R mapping of FlumeJava at one of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the recent BARUGs. I think a lot of applications
>>>>>>>>>>>>>>>>>>>>>>>>>>>> similar to what we do in Mahout could be prototyped
>>>>>>>>>>>>>>>>>>>>>>>>>>>> using flume R.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I did not quite get the details of Google's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation of the R mapping, but I am not sure
>>>>>>>>>>>>>>>>>>>>>>>>>>>> a direct mapping from R to Crunch would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sufficient (and, for the most part, efficient).
>>>>>>>>>>>>>>>>>>>>>>>>>>>> RJava/JRI and JNI seem to be pretty terrible
>>>>>>>>>>>>>>>>>>>>>>>>>>>> performers for doing that directly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On top of it, I am thinking that if this project
>>>>>>>>>>>>>>>>>>>>>>>>>>>> could have a contributed adapter to Mahout's
>>>>>>>>>>>>>>>>>>>>>>>>>>>> distributed matrices, that would be just a very
>>>>>>>>>>>>>>>>>>>>>>>>>>>> good synergy.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there anyone interested in contributing/advising
>>>>>>>>>>>>>>>>>>>>>>>>>>>> for an open-source version of flume R support? Just
>>>>>>>>>>>>>>>>>>>>>>>>>>>> gauging interest; the Crunch list seems like a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> natural place to poke.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> -Dmitriy
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>>> Director of Data Science
>>>>>>>>>>>>>>>>>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
