[jira] [Commented] (CRUNCH-296) Support new distributed execution engines (e.g., Spark)

Josh Wills (JIRA) Mon, 18 Nov 2013 10:03:46 -0800

    [ 
https://issues.apache.org/jira/browse/CRUNCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825538#comment-13825538
 ]


Josh Wills commented on CRUNCH-296:
-----------------------------------

Thanks Tom. My goal for the project is to be useful to MapReduce developers, 
and I suspect that many MapReduce developers are going to become Tez/Spark 
developers in the coming years. I think that anything we can do to smooth those 
transitions and ensure that they can easily select the right framework for the 
job at hand is a worthwhile goal for this community.

> Support new distributed execution engines (e.g., Spark)
> -------------------------------------------------------
>
>                 Key: CRUNCH-296
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-296
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-296.patch
>
>
> I've been working on this off and on for a while, but it's currently in a 
> state where I feel like it's worth sharing: I came up with an implementation 
> of the Crunch APIs that runs on top of Apache Spark instead of MapReduce.
> My goal for this is pretty simple; I want to be able to change any instances 
> of "new MRPipeline(...)" to "new SparkPipeline(...)", not change anything 
> else at all, and have my pipelines run on Spark instead of as a series of MR 
> jobs. Turns out that we can pretty much do exactly that. Not everything works 
> yet, but lots of things do: joins and cogroups work, and the PageRank and 
> TfIdf integration tests work. Some things that do not work that I'm aware of: 
> in-memory joins and some of the more complex file output handling rules, but 
> I believe that these things are fixable. Some things that might or might not 
> work: HBase inputs and outputs on top of Spark.
> This is just an idea I had, and I would understand if other people don't want 
> to work on this or don't think it's the right direction for the project. My 
> minimal request would be to include the refactoring of the core APIs 
> necessary to support plugging in new execution frameworks so I can keep 
> working on this stuff.
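
To make the "change only the constructor" idea in the description above concrete, here is a minimal, self-contained sketch in plain Java. It does not use Crunch or Spark at all: the Pipeline interface and the LoopEngine and StreamEngine classes are hypothetical stand-ins invented for illustration, and the real Crunch interfaces are considerably richer. The only point is that job code written against an engine-neutral interface is untouched when the engine is swapped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine-neutral API (NOT the real Crunch Pipeline interface).
interface Pipeline {
    // Apply fn to every element of input and return the results in order.
    <T, R> List<R> run(List<T> input, Function<T, R> fn);
}

// Hypothetical engine #1: a plain loop, standing in for one execution backend.
final class LoopEngine implements Pipeline {
    @Override
    public <T, R> List<R> run(List<T> input, Function<T, R> fn) {
        List<R> out = new ArrayList<>(input.size());
        for (T item : input) {
            out.add(fn.apply(item));
        }
        return out;
    }
}

// Hypothetical engine #2: java.util.stream, standing in for a second backend.
final class StreamEngine implements Pipeline {
    @Override
    public <T, R> List<R> run(List<T> input, Function<T, R> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }
}

// "User" job code: depends only on the Pipeline interface, never on an engine.
final class WordLengths {
    static List<Integer> compute(Pipeline pipeline, List<String> words) {
        return pipeline.run(words, String::length);
    }
}

final class PipelineSwapDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("map", "reduce", "spark");

        // Swapping engines changes only the constructor call on this line.
        Pipeline pipeline = new LoopEngine(); // or: new StreamEngine()

        System.out.println(WordLengths.compute(pipeline, words)); // prints [3, 6, 5]
    }
}
```

Under these assumptions, replacing "new LoopEngine()" with "new StreamEngine()" is the whole migration, which is the same shape as the MRPipeline-to-SparkPipeline swap described in the issue. Whether the actual patch achieves this for every Crunch feature is exactly what the description hedges on.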



--
This message was sent by Atlassian JIRA
(v6.1#6144)
