[ https://issues.apache.org/jira/browse/CRUNCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825474#comment-13825474 ]

Tom White commented on CRUNCH-296:
----------------------------------

This sounds like a great addition!

Regarding whether it fits in the project, I think it does. MapReduce is the 
workhorse, and I can't see it going away, but Spark and Tez (both in the Apache 
Incubator) can be more efficient for certain types of pipelines, so it makes 
sense to support them as alternative execution engines. For comparison, work is 
currently underway to make Hive and Pig both take advantage of the more 
flexible DAGs that Tez supports, so it's natural to do something similar in 
Crunch.

> Support new distributed execution engines (e.g., Spark)
> -------------------------------------------------------
>
>                 Key: CRUNCH-296
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-296
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-296.patch
>
>
> I've been working on this off and on for a while, but it's now in a 
> state where I feel it's worth sharing: I came up with an implementation 
> of the Crunch APIs that runs on top of Apache Spark instead of MapReduce.
> My goal is pretty simple: I want to be able to change any instance 
> of "new MRPipeline(...)" to "new SparkPipeline(...)", change nothing 
> else at all, and have my pipelines run on Spark instead of as a series of MR 
> jobs. It turns out that we can do pretty much exactly that. Not everything 
> works yet, but lots of things do: joins and cogroups work, and the PageRank 
> and TfIdf integration tests pass. Some things that I know do not work yet: 
> in-memory joins and some of the more complex file output handling rules, but 
> I believe these are fixable. Something that might or might not work: 
> HBase inputs and outputs on top of Spark.
>
> This is just an idea I had, and I would understand if other people don't want 
> to work on it or don't think it's the right direction for the project. My 
> minimal request would be to include the refactoring of the core APIs 
> necessary to support plugging in new execution frameworks, so I can keep 
> working on this.
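The "plug in a new execution framework by changing one constructor call" idea described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Crunch API: the Pipeline interface, the two implementation classes, and the method names here are all hypothetical stand-ins, and the point is only the shape of the refactoring (user code depends on an interface, and the engine choice is a single `new` expression).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical minimal pipeline abstraction: user code is written
// against this interface, never against a concrete engine.
interface Pipeline {
    // Run a trivial per-record transformation on this engine.
    List<String> run(List<String> input);
    String engineName();
}

// Stand-in for a MapReduce-backed pipeline implementation.
class MRPipeline implements Pipeline {
    public List<String> run(List<String> input) {
        return input.stream().map(String::toUpperCase).collect(Collectors.toList());
    }
    public String engineName() { return "mapreduce"; }
}

// Stand-in for the proposed Spark-backed pipeline implementation.
class SparkPipeline implements Pipeline {
    public List<String> run(List<String> input) {
        return input.stream().map(String::toUpperCase).collect(Collectors.toList());
    }
    public String engineName() { return "spark"; }
}

public class EngineSwapDemo {
    public static void main(String[] args) {
        // The only line a user changes to switch engines:
        Pipeline pipeline = new SparkPipeline(); // was: new MRPipeline()
        List<String> out = pipeline.run(Arrays.asList("page", "rank"));
        System.out.println(pipeline.engineName() + ": " + out);
    }
}
```

Because both engines satisfy the same interface, everything downstream of the constructor call is untouched by the swap, which is exactly the property the patch aims for in Crunch.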



--
This message was sent by Atlassian JIRA
(v6.1#6144)