[
https://issues.apache.org/jira/browse/CRUNCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wills resolved CRUNCH-296.
-------------------------------
Resolution: Fixed
Fix Version/s: 0.9.0
Ridiculously huge patch committed.
> Support new distributed execution engines (e.g., Spark)
> -------------------------------------------------------
>
> Key: CRUNCH-296
> URL: https://issues.apache.org/jira/browse/CRUNCH-296
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Josh Wills
> Assignee: Josh Wills
> Fix For: 0.9.0
>
> Attachments: CRUNCH-296.patch, CRUNCH-296b.patch, CRUNCH-296c.patch,
> CRUNCH-296d.patch, CRUNCH-296d.patch
>
>
> I've been working on this off and on for a while, and it's now in a
> state where I feel like it's worth sharing: I came up with an implementation
> of the Crunch APIs that runs on top of Apache Spark instead of MapReduce.
> My goal for this is pretty simple; I want to be able to change any instances
> of "new MRPipeline(...)" to "new SparkPipeline(...)", not change anything
> else at all, and have my pipelines run on Spark instead of as a series of MR
> jobs. Turns out that we can pretty much do exactly that. Not everything works
> yet, but lots of things do: joins and cogroups work, and the PageRank and
> TfIdf integration tests work. Things I know do not work yet: in-memory joins
> and some of the more complex file output handling rules, though I believe
> these are fixable. Something that might or might not work: HBase inputs and
> outputs on top of Spark.
> This is just an idea I had, and I would understand if other people don't want
> to work on this or don't think it's the right direction for the project. My
> minimal request would be to include the refactoring of the core APIs
> necessary to support plugging in new execution frameworks so I can keep
> working on this stuff.
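
The goal described above (swap the pipeline constructor, change nothing else) can be illustrated with a minimal sketch. It does not use the Crunch or Spark APIs: ToyPipeline, SequentialPipeline, ParallelPipeline, and WordLengths are hypothetical names invented for this example, standing in for an engine-neutral pipeline interface and two interchangeable execution engines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine-neutral API (an assumption for illustration, not Crunch's Pipeline).
interface ToyPipeline {
    // Applies fn to every record and returns the results in input order.
    <S, T> List<T> parallelDo(List<S> input, Function<S, T> fn);
}

// Hypothetical engine 1: processes records one at a time on the calling thread.
final class SequentialPipeline implements ToyPipeline {
    @Override
    public <S, T> List<T> parallelDo(List<S> input, Function<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S record : input) {
            out.add(fn.apply(record));
        }
        return out;
    }
}

// Hypothetical engine 2: processes records with a parallel stream.
// Collecting an ordered stream to a list preserves the input order.
final class ParallelPipeline implements ToyPipeline {
    @Override
    public <S, T> List<T> parallelDo(List<S> input, Function<S, T> fn) {
        return input.parallelStream().map(fn).collect(Collectors.toList());
    }
}

final class WordLengths {
    // User code depends only on the interface, so it is identical for both engines.
    static List<Integer> run(ToyPipeline pipeline, List<String> words) {
        return pipeline.parallelDo(words, String::length);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("crunch", "spark", "mr");
        // Only the constructor call differs between the two runs.
        System.out.println(run(new SequentialPipeline(), words)); // prints [6, 5, 2]
        System.out.println(run(new ParallelPipeline(), words));   // prints [6, 5, 2]
    }
}
```

The point of the sketch is the shape of the refactoring requested in the description: once user code is written against an interface rather than a concrete engine, a new execution framework can be plugged in by adding another implementation.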
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)