[jira] [Commented] (CRUNCH-296) Support new distributed execution engines (e.g., Spark)

Gabriel Reid (JIRA) Tue, 10 Dec 2013 06:01:30 -0800

    [ 
https://issues.apache.org/jira/browse/CRUNCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844284#comment-13844284
 ]


Gabriel Reid commented on CRUNCH-296:
-------------------------------------

Looks good to me in general, although I'm running into one small issue -- it's 
not compiling for me directly under maven (jdk 1.7.0_40 on Mac OS X), although 
it is fine compiling within Eclipse.

The issue is line 42 in BaseDoTable.java. Replacing that line with 
{code}
    return (CombineFn) fn;
{code} 
resolves the issue for me, although looking at that change that's required to 
get it to compile and the generics info that's being thrown away, I wonder if 
there is something else to worry about there.

At first I wasn't so sure about the new cache() methods on PCollection, but 
thinking about it more I think it's actually even a more logical naming for the 
way that materialize() is currently used to cache results of a computation on 
disk, so I'm all for it.



> Support new distributed execution engines (e.g., Spark)
> -------------------------------------------------------
>
>                 Key: CRUNCH-296
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-296
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-296.patch, CRUNCH-296b.patch, CRUNCH-296c.patch, 
> CRUNCH-296d.patch, CRUNCH-296d.patch
>
>
> I've been working on this off-and-on for awhile, but it's currently in a 
> state where I feel like it's worth sharing: I came up with an implementation 
> of the Crunch APIs that runs on top of Apache Spark instead of MapReduce.
> My goal for this is pretty simple; I want to be able to change any instances 
> of "new MRPipeline(...)" to "new SparkPipeline(...)", not change anything 
> else at all, and have my pipelines run on Spark instead of as a series of MR 
> jobs. Turns out that we can pretty much do exactly that. Not everything works 
> yet, but lots of things do-- joins and cogroups work, the PageRank and TfIdf 
> integration tests work. Some things that do not work that I'm aware of: 
> in-memory joins and some of the more complex file output handling rules, but 
> I believe that these things are fixable. Some thing that might work or might 
> not: HBase inputs and outputs on top of Spark.
> This is just an idea I had, and I would understand if other people don't want 
> to work on this or don't think it's the right direction for the project. My 
> minimal request would be to include the refactoring of the core APIs 
> necessary to support plugging in new execution frameworks so I can keep 
> working on this stuff.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (CRUNCH-296) Support new distributed execution engines (e.g., Spark)

Reply via email to