GitHub user nkronenfeld opened a pull request:

    https://github.com/apache/spark/pull/5565

    Common interfaces between RDD, DStream, and DataFrame

    This PR is the beginning of an attempt to put a common interface between 
the main distributed collection types in Spark.
    
    I've tried to make this first check-in toward such a common interface as simple as possible. To this end, I've taken RDDApi from the sql project, pulled it up into core (as RDDLike), and changed it as necessary so that all three main distributed collection classes can implement it.
    
    I've then done something similar for the pair methods shared by RDD and DStream (I don't think there is an equivalent for DataFrames).
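    To make the shape of the proposal concrete, here is a minimal sketch of what such a common trait might look like. The names (RDDLike, LocalColl) and method set are illustrative assumptions for this sketch, not the PR's actual interface; the real trait would cover many more methods and be implemented by RDD, DStream, and DataFrame rather than a toy local collection.

```scala
// A minimal model of a common distributed-collection trait. Coll[_] is the
// concrete collection type, so map/filter return the implementing class
// rather than the abstract trait (the usual "pull up a shared API" pattern).
trait RDDLike[T, Coll[_]] {
  def map[U](f: T => U): Coll[U]
  def filter(f: T => Boolean): Coll[T]
}

// Toy implementation backed by a local Seq, just to show the abstraction
// compiles and behaves as expected. Hypothetical, not part of the PR.
class LocalColl[T](val data: Seq[T]) extends RDDLike[T, LocalColl] {
  def map[U](f: T => U): LocalColl[U] = new LocalColl(data.map(f))
  def filter(f: T => Boolean): LocalColl[T] = new LocalColl(data.filter(f))
}
```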
    
    This involves a few small interface changes - things like reduceByKey having different method signatures in different classes - but they are, for the moment, minor. That being said, they are still interface changes, and I don't expect this to be merged without discussion. So - suggestions and help are welcome, encouraged, etc.
    
    In the very near future, if this PR is accepted, I would like to expand on 
it in a few simple ways:
    
    * I want to try to pull more functions up into this interface
    * There are a lot of functions with 3 versions:
      * foo(...)
      * foo(..., numPartitions: Int)
      * foo(..., partitioner: Partitioner)
    
      These should all be replaceable by

      * foo(..., partitioner: Partitioner = defaultPartitioner)

      with the implicit Int => Partitioner conversion I've put in here. I did half of this reduction in one case (reduceByKey), out of necessity, to get the implementation contained herein to compile, but extending it as far as possible would make a lot of things much cleaner.
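    The overload-collapsing idea above can be sketched as follows. This is a simplified stand-in, not Spark's actual code: SimplePartitioner, defaultPartitioner, and partitionOf are hypothetical names invented for this example, and Spark's HashPartitioner and defaultPartitioner logic are richer than this.

```scala
import scala.language.implicitConversions

// Minimal Partitioner model for the sketch.
trait Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

class SimplePartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = math.abs(key.hashCode) % numPartitions
}

object Defaults {
  val defaultPartitioner: Partitioner = new SimplePartitioner(4)

  // The implicit conversion: callers may pass a bare Int wherever a
  // Partitioner is expected, so foo(..., numPartitions: Int) is subsumed.
  implicit def intToPartitioner(numPartitions: Int): Partitioner =
    new SimplePartitioner(numPartitions)
}

object Demo {
  import Defaults._

  // One signature with a default replaces the three overloads:
  //   foo(...), foo(..., numPartitions: Int), foo(..., partitioner: Partitioner)
  def partitionOf(key: Any, partitioner: Partitioner = defaultPartitioner): Int =
    partitioner.getPartition(key)
}
```

    Callers can then write `Demo.partitionOf("a")`, `Demo.partitionOf("a", 8)` (the Int is implicitly converted), or pass an explicit Partitioner, all through the single signature.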


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark-1 feature/common-interface2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5565
    
----
commit 9dbbd9ea0e69fd0d5fc5056aeabe4f7efc842cee
Author: Nathan Kronenfeld <[email protected]>
Date:   2015-04-17T20:23:26Z

    Common interface between RDD, DStream, DataFrame - non-pair methods

commit fb920ffc6e30897e19626f6556af3f0ffc5248bb
Author: Nathan Kronenfeld <[email protected]>
Date:   2015-04-17T22:02:20Z

    Common interface for PairRDD functions

----


