GitHub user nkronenfeld opened a pull request:
https://github.com/apache/spark/pull/5565
Common interfaces between RDD, DStream, and DataFrame
This PR is the beginning of an attempt to put a common interface between
the main distributed collection types in Spark.
I've tried to make this first checkin towards such a common interface as
simple as possible. To this end, I've taken RDDApi from the sql project,
pulled it up into core (as RDDLike), and changed it as necessary to allow all
three main distributed collection classes to implement it.
I've then done something similar for the pair methods shared by RDD and DStream
(I don't think there is an equivalent for DataFrames).
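To make the idea concrete, here is a minimal sketch of what one shared trait across several collection types can look like. It is not the RDDLike trait from this PR: DistributedLike, ToyBatch, ToyStream and keepEven are hypothetical names, and the toy classes are plain in-memory stand-ins so that the example runs without Spark.

```scala
// Hypothetical shared interface (not the PR's RDDLike). Repr is the concrete
// collection type, so each method returns the same kind of collection it was
// called on.
trait DistributedLike[T, Repr[_]] {
  def map[U](f: T => U): Repr[U]
  def filter(p: T => Boolean): Repr[T]
}

// Toy stand-in for a batch collection (not Spark's RDD).
final case class ToyBatch[T](items: Vector[T]) extends DistributedLike[T, ToyBatch] {
  def map[U](f: T => U): ToyBatch[U] = ToyBatch(items.map(f))
  def filter(p: T => Boolean): ToyBatch[T] = ToyBatch(items.filter(p))
}

// Toy stand-in for a stream of micro-batches (not Spark's DStream).
final case class ToyStream[T](batches: Vector[Vector[T]]) extends DistributedLike[T, ToyStream] {
  def map[U](f: T => U): ToyStream[U] = ToyStream(batches.map(_.map(f)))
  def filter(p: T => Boolean): ToyStream[T] = ToyStream(batches.map(_.filter(p)))
}

object CommonInterfaceSketch {
  // Written once against the shared trait, usable with either collection type.
  def keepEven[C[_]](xs: DistributedLike[Int, C]): C[Int] = xs.filter(_ % 2 == 0)

  def main(args: Array[String]): Unit = {
    println(keepEven(ToyBatch(Vector(1, 2, 3, 4))))
    println(keepEven(ToyStream(Vector(Vector(1, 2), Vector(3, 4)))))
  }
}
```

The point of the sketch is only that code such as keepEven can be written once and applied to every implementing class, which is the kind of reuse a common interface is meant to enable.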
This involves a few small interface changes - things like reduceByKey
having different method signatures in different classes - but they are, for the
moment, minor. That being said, they are still interface changes, and I don't
expect this to get merged in without discussion. So - suggestions and help are
welcome, encouraged, etc.
In the very near future, if this PR is accepted, I would like to expand on
it in a few simple ways:
* I want to try to pull more functions up into this interface
* There are a lot of functions with 3 versions:
    * foo(...)
    * foo(..., numPartitions: Int)
    * foo(..., partitioner: Partitioner)

  These should all be replaceable by
    * foo(..., partitioner: Partitioner = defaultPartitioner)

  with the implicit Int => Partitioner conversion I've put in here. I did
  half of this reduction, in one case (reduceByKey) out of necessity, trying to
  get the implementation contained herein to compile, but extending it as far as
  possible would make a lot of things much cleaner.
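As a rough illustration of the overload reduction described above, the following sketch collapses the three variants into one method with a default argument plus an implicit Int => Partitioner conversion. It is a sketch under assumptions, not the code in this PR: Partitioner and HashPartitioner here are local stand-ins that only mirror the names of Spark's classes, and PairSeq, intToPartitioner and the default of 2 partitions are hypothetical.

```scala
import scala.language.implicitConversions

// Local stand-in mirroring the name of Spark's Partitioner, so the sketch
// runs without Spark on the classpath.
trait Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Local stand-in: assigns a key to a partition by non-negative hash modulo.
final case class HashPartitioner(numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

object Partitioner {
  // Hypothetical conversion: a bare partition count becomes a hash partitioner.
  // Living in the companion object, it is found without an explicit import.
  implicit def intToPartitioner(n: Int): Partitioner = HashPartitioner(n)
}

// Toy keyed collection (not Spark's PairRDDFunctions).
final class PairSeq[K, V](data: Seq[(K, V)]) {
  // Assumed fallback used when the caller passes no partitioner.
  private val defaultPartitioner: Partitioner = HashPartitioner(2)

  // One signature covers foo(...), foo(..., numPartitions: Int) and
  // foo(..., partitioner: Partitioner). Result: partition index -> reduced pairs.
  def reduceByKey(f: (V, V) => V,
                  partitioner: Partitioner = defaultPartitioner): Map[Int, Map[K, V]] =
    data.groupBy { case (k, _) => partitioner.getPartition(k) }.map { case (p, kvs) =>
      p -> kvs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
    }
}

object PartitionerSketch {
  def main(args: Array[String]): Unit = {
    val pairs = new PairSeq(Seq("a" -> 1, "b" -> 2, "a" -> 3))
    println(pairs.reduceByKey(_ + _))                     // default partitioner
    println(pairs.reduceByKey(_ + _, 4))                  // Int converted implicitly
    println(pairs.reduceByKey(_ + _, HashPartitioner(8))) // explicit partitioner
  }
}
```

All three call sites resolve to the same method, which is why a single definition can replace the three overloads. The trade-off is that the conversion applies anywhere a Partitioner is expected, so an Int passed by mistake would also be accepted silently.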
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nkronenfeld/spark-1 feature/common-interface2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5565.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5565
----
commit 9dbbd9ea0e69fd0d5fc5056aeabe4f7efc842cee
Author: Nathan Kronenfeld
Date: 2015-04-17T20:23:26Z
Common interface between RDD, DStream, DataFrame - non-pair methods
commit fb920ffc6e30897e19626f6556af3f0ffc5248bb
Author: Nathan Kronenfeld
Date: 2015-04-17T22:02:20Z
Common interface for PairRDD functions
----