[GitHub] spark pull request: assumePartitioned

squito Sun, 08 Feb 2015 02:27:00 -0800

GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/4449


    assumePartitioned

    https://issues.apache.org/jira/browse/SPARK-1061
    
    If you partition an RDD, save it to HDFS, and then reload it in a separate 
SparkContext, the information that the RDD was partitioned is lost.  This 
prevents you from getting the savings of a narrow dependency that would 
otherwise be available.  This is especially painful if you have a big dataset 
on HDFS and periodically get small updates that need to be joined against it.
    
    `assumePartitionedBy` lets you simply assign a partitioner to an RDD, so 
you can get your narrow dependencies back.  It's up to the application to know 
what the partitioner should be, but the method will at least verify that the 
assignment is OK.
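
To make the idea concrete, here is a minimal, Spark-free sketch in Python of what "assign a partitioner and verify the assignment" could mean. Everything in it (`ToyRDD`, `HashPartitioner`, `partition_by`, `reload_from_storage`, `assume_partitioned_by`) is a hypothetical stand-in written for illustration. It is not Spark's API and not the code in this pull request, and the eager key-by-key check is an assumption, since the description above does not say how or when the real verification runs.

```python
from typing import Hashable, List, Optional, Tuple

Pair = Tuple[Hashable, object]


class HashPartitioner:
    """Toy partitioner (hypothetical): maps a key to a partition index."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions

    def get_partition(self, key: Hashable) -> int:
        # Python's % is non-negative for a positive modulus.
        return hash(key) % self.num_partitions


class ToyRDD:
    """Toy dataset (hypothetical): a list of partitions plus optional metadata."""

    def __init__(self, partitions: List[List[Pair]],
                 partitioner: Optional[HashPartitioner] = None) -> None:
        self.partitions = partitions
        self.partitioner = partitioner


def partition_by(pairs: List[Pair], partitioner: HashPartitioner) -> ToyRDD:
    """Place each pair in the partition chosen by the partitioner."""
    parts: List[List[Pair]] = [[] for _ in range(partitioner.num_partitions)]
    for key, value in pairs:
        parts[partitioner.get_partition(key)].append((key, value))
    return ToyRDD(parts, partitioner)


def reload_from_storage(rdd: ToyRDD) -> ToyRDD:
    """Simulate save + reload: the data layout survives, the partitioner does not."""
    return ToyRDD([list(part) for part in rdd.partitions], partitioner=None)


def assume_partitioned_by(rdd: ToyRDD, partitioner: HashPartitioner) -> ToyRDD:
    """Reattach a partitioner after checking that every key sits where it says."""
    if len(rdd.partitions) != partitioner.num_partitions:
        raise ValueError("partition count does not match the partitioner")
    for index, part in enumerate(rdd.partitions):
        for key, _ in part:
            if partitioner.get_partition(key) != index:
                raise ValueError(f"key {key!r} is in the wrong partition")
    return ToyRDD(rdd.partitions, partitioner)


if __name__ == "__main__":
    partitioner = HashPartitioner(3)
    original = partition_by([(i, str(i)) for i in range(10)], partitioner)
    reloaded = reload_from_storage(original)      # partitioner metadata is gone
    restored = assume_partitioned_by(reloaded, partitioner)
    print(reloaded.partitioner is None, restored.partitioner is partitioner)
```

In this toy model, a join between two datasets that carry the same partitioner could pair up partition `i` with partition `i` directly, which is the kind of saving a narrow dependency gives. A reloaded dataset with no partitioner recorded cannot make that promise until the partitioner is reassigned.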

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-1061

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4449.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4449
    
----
commit e041155fc332da933f2ac22b311682331e1bc64a
Author: Imran Rashid <[email protected]>
Date:   2015-02-07T05:54:35Z

    assumePartitioned

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
