[GitHub] spark pull request: SPARK-1770: Load balance elements when reparti...

pwendell Sat, 10 May 2014 18:26:12 -0700

GitHub user pwendell opened a pull request:

    https://github.com/apache/spark/pull/727


    SPARK-1770: Load balance elements when repartitioning.

    This patch adds better balancing when performing a repartition of an
    RDD. Previously, the elements in the RDD were hash partitioned, meaning
    that if the RDD was skewed, certain partitions would end up very large.
    
    This commit adds load balancing of elements across the repartitioned
    RDD splits. The load balancing is not perfect: a given output partition
    can have up to N more elements than the average if there are N input
    partitions. However, some randomization is used to minimize the
    probability that this happens.
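
    To make the bound above concrete, here is a minimal Python sketch of
    one way such balancing could work. It is not the code in this patch:
    the helper names (distribute_partition, repartition) are hypothetical,
    and it assumes each input partition starts at a pseudo-random output
    partition and then deals its elements out round-robin.

```python
import random


def distribute_partition(index, items, num_partitions):
    """Pair each element of one input partition with an output partition.

    Hypothetical helper (an assumption, not Spark's API): pick a
    pseudo-random starting position seeded by the input partition index,
    then advance by one output partition per element (round-robin).
    """
    position = random.Random(index).randrange(num_partitions)
    for item in items:
        position = (position + 1) % num_partitions
        yield position, item


def repartition(input_partitions, num_partitions):
    """Collect the elements of all input partitions into output partitions."""
    output = [[] for _ in range(num_partitions)]
    for index, items in enumerate(input_partitions):
        for target, item in distribute_partition(index, items, num_partitions):
            output[target].append(item)
    return output


# A heavily skewed input: one large partition and three tiny ones.
skewed = [list(range(1000)), [1, 2, 3], [], [7]]
sizes = [len(partition) for partition in repartition(skewed, 8)]
```

    Under this scheme each input partition gives any two output partitions
    element counts that differ by at most one, so with N input partitions
    no output partition exceeds the average by more than N. The random
    starting position makes it unlikely that the "extra" elements from
    many input partitions all land on the same output partition.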

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pwendell/spark load-balance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #727
    
----
commit acfa46aad3140b7d10890b15f13519db684cd2b7
Author: Patrick Wendell <[email protected]>
Date:   2014-05-11T00:59:13Z

    SPARK-1770: Load balance elements when repartitioning.
    
    This patch adds better balancing when performing a repartition of an
    RDD. Previously, the elements in the RDD were hash partitioned, meaning
    that if the RDD was skewed, certain partitions would end up very large.
    
    This commit adds load balancing of elements across the repartitioned
    RDD splits. The load balancing is not perfect: a given output partition
    can have up to N more elements than the average if there are N input
    partitions. However, some randomization is used to minimize the
    probability that this happens.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
