GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/21698

    [SPARK-23243] Fix RDD.repartition() data correctness issue

    ## What changes were proposed in this pull request?
    
    RDD repartition distributes data in a round-robin fashion, so there can be 
a data correctness issue if only a subset of the partitions is recomputed on 
fetch failure and the order of the input data is not deterministic.
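    The nondeterminism can be sketched outside Spark. The following is an 
illustrative toy (not Spark's actual implementation): a `round_robin` helper 
assigns records to partitions purely by arrival index, so the same multiset of 
records in a different order lands in different partitions.

    ```python
    # Illustrative sketch, not Spark internals: round-robin assignment of
    # records to num_partitions buckets depends entirely on input order.
    def round_robin(records, num_partitions):
        buckets = [[] for _ in range(num_partitions)]
        for i, rec in enumerate(records):
            buckets[i % num_partitions].append(rec)
        return buckets

    run1 = round_robin([1, 2, 3, 4, 5, 6], 2)  # [[1, 3, 5], [2, 4, 6]]
    # Same records arriving in a different order, e.g. after a partial
    # recompute of an upstream nondeterministic stage:
    run2 = round_robin([2, 1, 3, 4, 5, 6], 2)  # [[2, 3, 5], [1, 4, 6]]
    ```

    Partition 0 holds a different set of records across the two runs, so if 
only some downstream partitions are recomputed, a record can be lost or 
duplicated.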
    
    The RDD data type may not be sortable, so we cannot fully resolve the issue 
by inserting a local sort before the shuffle (though we should still provide 
that solution as an optional choice for RDDs with sortable data types). The 
approach proposed in this PR is to always recompute all the partitions before 
the shuffle on fetch failure, so that we no longer rely on a particular input 
data order.
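    To see why a local sort helps when the data is sortable, here is a toy 
sketch (again, not Spark's code): sorting before the round-robin split makes 
the partition assignment a function of the record values alone, independent of 
arrival order.

    ```python
    # Illustrative sketch: a local sort before the round-robin split makes
    # the assignment deterministic, but only for sortable record types.
    def round_robin_sorted(records, num_partitions):
        buckets = [[] for _ in range(num_partitions)]
        for i, rec in enumerate(sorted(records)):
            buckets[i % num_partitions].append(rec)
        return buckets

    # Both input orders now yield identical partitions:
    round_robin_sorted([2, 1, 3], 2)  # [[1, 3], [2]]
    round_robin_sorted([3, 2, 1], 2)  # [[1, 3], [2]]
    ```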
    
    This feature is gated behind the config 
`spark.shuffle.recomputeAllPartitionsOnRepartitionFailure`. With the feature 
enabled you may observe a higher risk of job failure due to reaching the max 
consecutive stage failure limit, especially for large jobs running on a big 
cluster.
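    Assuming the flag is read like any other Spark conf entry (a hedged sketch, 
not confirmed against the patch), it could be set at session creation:

    ```python
    from pyspark.sql import SparkSession

    # Sketch: enable the proposed flag when building the session; the
    # config key is taken from this PR's description.
    spark = (SparkSession.builder
             .config("spark.shuffle.recomputeAllPartitionsOnRepartitionFailure",
                     "true")
             .getOrCreate())
    ```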
    
    ## How was this patch tested?
    
    TBD
    - Add unit tests.
    - Integration test on my own cluster.
    - Benchmark to ensure no performance regression when fetch failure doesn't 
happen.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark fix-rdd-repartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21698.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21698
    
----
commit b14cd4bfc1b4ba7c0810665bcda87450b9fdff99
Author: Xingbo Jiang <xingbo.jiang@...>
Date:   2018-07-02T12:19:44Z

    fix rdd repartition

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
