GitHub user jiangxb1987 opened a pull request: https://github.com/apache/spark/pull/21698
[SPARK-23243] Fix RDD.repartition() data correctness issue

## What changes were proposed in this pull request?

RDD.repartition() uses a round-robin scheme to distribute data, so a data correctness issue can arise if only a subset of partitions is recomputed on fetch failure and the input data sequence is not deterministic. The RDD data type may not be sortable, so we cannot resolve the whole issue by inserting a local sort before the shuffle (though we should still provide that solution as an optional choice for RDDs with sortable data types). The approach proposed in this PR is to always recompute all the partitions before the shuffle on fetch failure, so we no longer rely on a particular input data sequence. This feature is flagged behind the config `spark.shuffle.recomputeAllPartitionsOnRepartitionFailure`; with the feature on you may observe a higher risk of job failure due to reaching the max consecutive stage failure limit, especially for large jobs running on a big cluster.

## How was this patch tested?

TBD
- Add unit tests.
- Integration test on my own cluster.
- Benchmark to ensure no performance regression when fetch failure doesn't happen.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark fix-rdd-repartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21698.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21698

----

commit b14cd4bfc1b4ba7c0810665bcda87450b9fdff99
Author: Xingbo Jiang <xingbo.jiang@...>
Date: 2018-07-02T12:19:44Z

    fix rdd repartition

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
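The correctness hazard the PR describes can be illustrated with a small standalone model (not Spark code): round-robin assignment sends row `i` to partition `i % n`, so each output partition's contents depend entirely on the input order. If an upstream stage yields rows in a different order on retry and only one output partition is recomputed, rows can be duplicated or lost. The helper name and sample data below are hypothetical.

```python
# Toy model of round-robin repartitioning: row i goes to partition i % n,
# so partition contents depend entirely on the order rows arrive in.
def round_robin(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

rows_first_run = ["a", "b", "c", "d"]   # order seen in the original run
rows_recompute = ["b", "a", "c", "d"]   # a nondeterministic upstream may
                                        # produce a different order on retry

first = round_robin(rows_first_run, 2)   # [["a", "c"], ["b", "d"]]
retry = round_robin(rows_recompute, 2)   # [["b", "c"], ["a", "d"]]

# Suppose a fetch failure causes only output partition 0 to be recomputed,
# while partition 1 keeps its original data. The job's combined output is:
combined = retry[0] + first[1]           # ["b", "c", "b", "d"]

# "b" now appears twice and "a" is gone, even though no row was filtered.
print(sorted(combined))                  # ['b', 'b', 'c', 'd']
print(sorted(rows_first_run))            # ['a', 'b', 'c', 'd']
```

Recomputing all parent partitions on fetch failure (the PR's approach) avoids mixing outputs computed from two different input orders; a local sort before the shuffle fixes the order itself, but only when the row type is sortable.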