GitHub user jiangxb1987 opened a pull request:
https://github.com/apache/spark/pull/21698
[SPARK-23243] Fix RDD.repartition() data correctness issue
## What changes were proposed in this pull request?
RDD.repartition() distributes data in a round-robin fashion, so there may be
a data correctness issue if only a subset of the partitions is recomputed on
fetch failure and the input data sequence is not deterministic.
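To see why the round-robin distribution is order-sensitive, here is a minimal sketch in plain Python (no Spark needed; `round_robin_assign` is an illustrative stand-in for the assignment done inside repartition, not Spark's actual code):

```python
# The i-th element of an input partition goes to output partition
# i % num_partitions, so the assignment depends on the order in which
# elements arrive. If a recomputed partition iterates its data in a
# different order, the same element can land in a different output partition.
def round_robin_assign(elems, num_partitions):
    buckets = {p: [] for p in range(num_partitions)}
    for i, e in enumerate(elems):
        buckets[i % num_partitions].append(e)
    return buckets

ordered = round_robin_assign([1, 2, 3, 4], 2)
reordered = round_robin_assign([2, 1, 3, 4], 2)  # same data, different order
# ordered[0] == [1, 3] but reordered[0] == [2, 3]
```

If the stage downstream of the shuffle has already consumed some output partitions, a recomputation that reorders elements like this silently loses or duplicates records.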
The RDD data type may not be sortable, so we cannot resolve the whole issue
by inserting a local sort before the shuffle (though we should still provide
that solution as an optional choice for RDDs with a sortable data type). The
approach proposed in this PR is to always recompute all the partitions before
the shuffle on fetch failure, so we don't rely on a particular input data
sequence.
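The optional local-sort mitigation mentioned above can be sketched the same way (again plain Python for illustration, not Spark's implementation): sorting each partition locally before the round-robin assignment makes the output independent of the arrival order.

```python
# With a local sort first, any reordering of the same elements produces the
# same round-robin assignment -- which is why this mitigation only works
# when the element type is sortable.
def sorted_round_robin(elems, num_partitions):
    buckets = {p: [] for p in range(num_partitions)}
    for i, e in enumerate(sorted(elems)):
        buckets[i % num_partitions].append(e)
    return buckets

a = sorted_round_robin([1, 2, 3, 4], 2)
b = sorted_round_robin([2, 1, 4, 3], 2)  # same data, different order
# a == b: the assignment is now deterministic
```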
This feature is flagged behind the config
`spark.shuffle.recomputeAllPartitionsOnRepartitionFailure`. With the feature
on, you may observe a higher risk of jobs failing due to reaching the max
consecutive stage failure limit, especially for large jobs running on a big
cluster.
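For illustration, the flag would be set like any other Spark conf, e.g. on `spark-submit` (the config name is taken from this PR and is not part of any released Spark version; the class and jar names are placeholders):

```shell
# Enable the proposed recompute-all-on-fetch-failure behavior for this job.
spark-submit \
  --conf spark.shuffle.recomputeAllPartitionsOnRepartitionFailure=true \
  --class com.example.MyJob my-job.jar
```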
## How was this patch tested?
TBD
- Add unit tests.
- Integration test on my own cluster.
- Benchmark to ensure no performance regression when fetch failure doesn't
happen.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jiangxb1987/spark fix-rdd-repartition
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21698.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21698
----
commit b14cd4bfc1b4ba7c0810665bcda87450b9fdff99
Author: Xingbo Jiang <xingbo.jiang@...>
Date: 2018-07-02T12:19:44Z
fix rdd repartition
----
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]