GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22112
[WIP][SPARK-23243][Core] Fix RDD.repartition() data correctness issue
## What changes were proposed in this pull request?
An alternative fix for https://github.com/apache/spark/pull/21698
An RDD can take an arbitrary user function, but we have an assumption: the
function should produce the same data set for the same input, though the order
may change. The Spark scheduler must take care of this assumption when a fetch
failure happens, otherwise we may hit the correctness issue described in the JIRA ticket.
Generally speaking, when a map stage gets retried because of a fetch failure,
and this map stage is not idempotent (it produces the same data set but in a
different order each time), and the shuffle partitioner is sensitive to the
input data order (like a round-robin partitioner), we should retry all the reduce tasks.
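To illustrate why a round-robin partitioner is sensitive to input order, here is a minimal sketch in plain Python (no Spark; the function name is illustrative, not Spark's actual code). If a retried map task emits the same records in a different order, the same record can land in a different reduce partition than it did in the first run:

```python
def round_robin_partition(records, num_partitions):
    # Assign record i to partition i % num_partitions, mimicking
    # how a round-robin repartition distributes rows.
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

# A map task produces the same data SET on retry, but in a different
# order (e.g. because upstream shuffle read order is nondeterministic).
run1 = ["a", "b", "c", "d"]
run2 = ["b", "a", "d", "c"]  # same set, different order

p1 = round_robin_partition(run1, 2)
p2 = round_robin_partition(run2, 2)

# Partition contents differ between the two runs, so a reduce task that
# re-fetches only the rerun map output can lose or duplicate records
# unless all reduce tasks are retried as well.
print(p1)  # [['a', 'c'], ['b', 'd']]
print(p2)  # [['b', 'd'], ['a', 'c']]
```

The overall multiset of records is identical across the two runs; only the per-partition assignment changes, which is exactly what makes a partial reduce-stage retry unsafe.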
TODO: document and test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark repartition
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22112.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22112
----
commit 1f9f6e5b020038be1e7c11b9923010465da385aa
Author: Wenchen Fan <wenchen@...>
Date: 2018-08-15T18:38:24Z
fix repartition+shuffle bug
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]