[
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574903#comment-16574903
]
Jiang Xingbo commented on SPARK-23207:
--------------------------------------
This affects versions 2.2 and earlier. The reason we didn't backport the
patch is that it can cause a huge performance regression for the
`repartition()` operation, and the chance of hitting this correctness bug is
small. cc [~smilegator][~sameerag]
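For context on where that regression comes from: the fix makes the row order fed to round-robin partitioning deterministic by sorting records locally first, and that extra sort is the cost. A minimal plain-Scala sketch of the idea (the function name and shape are illustrative, not Spark's actual patch):

```scala
// Hypothetical sketch: sort rows locally before round-robin assignment,
// so partition contents no longer depend on arrival order. The sort is
// the source of the performance regression mentioned above.
def deterministicRoundRobin[T](rows: Seq[T], numPartitions: Int)
                              (implicit ord: Ordering[T]): Map[Int, Seq[T]] =
  rows.sorted.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, pairs) => p -> pairs.map(_._1) }

// Two arrival orders of the same rows now yield identical partitions:
val a = deterministicRoundRobin(Seq(1, 2, 3, 4), 2)
val b = deterministicRoundRobin(Seq(2, 1, 4, 3), 2)
assert(a == b)
```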
> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> -------------------------------------------------------------------
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Jiang Xingbo
> Assignee: Jiang Xingbo
> Priority: Blocker
> Labels: correctness
> Fix For: 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning; the generated
> result is nondeterministic because the sequence of input rows is not
> determined.
> The bug can be triggered when a repartition call follows a shuffle (which
> leads to non-deterministic row ordering), as in the pattern below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition
> stage will be retried and generate an inconsistent ordering, and then some
> tasks of the result stage will be retried and produce different data.
> The following code returns 931532 instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
>
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}
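The nondeterminism the repro relies on can be illustrated without Spark. A minimal plain-Scala sketch (the `roundRobin` helper is hypothetical, not Spark's RoundRobinPartitioning): round-robin assignment keys each row by its arrival position, so the same rows arriving in a different order land in different partitions.

```scala
// Hypothetical sketch: assign rows to numPartitions buckets round-robin
// by arrival order, as round-robin partitioning does conceptually.
def roundRobin[T](rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  rows.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions }
    .map { case (p, pairs) => p -> pairs.map(_._1) }

// Same rows, two arrival orders: partition 0 gets Seq(1, 3) in one run
// and Seq(2, 4) in the other. A retried upstream task that emits rows in
// a different order therefore changes every downstream partition's contents.
val ordered  = roundRobin(Seq(1, 2, 3, 4), 2)
val shuffled = roundRobin(Seq(2, 1, 4, 3), 2)
assert(ordered(0) != shuffled(0))
```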
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)