[
https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen updated SPARK-5750:
------------------------------
Summary: Document that ordering of elements in shuffled partitions is not
deterministic across runs (was: Document that ordering of elements in
post-shuffle partitions is not deterministic across runs)
> Document that ordering of elements in shuffled partitions is not
> deterministic across runs
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-5750
> URL: https://issues.apache.org/jira/browse/SPARK-5750
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Josh Rosen
>
> The ordering of elements in shuffled partitions is not deterministic across
> runs. For instance, consider the following example:
> {code}
> val largeFiles = sc.textFile(...)
> val airlines = largeFiles.repartition(2000).cache()
> println(airlines.first)
> {code}
> If this code is run twice, then each run will output a different result.
> There is non-determinism in the shuffle read code that accounts for this:
> Spark's shuffle read path processes blocks as soon as they are fetched Spark
> uses
> [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
> to fetch shuffle data from mappers. In this code, requests for multiple
> blocks from the same host are batched together, so nondeterminism in where
> tasks are run means that the set of requests can vary across runs. In
> addition, there's an [explicit
> call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
> to randomize the order of the batched fetch requests. As a result, shuffle
> operations cannot be guaranteed to produce the same ordering of the elements
> in their partitions.
> Therefore, Spark should update its docs to clarify that the ordering of
> elements in shuffle RDDs' partitions is non-deterministic. Note, however,
> that the _set_ of elements in each partition will be deterministic: if we
> used {{mapPartitions}} to sort each partition, then the {{first()}} call
> above would produce a deterministic result.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]