[
https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336795#comment-14336795
]
Ilya Ganelin commented on SPARK-5750:
-------------------------------------
Did you have a particular doc in mind to update? I feel like this sort of
comment should go in the programming guide, but there's not really a good spot
for it. One glaring omission in the guide is a general writeup of the shuffle
operation and the role it plays internally. Understanding shuffles is key to
writing stable Spark applications, yet there's hardly any mention of them
outside the tech talks and presentations from the Spark folks. My suggestion
would be to create a section that gives an overview of the shuffle and the
parameters that influence its behavior and stability, and then add this
comment to that section.
> Document that ordering of elements in shuffled partitions is not
> deterministic across runs
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-5750
> URL: https://issues.apache.org/jira/browse/SPARK-5750
> Project: Spark
> Issue Type: Improvement
> Components: Documentation
> Reporter: Josh Rosen
>
> The ordering of elements in shuffled partitions is not deterministic across
> runs. For instance, consider the following example:
> {code}
> val largeFiles = sc.textFile(...)
> val airlines = largeFiles.repartition(2000).cache()
> println(airlines.first)
> {code}
> If this code is run twice, the two runs may print different results.
> There is non-determinism in the shuffle read code that accounts for this:
> Spark's shuffle read path processes blocks as soon as they are fetched. Spark
> uses
> [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
> to fetch shuffle data from mappers. In this code, requests for multiple
> blocks from the same host are batched together, so nondeterminism in where
> tasks are run means that the set of requests can vary across runs. In
> addition, there's an [explicit
> call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
> to randomize the order of the batched fetch requests. As a result, shuffle
> operations cannot be guaranteed to produce the same ordering of the elements
> in their partitions.
> Therefore, Spark should update its docs to clarify that the ordering of
> elements in shuffle RDDs' partitions is non-deterministic. Note, however,
> that the _set_ of elements in each partition will be deterministic: if we
> used {{mapPartitions}} to sort each partition, then the {{first()}} call
> above would produce a deterministic result.
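
The last point in the description (same set of elements per partition,
different order) can be sketched without a cluster. The following is only a
Spark-free illustration: sortWithinPartition and simulatedShuffleRead are
hypothetical helpers, the sample records are made up, and the seeded shuffle is
a stand-in for the arbitrary arrival order of fetched blocks, not Spark's
actual fetch logic.

```scala
import scala.util.Random

object SortedPartitionDemo {
  // Hypothetical helper: materialize one partition and sort it, so the
  // result no longer depends on the order in which elements arrived.
  // Note that this holds the whole partition in memory.
  def sortWithinPartition[T: Ordering](iter: Iterator[T]): Iterator[T] =
    iter.toVector.sorted.iterator

  // Stand-in for a shuffle read (an assumption for illustration only):
  // the same set of elements, delivered in a seed-dependent order.
  def simulatedShuffleRead(elements: Seq[String], seed: Long): Iterator[String] =
    new Random(seed).shuffle(elements).iterator

  def main(args: Array[String]): Unit = {
    // Made-up sample records for a single partition.
    val partition = Seq("UA,SFO", "AA,JFK", "DL,ATL", "WN,DAL")

    val run1 = simulatedShuffleRead(partition, seed = 1L)
    val run2 = simulatedShuffleRead(partition, seed = 2L)

    // After sorting, both simulated runs agree on the first element.
    println(sortWithinPartition(run1).next())
    println(sortWithinPartition(run2).next())
  }
}
```

Against the RDD from the description, the same helper would presumably be
applied as airlines.mapPartitions(iter => sortWithinPartition(iter)).first().
Sorting inside mapPartitions needs no additional shuffle, but it does buffer
each partition in memory, so it suits partitions that fit comfortably on an
executor.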