Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/22112
@tgravescs:
> The shuffle simply transfers the bytes it's supposed to. Spark's shuffle of
those bytes is not consistent in that the order it fetches from can change,
and without a sort happening on that data the order can be different on
rerun. I guess maybe you mean the ShuffledRDD as a whole, or do you mean
something else here?
By shuffle, I am referring to the output of the shuffle, which is consumed
by RDDs that take a `ShuffleDependency` as input.
More specifically, the output of
`SparkEnv.get.shuffleManager.getReader(...).read()`, which RDDs (both
user-defined and Spark's own implementations) use to fetch the output of the
shuffle machinery.
This output is not just the deserialized shuffle bytes: it also has
aggregation applied (if specified) and ordering imposed (if specified).
`ShuffledRDD` is one such usage within Spark core, but others exist both
within Spark core and in user code.
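To make that concrete, here is a minimal runnable sketch; the app name,
`local[2]` master, and sample data are illustrative, and the paraphrase of
`ShuffledRDD.compute` in the comment follows the Spark 2.x sources, so
details may differ across versions:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD

object ShuffleReadDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shuffle-read-demo"))

    val pairs = sc.parallelize(Seq(("b", 1), ("a", 2), ("b", 3)))

    // ShuffledRDD.compute boils down to
    //   SparkEnv.get.shuffleManager
    //     .getReader(dep.shuffleHandle, split.index, split.index + 1, context)
    //     .read()
    // so each partition's iterator already reflects the dependency's
    // aggregator (none set here) and keyOrdering (set below); it is not
    // raw shuffle bytes.
    val shuffled =
      new ShuffledRDD[String, Int, Int](pairs, new HashPartitioner(2))
        .setKeyOrdering(Ordering[String])

    shuffled.collect().foreach(println)
    sc.stop()
  }
}
```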
> All I'm saying is zip is just another variant of this; you could document
it as such and do nothing internal to Spark to "fix it".
I agree; repartition + shuffle, zip, sample, and MLlib usages are all
variants of the same problem: shuffle output order being inconsistent.
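To illustrate the shared failure mode (a hypothetical sketch; the
inconsistency only becomes observable when some partitions are recomputed
after a failure, e.g. a fetch failure, while others are not):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("order-demo"))

    // repartition is a round-robin shuffle with no ordering guarantee, so
    // the order of elements within each output partition can differ
    // between runs, and between the original run and a partial
    // recomputation after a failure.
    val data = sc.parallelize(1 to 1000000).repartition(8)

    // zipWithIndex derives each element's index from the partition-local
    // order, so the element-to-index mapping inherits that inconsistency;
    // zip and sample are exposed in the same way.
    val indexed = data.zipWithIndex()

    println(indexed.take(5).mkString(", "))
    sc.stop()
  }
}
```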
> I guess we can separate out these 2 discussions. I think the point of
this PR is to temporarily work around the data loss/corruption issue with
repartition by failing. So if everyone agrees on that, let's move the
discussion to a JIRA about what to do with the rest of the operators, and
fix repartition here. Thoughts?
Sounds good to me.