Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@shivaram Thinking more, everything which does a zip (or variants/similar
idioms like limit K, etc.) on a partition should be affected - with random +
index in coalesce + shuffle=true being just one special case.
Essentially, anything which assumes that the order of records in a partition
will always be the same is at risk. Currently, only the following should
guarantee a stable order - others need not:
* Reading from an external immutable source like hdfs, etc. (including
checkpoint)
* Reading from block store
* Sorted partitions
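To make the failure mode concrete, here is a small Python sketch (not Spark code - the map/block names are made up) of why a zipWithIndex-style idiom breaks when the shuffle fetch order differs between two attempts of the same task:

```python
# Records produced by three (hypothetical) map tasks for one reduce partition.
blocks = {"map0": ["a", "b"], "map1": ["c", "d"], "map2": ["e", "f"]}

def fetch(arrival_order):
    """One task attempt: records appear in whatever order blocks arrive.
    In a real shuffle, arrival order can change between attempts (e.g.
    after a fetch failure triggers a retry)."""
    return [rec for m in arrival_order for rec in blocks[m]]

def zip_with_index(records):
    """Stand-in for an index-assigning idiom like zipWithIndex."""
    return list(enumerate(records))

# First attempt vs. a retry where blocks arrived in a different order:
run1 = zip_with_index(fetch(["map0", "map1", "map2"]))
run2 = zip_with_index(fetch(["map2", "map0", "map1"]))

print(run1)  # record "a" gets index 0 here ...
print(run2)  # ... but index 2 here: the same record, different indices
```

The record sets are identical across the two runs, but the index assigned to any given record is not - which is exactly what corrupts random + index in coalesce with shuffle=true when a downstream task recomputes.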
The more I think about it, the more I like @sameeragarwal's suggestion in
#20393: a general solution for this could be to introduce deterministic
output for shuffle fetch - when enabled, it takes a more expensive but
repeatable iteration over the shuffle fetch.
This assumes that spark shuffle is always repeatable given the same input (I
have yet to look into this in detail when spills are involved - any thoughts
@sameeragarwal?). Today that could be an implementation detail, but we could
make it a requirement for shuffle.
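A minimal sketch of what such a repeatable fetch mode could look like (pure Python with invented names, not Spark's actual shuffle reader): pay an extra sort keyed on something stable, e.g. originating map id plus the record's offset within its block, so the output order no longer depends on block arrival order.

```python
# Hypothetical shuffle blocks for one reduce partition, keyed by map id.
blocks = {0: ["x", "y"], 1: ["p", "q"], 2: ["m", "n"]}

def fetch(arrival_order):
    """Plain fetch: output order follows block arrival order."""
    return [rec for m in arrival_order for rec in blocks[m]]

def deterministic_fetch(arrival_order):
    """Repeatable fetch: more expensive, but arrival-order independent.
    Tag each record with (map_id, offset_in_block) and sort on that key."""
    tagged = [(m, i, rec)
              for m in arrival_order
              for i, rec in enumerate(blocks[m])]
    tagged.sort(key=lambda t: (t[0], t[1]))  # stable key, not arrival order
    return [rec for _, _, rec in tagged]

# Plain fetch is sensitive to arrival order; deterministic fetch is not:
assert fetch([0, 1, 2]) != fetch([2, 0, 1])
assert deterministic_fetch([0, 1, 2]) == deterministic_fetch([2, 0, 1])
```

The cost is the extra tagging and sort per fetch, which is why it would make sense to enable this only for lineages that actually depend on partition-local ordering.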
Note that we might be able to avoid this additional cost for most of the
current use cases (otherwise we would have faced this problem 2 major
releases ago!); so the actual user impact, hopefully, might not be that high.