Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@jiangxb1987 Unfortunately I am unable to analyze this in detail, but
hopefully I can give some pointers that help!
One example I can think of is a shuffle that uses an Aggregator (like
combineByKey) via ExternalAppendOnlyMap.
The order in which we replay keys with the same hash is non-deterministic,
from what I remember. For example, if the first run did not result in any
spills, the second run had 3 spills, and the third run had 7, the order of
keys (with the same hash) could be different in each run.
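A toy Python sketch (not Spark's actual Scala code) of why this happens: if spill files are merged by key hash, ties between equal-hash keys are broken by spill order, so the replay order of colliding keys tracks how the records happened to be split into spills. The `replay` helper and the all-collide hash below are hypothetical, purely for illustration.

```python
import heapq

def replay(spills):
    """Merge spill iterators by key hash, hash-merge style.

    Equal hashes are tie-broken by spill position, so the replay
    order of colliding keys depends on the spill boundaries.
    """
    h = lambda k: 0  # toy hash: every key collides
    sorted_spills = [sorted(s, key=h) for s in spills]
    return list(heapq.merge(*sorted_spills, key=h))

records = ["a", "b", "c", "d"]

# Run 1: no spill -- everything replayed from one in-memory map.
one = replay([records])
# Run 2: the same records ended up split across two spill files.
two = replay([["a", "d"], ["b", "c"]])

print(one)  # ['a', 'b', 'c', 'd']
print(two)  # ['a', 'd', 'b', 'c']
```

Same multiset of keys in both runs, but a different replay order for the colliding keys.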
Similarly, with sort-based shuffle, depending on the length of the data
array in AppendOnlyMap (which is determined by whether we spilled or not),
we can get different sort orders?
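The capacity dependence can be illustrated with a toy open-addressed table (a hypothetical sketch, not AppendOnlyMap itself): the array slot each key lands in depends on `hash(key) % capacity` plus probing, so the same keys scanned out of tables of different sizes can come out in different orders. This relies on CPython hashing small ints to themselves.

```python
def scan_order(keys, capacity):
    """Place keys into an open-addressed array with linear probing,
    then return them in array-scan order."""
    table = [None] * capacity
    for k in keys:
        pos = hash(k) % capacity
        while table[pos] is not None:      # linear probe on collision
            pos = (pos + 1) % capacity
        table[pos] = k
    return [k for k in table if k is not None]

keys = [1, 9, 2]  # hash(small int) == the int itself in CPython

# Same keys, different array capacities (e.g. grown after a spill):
small = scan_order(keys, 8)   # 9 collides with 1 at slot 1
large = scan_order(keys, 16)  # no collisions

print(small)  # [1, 9, 2]
print(large)  # [1, 2, 9]
```

With capacity 8, key 9 probes into slot 2 and pushes key 2 to slot 3; with capacity 16 every key gets its own slot, so the scan order differs.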
Similarly, for the actual sort itself, the `merge` is quite clearly
sensitive to the number of spills (for example, when there is no aggregator
or ordering, it is simply `iterators.iterator.flatten`).
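A loose Python analogy for that flatten (again a sketch, not Spark's code): each spill chunk carries its own internal order, and flattening just concatenates the chunks, so the final sequence depends on where the spill boundaries fell. Here each chunk is sorted independently to stand in for a spill's internal order; the helper name is made up.

```python
from itertools import chain

def flatten_spills(spill_chunks):
    """Order each spill chunk internally, then concatenate them,
    loosely analogous to `iterators.iterator.flatten`."""
    return list(chain.from_iterable(sorted(c) for c in spill_chunks))

data = [3, 1, 2]

no_spill  = flatten_spills([data])          # one in-memory chunk
two_spill = flatten_spills([[3, 1], [2]])   # same data, two spill files

print(no_spill)   # [1, 2, 3]
print(two_spill)  # [1, 3, 2]
```

Concatenating independently ordered chunks is not the same as ordering the whole, so different spill counts yield different output orders.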
There might be other cases where this is happening - unfortunately I have
not looked at this part of the codebase regularly in a while now.
Please note that in all the cases above, there is no ordering defined.