Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
IMO an RDD, as a distributed data set, should not guarantee any record order
unless you sort it. So user functions and Spark internal functions should not
expect a specific record order.
However, the round-robin partitioner violates this: it assigns records to
partitions based on their position in the input, so if the record order changes
during a retry, the same record can land in a different partition and we may
get a wrong result. That's why we should fix `repartition` itself rather than
something else.
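To make the order sensitivity concrete, here is a minimal sketch in plain Scala (no Spark dependency; `roundRobinAssign` is a hypothetical helper modeling round-robin distribution, not Spark's internal API):

```scala
object RoundRobinOrderSensitivity {
  // Assign each record to a partition purely by its position in the input,
  // mimicking round-robin distribution.
  def roundRobinAssign[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
    records.zipWithIndex
      .groupBy { case (_, idx) => idx % numPartitions }
      .map { case (p, recs) => p -> recs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val firstAttempt = Seq("a", "b", "c", "d") // record order on the first attempt
    val retryAttempt = Seq("b", "a", "c", "d") // same records, different order on retry

    val first = roundRobinAssign(firstAttempt, 2)
    val retry = roundRobinAssign(retryAttempt, 2)
    for (p <- 0 until 2)
      println(s"partition $p: first=${first(p)}, retry=${retry(p)}")
    // partition 0: first=List(a, c), retry=List(b, c)
    // partition 1: first=List(b, d), retry=List(a, d)
    // "a" moved from partition 0 to partition 1. A downstream stage that reads
    // partition 0 from the first attempt and partition 1 from the retry sees
    // "a" twice and "b" not at all.
  }
}
```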
I agree with @mridulm that this may introduce a big perf penalty. But when
a repartition task fails, we should pay the cost to get the correct result,
instead of producing a wrong result and asking users to deal with it themselves.
I feel this is a better solution than the sort-based one: we should only pay
the cost when we really need to, i.e. when a failure actually happens.
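Assuming "the sort-based one" means making round-robin deterministic by locally sorting records before assignment, a minimal sketch of that alternative could look like this (`deterministicAssign` is a hypothetical helper, not Spark's implementation):

```scala
// Sort records by a deterministic key before round-robin assignment, so a
// record's partition id no longer depends on the input order.
def deterministicAssign[T: Ordering](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.sorted            // every run pays this sort, failure or not
    .zipWithIndex
    .groupBy { case (_, idx) => idx % numPartitions }
    .map { case (p, recs) => p -> recs.map(_._1) }
```

The sort makes retries safe, but its cost is paid on every attempt, even when nothing fails. The approach argued for here defers the cost to the retry path, so a failure-free run pays nothing extra.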