Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
I did not go over the PR itself in detail, but the proposal sounds very
expensive - particularly given the cascading costs involved.
Also, I am not sure why we are special-casing only coalesce/repartition
here: any closure that depends on the ordering of tuples is bound to fail.
For example, the RDD.zip* variants, sampling in ML, etc. will suffer from the same issue.
Either we fix shuffle itself to become deterministic (which I am not sure
we can do efficiently), or we could simply document this issue for
coalesce and the other relevant APIs, so that users do a sort where applicable, when
they deem the additional cost worth bearing.
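
To illustrate why a sort helps, here is a minimal sketch in plain Python, not
Spark code. The `round_robin` helper is hypothetical and only stands in for an
order-sensitive operation such as a round-robin repartition; the two input
lists are assumed arrival orders of the same records on a first attempt and on
a retry.

```python
def round_robin(records, num_partitions):
    """Assign records to partitions purely by arrival position.

    Hypothetical stand-in for an order-sensitive repartition: the target
    partition depends on where a record appears, not on its value.
    """
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


# Same records, two different arrival orders (assumed for illustration),
# as might happen when a shuffle fetch is retried.
first_attempt = ["a", "b", "c", "d", "e"]
retry_attempt = ["c", "a", "e", "b", "d"]

# Without a sort, the two attempts produce different partition contents.
unsorted_differs = round_robin(first_attempt, 2) != round_robin(retry_attempt, 2)

# Sorting first makes the assignment independent of arrival order.
sorted_matches = (
    round_robin(sorted(first_attempt), 2) == round_robin(sorted(retry_attempt), 2)
)
```

The sort is the extra cost mentioned above: it is paid only by the jobs that
actually need a stable ordering.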
Note that in a lot of cases this is not an issue, for example when
reading from external data stores, checkpointed data, persisted data, etc.
These are typically the situations in which coalesce gets used a lot (to minimize the number of
partitions).