Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    IMO an RDD, as a distributed data set, should not guarantee any record order unless you sort it. So neither user functions nor Spark internal functions should expect a specific record order.
    
    However, the round-robin partitioner violates this: it assigns each record to a partition based purely on the record's position in the input, so if the record order changes during a retry, the same record can land in a different partition and we may get a wrong result. That's why we should fix `repartition` and not something else.
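    To make the hazard concrete, here is a tiny sketch (illustrative only, not Spark's actual code; `RoundRobinSketch` and `assign` are made-up names) showing that position-based round-robin assignment sends the same records to different partitions once the input order changes:

    ```scala
    // Illustrative sketch, not Spark's implementation: round-robin
    // assignment keyed purely on record position within the input.
    object RoundRobinSketch {
      def assign[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
        records.zipWithIndex
          .groupBy { case (_, i) => i % numPartitions } // partition = position mod n
          .map { case (p, rs) => p -> rs.map(_._1) }

      def main(args: Array[String]): Unit = {
        // Same four records; the second run sees them in a different order,
        // e.g. because an upstream task was re-executed.
        val run1 = assign(Seq("a", "b", "c", "d"), 2) // 0 -> (a, c), 1 -> (b, d)
        val run2 = assign(Seq("b", "a", "c", "d"), 2) // 0 -> (b, c), 1 -> (a, d)
        // If one reducer fetched partition 0 from run1 and another fetched
        // partition 1 from run2, "a" is read twice and "b" is never read.
        println(run1)
        println(run2)
      }
    }
    ```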
    
    I agree with @mridulm that this may introduce a big perf penalty. But when a repartition task fails, we should pay the cost to get the correct result, instead of producing a wrong result and asking users to deal with it themselves.
    
    I feel this is a better solution than the sort-based one: we should only pay the cost when we really need to, i.e. when a failure happens.
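    For contrast, a hedged sketch of the sort-based idea (again illustrative, not Spark's code): impose a deterministic local order before the round-robin pass, so every run assigns records identically, at the price of sorting on every run rather than only on failure:

    ```scala
    // Illustrative sketch of the sort-based alternative: a deterministic
    // local sort makes round-robin assignment insensitive to input order.
    object SortedRoundRobinSketch {
      def assignDeterministic[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
        records
          .sortBy(_.hashCode) // deterministic key; hash ties would still need care
          .zipWithIndex
          .groupBy { case (_, i) => i % numPartitions }
          .map { case (p, rs) => p -> rs.map(_._1) }

      def main(args: Array[String]): Unit = {
        // Both input orders now yield identical partitioning, so a retry
        // reproduces the original distribution.
        println(assignDeterministic(Seq("a", "b", "c", "d"), 2))
        println(assignDeterministic(Seq("b", "a", "c", "d"), 2))
      }
    }
    ```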

