Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
IMO an RDD, as a distributed data set, should not guarantee any record order
unless you sort it. So user functions and Spark internal functions should not
expect a specific record order.
However, the round-robin partitioner violates this: it assigns records to
partitions based on their position in the input, so if the record order changes
during a retry, the same record can land in a different partition and we may
get a wrong result. That's why we should fix `repartition` itself rather than
something else.
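To make the order sensitivity concrete, here is a minimal sketch in plain Scala (no Spark dependency; `roundRobinAssign` is a hypothetical helper modeling round-robin distribution, not Spark's internal API):

```scala
object RoundRobinOrderSensitivity {
  // Assign each record to a partition purely by its position in the input,
  // mimicking round-robin distribution.
  def roundRobinAssign[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
    records.zipWithIndex
      .groupBy { case (_, idx) => idx % numPartitions }
      .map { case (p, recs) => p -> recs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val firstAttempt = Seq("a", "b", "c", "d") // record order on the first attempt
    val retryAttempt = Seq("b", "a", "c", "d") // same records, different order on retry

    val first = roundRobinAssign(firstAttempt, 2)
    val retry = roundRobinAssign(retryAttempt, 2)
    for (p <- 0 until 2)
      println(s"partition $p: first=${first(p)}, retry=${retry(p)}")
    // partition 0: first=List(a, c), retry=List(b, c)
    // partition 1: first=List(b, d), retry=List(a, d)
    // "a" moved from partition 0 to partition 1. A downstream stage that reads
    // partition 0 from the first attempt and partition 1 from the retry sees
    // "a" twice and "b" not at all.
  }
}
```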
I agree with @mridulm that this may introduce a big perf penalty. But when
a repartition task fails, we should pay the cost to get the correct result,
instead of producing a wrong result and asking users to deal with it themselves.
I feel this is a better solution than the sort-based one: we should only pay
the cost when we really need to, i.e. when a failure actually happens.
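Assuming "the sort-based one" means making round-robin deterministic by locally sorting records before assignment, a minimal sketch of that alternative could look like this (`deterministicAssign` is a hypothetical helper, not Spark's implementation):

```scala
// Sort records by a deterministic key before round-robin assignment, so a
// record's partition id no longer depends on the input order.
def deterministicAssign[T: Ordering](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.sorted            // every run pays this sort, failure or not
    .zipWithIndex
    .groupBy { case (_, idx) => idx % numPartitions }
    .map { case (p, recs) => p -> recs.map(_._1) }
```

The sort makes retries safe, but its cost is paid on every attempt, even when nothing fails. The approach argued for here defers the cost to the retry path, so a failure-free run pays nothing extra.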