Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
I took a quick look at the shuffle writer, and I think it will be hard to
insert a sort there.
I have a simpler proposal for the fix. To trigger this bug, there must be a
shuffle before the `repartition`; queries like `sc.textFile(...).repartition`
have no problem.
We can add a flag (named `fromCoalesce`) to `ShuffleRDD` to indicate whether
it was produced by `RDD#coalesce`. In `DAGScheduler`, if we hit a `FetchFailure`,
fail the job if the shuffle is from `RDD#coalesce` and the previous stage is
also a shuffle map stage. We can provide a config to turn off this check, or
add an `RDD#repartitionBy` which uses a hash partitioner instead of round-robin.
The error message should mention these 2 workarounds.
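For context on why the hash-partitioner workaround is safe: round-robin assigns a record to a partition based on its *position* in the input, so if a retry after a `FetchFailure` delivers the upstream shuffle output in a different order, records move to different partitions. Hash partitioning depends only on the record itself, so placement is stable across retries. A minimal sketch of that distinction (plain Python, the function names are illustrative and not Spark APIs):

```python
def round_robin(records, num_partitions):
    # Partition assignment depends on each record's position in the input,
    # like Spark's round-robin repartition.
    return {rec: i % num_partitions for i, rec in enumerate(records)}

def hash_partition(records, num_partitions):
    # Partition assignment depends only on the record's hash,
    # so it is independent of input order.
    return {rec: hash(rec) % num_partitions for rec in records}

records = ["a", "b", "c", "d"]
reordered = ["d", "c", "b", "a"]  # same data, different arrival order after a retry

# Round-robin placement changes when the input order changes...
assert round_robin(records, 2) != round_robin(reordered, 2)
# ...but hash placement is the same regardless of order.
assert hash_partition(records, 2) == hash_partition(reordered, 2)
```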
In the next release, we can implement the sort or the retry approach as a
better fix.