[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

felixcheung Sun, 28 Jan 2018 23:15:08 -0800

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    > Actually for the first case, you shall use coalesce() instead of
    > repartition() to get a similar effect, without need of another shuffle!
    Not quite: coalesce() will not combine partitions across executors (that
    would require a shuffle), so you could still end up with a very large
    number of files.
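
    The distinction drawn above can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark's implementation: the function names `coalesce` and `repartition` mirror the Spark methods, but the grouping rule (merging neighbouring partitions by index) and the round-robin routing are simplifying assumptions made for illustration. Real Spark groups partitions by locality and runs on a cluster.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy narrow merge: whole input partitions are concatenated.

    No record is routed individually, so there is no shuffle.
    Assumption: neighbouring partitions (by index) are merged.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Without a shuffle the partition count can only shrink.
    n = min(n, len(partitions))
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy full shuffle: every record is redistributed round-robin."""
    if n <= 0:
        raise ValueError("n must be positive")
    out: List[Partition] = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


if __name__ == "__main__":
    # Hypothetical skewed input: one large partition and three tiny ones.
    parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
    print([len(p) for p in coalesce(parts, 2)])     # [7, 2]
    print([len(p) for p in repartition(parts, 2)])  # [5, 4]
```

    In this model the merge keeps whatever skew the input partitions already had, while the shuffle evens out the sizes at the cost of moving every record.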
    
    I have seen that quite a bit with large-scale ML. FWIW, though, my
    earlier comment applied to both "regular" use cases and ML use cases.




