[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

cloud-fan Sun, 28 Jan 2018 23:43:05 -0800

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20414
  
    > Not quite - coalesce will not combine partitions across executors (aka shuffle) so you could still end up having many many files.
    
    I'm not sure I follow. For `coalesce(1)`, Spark just launches a single task to handle all the data partitions. If the final action is saving to a file, we still end up with only one file. Compared to `repartition(1)`, I think the only difference is the cost of task retry.
    
    I think we can special-case `repartition(1)`: if there is only one reducer, we don't need to add the local sort.
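
To illustrate why a single reducer should not need the local sort, here is a toy model in plain Python (not Spark code). It assumes the sort exists only to make round-robin assignment deterministic when a retried task produces the same records in a different order; `round_robin` and `as_multisets` are hypothetical helpers written for this sketch, and Spark's real partitioner differs in detail.

```python
from collections import Counter

def round_robin(records, num_partitions):
    """Toy model: send the i-th record to partition i % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

def as_multisets(parts):
    """Compare partitions by content only, ignoring order within a partition."""
    return [Counter(p) for p in parts]

# Same records, but the retried task emits them in a different order.
first_attempt = ["a", "b", "c", "d"]
retry_attempt = ["b", "a", "d", "c"]

# With two reducers, the retry sends different records to each partition.
two_stable = (as_multisets(round_robin(first_attempt, 2))
              == as_multisets(round_robin(retry_attempt, 2)))

# With one reducer, every record lands in the same partition regardless of order.
one_stable = (as_multisets(round_robin(first_attempt, 1))
              == as_multisets(round_robin(retry_attempt, 1)))

print(two_stable, one_stable)
```

In this model the partition contents change across attempts only when there is more than one reducer, which is the case the local sort is meant to guard against.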


