[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

mgaido91 Wed, 06 Jun 2018 02:56:33 -0700

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    @viirya I may be wrong, but I am not sure this brings a performance 
improvement. The goal here is to avoid a shuffle after the `union` operator 
(when it is followed by operators that require a shuffle). But this actually 
transfers all the data (except for one RDD) over the network, since it 
collapses all the partitions with the same distribution into a single one, and 
it does so even when it is not needed, i.e. when no shuffle is required 
afterwards. In that case we might have a performance regression. Am I missing 
something?
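
To make the concern concrete, here is a rough back-of-the-envelope sketch in plain Python. It is a toy cost model, not Spark code and not what the PR implements. It assumes that each input RDD lives on different executors, that collapsing leaves exactly one input (here the largest) in place, and that a shuffle moves every row in the worst case. The function names and row counts are hypothetical.

```python
# Toy cost model (assumption-laden sketch, not Spark code): counts the rows
# that would cross the network under the assumptions stated above.

def rows_moved_when_collapsing(rdd_sizes):
    """Rows transferred if same-distribution partitions of all inputs are
    merged into one partition each, keeping the largest input in place."""
    if not rdd_sizes:
        return 0
    return sum(rdd_sizes) - max(rdd_sizes)


def rows_moved_plain_union(rdd_sizes, shuffle_needed_after):
    """Rows transferred by a plain union: none for the union itself, and
    (worst case) every row if a later operator forces a shuffle."""
    return sum(rdd_sizes) if shuffle_needed_after else 0


# Hypothetical row counts for three input RDDs.
sizes = [1_000_000, 800_000, 600_000]

collapsed = rows_moved_when_collapsing(sizes)
for shuffle_needed in (True, False):
    plain = rows_moved_plain_union(sizes, shuffle_needed)
    print(f"shuffle needed after union: {shuffle_needed}, "
          f"plain union moves {plain}, collapsing moves {collapsed}")
```

Under these assumptions, collapsing moves fewer rows than a plain union only when a shuffle is actually required afterwards. When no shuffle follows, the plain union moves nothing while collapsing still moves every input but one, which is the possible regression described above.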



