GitHub user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
@viirya I may be wrong, but I am not sure this brings a performance
improvement. The goal here is to avoid a shuffle after the `union` operator
(when it is followed by operators that require a shuffle). But this actually
causes all the data (except for one RDD) to be transferred over the network,
since it collapses all the partitions with the same distribution into a single
one, and it does so even when it is not needed, i.e. when no shuffle is required
afterwards. In that case we might have a performance regression. Am I missing
something?
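
To make the concern concrete, here is a rough toy model in plain Python. It is not Spark code and does not reflect how the PR is implemented: the `(host, row_count)` partition representation, the function names `plain_union` and `collapsed_union`, and the host placement are all assumptions made up for illustration. It assumes the worst case, where co-indexed partitions of the different RDDs sit on different hosts.

```python
# Toy model only: a "dataset" is a list of partitions, and each partition
# is a (host, row_count) pair. None of these names are Spark APIs.

def plain_union(datasets):
    """Concatenate partitions as-is; nothing crosses the network."""
    partitions = [p for ds in datasets for p in ds]
    return partitions, 0  # (resulting partitions, rows moved)


def collapsed_union(datasets):
    """Merge the i-th partition of every dataset into one partition.

    Assumption: the merged partition lives on the host of the first
    dataset's i-th partition, so rows held on other hosts must move.
    """
    merged, moved = [], 0
    for group in zip(*datasets):
        target_host = group[0][0]
        total = 0
        for host, rows in group:
            total += rows
            if host != target_host:
                moved += rows
        merged.append((target_host, total))
    return merged, moved


# Three hypothetical datasets with the same partitioning (two partitions
# of 100 rows each), placed on different hosts.
a = [("host1", 100), ("host2", 100)]
b = [("host3", 100), ("host4", 100)]
c = [("host5", 100), ("host6", 100)]

_, moved_plain = plain_union([a, b, c])
merged, moved_collapsed = collapsed_union([a, b, c])
print(moved_plain, moved_collapsed)  # 0 400
```

In this sketch, 400 of the 600 rows move (everything except the first dataset), and that cost is paid even if no later operator needs a shuffle. When a shuffle does follow, the comparison is different, since the plain union would then pay for moving data in the shuffle itself.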