Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
> Because they have same partitioning, for example, I suppose that first
> partitions of all RDDs are located at the same place?
I really don't think so.
In aggregation, we replace a shuffle that is actually needed with gathering
only the required rows from the other partitions. Here, we _always_ gather the
rows needed to preserve the partitioning, just to avoid a _possible_ shuffle
that may occur later. I agree that this is an improvement when a shuffle does
follow the union, but when no shuffle is needed after the union I think we can
have a performance regression.
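
To make the trade-off concrete, here is a toy cost model in plain Python. It is not Spark code and does not reflect how Spark actually schedules or places partitions. The function names, the executor placement lists, and the two cost assumptions (the output partition lands where the first child's partition lives; a later shuffle moves every row) are all hypothetical simplifications chosen for illustration only.

```python
def rows_moved_partition_aware(children, placement):
    """Rows fetched remotely by a union that preserves partitioning.

    children[c][i]  -> number of rows in partition i of child c
    placement[c][i] -> executor id holding partition i of child c
    Assumption: output partition i is built on the executor that holds
    partition i of the first child; any other child's partition i that
    lives elsewhere has to be fetched in full.
    """
    moved = 0
    for i in range(len(children[0])):
        target = placement[0][i]
        for c in range(1, len(children)):
            if placement[c][i] != target:
                moved += children[c][i]
    return moved


def rows_moved_plain_union(children, later_shuffle):
    """Rows moved by a plain union that just concatenates partitions.

    Assumption: the union itself moves nothing, and a later shuffle
    (if one is needed) moves every row, which is a pessimistic bound.
    """
    total = sum(sum(child) for child in children)
    return total if later_shuffle else 0


if __name__ == "__main__":
    children = [[100, 100, 100], [100, 100, 100]]
    not_colocated = [[0, 1, 2], [1, 2, 0]]  # same partitioning, different executors

    print("partition-aware union:", rows_moved_partition_aware(children, not_colocated))
    print("plain union, shuffle later:", rows_moved_plain_union(children, True))
    print("plain union, no shuffle later:", rows_moved_plain_union(children, False))
```

Under these assumptions, the partition-aware union moves fewer rows than a plain union followed by a shuffle, but more than a plain union with no shuffle afterwards. If matching partitions happened to be co-located (the case suggested in the quote), the partition-aware cost would drop to zero, which is why real measurements matter more than this sketch.
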
We can probably wait for others' opinions, but it would also be great to
have some performance tests for both cases, across different scenarios, in
order to better evaluate this change. What do you think?