Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21498

> Because they have same partitioning, for example, I suppose that first partitions of all RDDs are located at the same place?

I really don't think so. In aggregation, we are replacing a needed shuffle with gathering only the needed rows from the other partitions. Here we are _always_ gathering the needed rows to maintain the partitioning, in order to avoid a _possible_ shuffle which may occur later. I agree that in such a situation this is an improvement, but if no shuffle is needed after the union, I think we may see a performance regression.

We can probably wait for others' opinions, but it would also be great to have some performance tests on both cases and in different scenarios, in order to better evaluate this change. What do you think?
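
To make the trade-off concrete, here is a rough toy cost model in plain Python. It is not Spark code and does not reflect how the PR is implemented. The function names, the `placement` layout, and the two cost rules (the output partition is built on the node of the first input, and a later shuffle moves every row) are all assumptions made only for illustration.

```python
from typing import List

# Toy model, NOT Spark code. placement[r][p] is the (hypothetical) node id
# that holds partition p of input RDD r.


def rows_moved_partition_aware_union(placement: List[List[int]],
                                     rows_per_partition: int) -> int:
    """Rows fetched when the union keeps the input partitioning.

    Assumption: output partition p is built on the node that holds
    partition p of the first input; partition p of every other input
    is fetched whenever it lives on a different node.
    """
    moved = 0
    for p in range(len(placement[0])):
        target_node = placement[0][p]
        for r in range(1, len(placement)):
            if placement[r][p] != target_node:
                moved += rows_per_partition
    return moved


def rows_moved_plain_union(placement: List[List[int]],
                           rows_per_partition: int,
                           shuffle_after: bool) -> int:
    """Rows moved when the union just concatenates the input partitions.

    Assumption: the union itself moves nothing; if a later operator
    needs a shuffle, every row is counted as moved (worst case).
    """
    if not shuffle_after:
        return 0
    total_partitions = sum(len(parts) for parts in placement)
    return total_partitions * rows_per_partition


# Two inputs, three partitions each, with matching partitions on different nodes.
placement = [[0, 1, 2], [1, 2, 0]]
print(rows_moved_partition_aware_union(placement, 100))             # 300
print(rows_moved_plain_union(placement, 100, shuffle_after=True))   # 600
print(rows_moved_plain_union(placement, 100, shuffle_after=False))  # 0
```

Under these assumptions, keeping the partitioning wins when a shuffle follows the union (300 rows moved instead of 600) and loses when none does (300 instead of 0). Real numbers would depend on data locality and on the query plan, which is why measuring both cases matters.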