GitHub user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    > Because they have same partitioning, for example, I suppose that first 
partitions of all RDDs are located at the same place?
    
    I really don't think so.
    
    In aggregation we are replacing a needed shuffle with gathering only the 
needed rows from the other partitions. Here, instead, we are _always_ gathering 
the needed rows to maintain the partitioning, in order to avoid a _possible_ 
shuffle which may occur later. I agree that when such a shuffle does occur this 
is an improvement, but when no shuffle is needed after the union I think we can 
have a performance regression.
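    
    To make the trade-off concrete, here is a toy cost model in plain Python. 
It is not Spark code and not the PR's implementation. The function 
`rows_moved`, the row counts and the node placement rule are all hypothetical, 
and the model charges a shuffle for every row, which is a simplification.

```python
# Toy cost model (NOT Spark code): counts rows that must move between nodes.
# Assumptions (hypothetical): every child has the same number of partitions,
# and partition i of child c lives on node (i + c) % num_nodes, i.e. equal
# partitioning does not imply co-location.

def rows_moved(children, num_nodes, shuffle_after_union, partition_aware):
    """children: one list of per-partition row counts for each child."""
    if partition_aware:
        # Partition i of the union is built on the node that holds child 0's
        # partition i; rows of the other children are fetched when remote.
        moved = 0
        for c, child in enumerate(children):
            for i, rows in enumerate(child):
                home = i % num_nodes
                source = (i + c) % num_nodes
                if source != home:
                    moved += rows
        # Partitioning is preserved, so a later shuffle is not needed and
        # shuffle_after_union adds no cost.
        return moved

    # Plain union keeps every partition where it is: nothing moves now, but
    # the partitioning is lost, so a later operation that needs it shuffles.
    if shuffle_after_union:
        return sum(sum(child) for child in children)
    return 0


# Three children, four partitions of 100 rows each, four nodes.
children = [[100] * 4 for _ in range(3)]
for shuffle_after_union in (True, False):
    for partition_aware in (True, False):
        print(shuffle_after_union, partition_aware,
              rows_moved(children, 4, shuffle_after_union, partition_aware))
```

    Under these assumptions the partition-aware union moves 800 rows whether or 
not a shuffle follows, while the plain union moves 1200 rows when a shuffle 
follows and none when it does not. The second case is the regression I am 
worried about.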
    
    Probably we can wait for others' opinions, but it would also be great to 
have some performance tests on both cases and in different scenarios in order 
to better evaluate this change. What do you think?

