Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    > In theory, this should be done in a cost-based style. Changing the way
    > union combines data will reduce the parallelism.
    > For example, if we union 2 tables that each have 5 partitions: without
    > this PR we will launch 10 tasks to process the data, and locality is easy
    > to satisfy. After this PR, we only launch 5 tasks, locality is harder to
    > meet, and we may incur extra data transfer.
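    
    To make the quoted trade-off concrete, here is a minimal sketch using only
    the standard RDD API (none of this is the PR's code; the numbers and app
    name are made up):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object UnionVsZip {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("union-vs-zip")
          .getOrCreate()
        val sc = spark.sparkContext
    
        val a = sc.parallelize(1 to 100, 5)
        val b = sc.parallelize(101 to 200, 5)
    
        // Plain union keeps every input partition: 10 partitions -> 10 tasks,
        // and each task only needs locality for a single parent partition.
        println(a.union(b).getNumPartitions)                                // 10
    
        // A zip-style combine pairs partitions up: 5 partitions -> 5 tasks,
        // and each task reads one partition from *each* parent, so a single
        // preferred location has to serve both inputs.
        println(a.zipPartitions(b)((i1, i2) => i1 ++ i2).getNumPartitions)  // 5
    
        spark.stop()
      }
    }
    ```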
    
    Yes, the added `ZippedPartitionsRDD` for zipping RDDs works similarly to
    `PartitionerAwareUnionRDD`: the preferred location for each output partition
    will be the most common preferred location among the zipped parent partitions.
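    
    As a rough sketch of that "most common preferred location" idea (a
    hypothetical helper, not the actual code in the PR or in
    `PartitionerAwareUnionRDD`):
    
    ```scala
    object MostCommonLocation {
      // Given the preferred locations of each parent partition being zipped
      // together, pick the host(s) that appear most often, so the zipped task
      // is scheduled where most of its inputs are already local.
      def mostCommon(parentPrefs: Seq[Seq[String]]): Seq[String] = {
        val counts: Map[String, Int] =
          parentPrefs.flatten.groupBy(identity).map { case (host, occ) => host -> occ.size }
        if (counts.isEmpty) Seq.empty
        else {
          val max = counts.values.max
          counts.collect { case (host, n) if n == max => host }.toSeq
        }
      }
    
      def main(args: Array[String]): Unit = {
        // host1 is preferred by all three parent partitions, so it wins.
        println(mostCommon(Seq(Seq("host1"), Seq("host1", "host2"), Seq("host1", "host3"))))
      }
    }
    ```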
    
    It would be good if we could have a smarter, cost-based solution, so that we
    can make a better choice between shuffle and locality/parallelism.
    
    


