[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

cloud-fan Tue, 12 Jun 2018 11:49:11 -0700

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    In theory, this should be decided in a cost-based way, because changing how 
union combines data reduces parallelism.
    
    For example, suppose we union 2 tables that each have 5 partitions. Without 
this PR we launch 10 tasks to process the data, and locality is easy to 
satisfy. With this PR we launch only 5 tasks, locality is harder to satisfy, 
and we may incur extra data transfer.
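
    The arithmetic can be sketched with a toy model in plain Python. This is 
not Spark code and not the PR's implementation: the host names, the 
`(preferred_host, rows)` partition shape, and the helpers `union_concat` and 
`union_zip` are all hypothetical, and the sketch assumes the PR combines 
partition i of one child with partition i of the other.

```python
# Toy model, not Spark: a partition is (preferred_host, rows).
# Host names and the placement of partitions are made up for illustration.
table_a = [(f"node{i}", [f"a{i}"]) for i in range(5)]
table_b = [(f"node{i + 5}", [f"b{i}"]) for i in range(5)]


def union_concat(left, right):
    """Plain union: every input partition stays its own output partition."""
    return left + right


def union_zip(left, right):
    """Assumed partition-preserving union: merge partition i of both sides."""
    if len(left) != len(right):
        raise ValueError("both sides need the same number of partitions")
    return [
        ((l_host, r_host), l_rows + r_rows)
        for (l_host, l_rows), (r_host, r_rows) in zip(left, right)
    ]


# One task per output partition.
tasks_before = len(union_concat(table_a, table_b))

zipped = union_zip(table_a, table_b)
tasks_after = len(zipped)

# A zipped task can run on only one host, so when its two inputs prefer
# different hosts, at least one input has to be read remotely.
tasks_needing_remote_read = sum(
    1 for (l_host, r_host), _ in zipped if l_host != r_host
)

print(tasks_before, tasks_after, tasks_needing_remote_read)
```

    In this model every concatenated task has a single preferred host, while 
every zipped task has two, which is the locality cost described above. 
Whether that cost outweighs the benefit of preserving the partitioning depends 
on the data, hence the case for a cost-based decision.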
    
    We should move statistics to the physical plan first. cc @wzhfy



