[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Un...

viirya Tue, 05 Jun 2018 03:15:02 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/21498


    [SPARK-24410][SQL][Core][WIP] Optimization for Union outputPartitioning

    ## What changes were proposed in this pull request?
    
    Currently `Union` has only unknown output partitioning. That said if the 
children are bucketed tables, we still run shuffling on the union result if 
going to run aggregation on it.
    
    This patch tries to better decide output partitioning for `Union` operator.
    
    This patch adds a private API `zipPartitions` to `RDD`. Since 
`zipPartitions` asks a function to run on elements of rdds, it only supports 
fixed number of rdds. But for `Union`, the number of children is not fixed.
    
    ## How was this patch tested?
    
    TBD.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-24410

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21498
    
----
commit 9f25bd19c4802a59039cef6f006f6c6e0802e01d
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-06-05T10:05:27Z

    Optimization for Union outputPartitioning.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Un...

Reply via email to