GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/21498
[SPARK-24410][SQL][Core][WIP] Optimization for Union outputPartitioning
## What changes were proposed in this pull request?
Currently `Union` has only unknown output partitioning. That said if the
children are bucketed tables, we still run shuffling on the union result if
going to run aggregation on it.
This patch tries to better decide output partitioning for `Union` operator.
This patch adds a private API `zipPartitions` to `RDD`. Since
`zipPartitions` asks a function to run on elements of rdds, it only supports
fixed number of rdds. But for `Union`, the number of children is not fixed.
## How was this patch tested?
TBD.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-24410
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21498.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21498
----
commit 9f25bd19c4802a59039cef6f006f6c6e0802e01d
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-06-05T10:05:27Z
Optimization for Union outputPartitioning.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]