GitHub user DonnyZone opened a pull request:
https://github.com/apache/spark/pull/19175
[SPARK-21964][SQL] Enable splitting the Aggregate (on Expand) into a number
of Aggregates for grouping analytics
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-21964
The current implementation of grouping analytics (i.e., cube, rollup and
grouping sets) is "heavyweight" in many scenarios (e.g., high-dimension cubes),
because the Expand operator produces a large number of projections, resulting
in a large shuffle write. This can lead to low performance or even OOM issues
in direct buffer memory.
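For illustration, here is a minimal sketch (the `sales` DataFrame and its
columns are hypothetical) of how a cube over three columns fans out under the
current Expand-based plan:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("cube-expand-sketch").getOrCreate()
import spark.implicits._

val sales = Seq(("US", "web", "book", 10L), ("EU", "shop", "toy", 5L))
  .toDF("region", "channel", "product", "amount")

// A 3-dimension cube has 2^3 = 8 grouping sets, so Expand emits 8 rows per
// input row before the shuffle; the physical plan contains a single
// Expand + Aggregate pair whose shuffle write grows with the number of
// grouping sets.
sales.cube($"region", $"channel", $"product")
  .agg(sum($"amount").as("total"))
  .explain()
```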
This PR provides an alternative that splits the heavyweight aggregate into a
number of lightweight aggregates, one per group. In effect, it implements the
grouping analytics as a Union and executes the aggregates one by one. Note that
this optimization runs counter to the usual goal of "avoiding redundant data
scans".
The current splitting strategy is simple: one aggregation per group. In the
future, we may explore more intelligent splitting strategies (e.g., a
cost-based method).
## How was this patch tested?
Unit tests
Manual tests in production environment
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DonnyZone/spark spark-21964
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19175.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19175
----
commit da36e37df9c31901975c29dfa77cb7d648e94f40
Author: donnyzone <[email protected]>
Date: 2017-09-09T13:46:48Z
grouping with union
----