GitHub user DonnyZone opened a pull request:

    https://github.com/apache/spark/pull/19175

    [SPARK-21964][SQL] Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-21964
    
    The current implementation of grouping analytics (i.e., cube, rollup and grouping sets) is "heavyweight" in many scenarios (e.g., high-dimension cubes), because the Expand operator produces one projection per grouping set (for an n-column cube, 2^n projected rows per input row), resulting in a vast shuffle write. This can lead to low performance or even OOM errors in direct buffer memory.
    
    This PR provides an alternative that splits the heavyweight aggregate into a number of lightweight aggregates, one per group. In effect, it implements grouping analytics as a Union and executes the aggregates one by one. This optimization runs counter to the usual principle of "avoiding redundant data scans", since the input is scanned once per group.
    
    The current splitting strategy is simple: one aggregation per group. In the future, we may explore more intelligent splitting strategies (e.g., cost-based methods).
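
    As a minimal, hypothetical sketch of the rewrite idea (not the PR's actual code), a cube over two columns can be expressed with the plain DataFrame API as a union of one ordinary aggregate per grouping set, instead of a single Aggregate on top of Expand. The sketch below ignores grouping_id() semantics and the distinction between data NULLs and cube NULLs, which a real rewrite must preserve; the column names and data are made up for illustration.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions._

        val spark = SparkSession.builder().master("local[*]").appName("cube-as-union").getOrCreate()
        import spark.implicits._

        val df = Seq(("x", "p", 1L), ("x", "q", 2L), ("y", "p", 3L)).toDF("a", "b", "v")

        // Heavyweight plan: a single Aggregate over Expand; Expand emits 4 projected
        // rows (one per grouping set of cube(a, b)) for every input row before the shuffle.
        val cubed = df.cube($"a", $"b").agg(sum($"v").as("s"))

        // Lightweight alternative: one small aggregate per grouping set, then a union.
        // Each branch shuffles only the columns it actually groups by.
        val perGroup = Seq(
          df.groupBy($"a", $"b").agg(sum($"v").as("s")),
          df.groupBy($"a").agg(sum($"v").as("s"))
            .select($"a", lit(null).cast("string").as("b"), $"s"),
          df.groupBy($"b").agg(sum($"v").as("s"))
            .select(lit(null).cast("string").as("a"), $"b", $"s"),
          df.agg(sum($"v").as("s"))
            .select(lit(null).cast("string").as("a"), lit(null).cast("string").as("b"), $"s")
        ).reduce(_ union _)

        // Both plans produce the same rows for this data (modulo the NULL-semantics caveat above).
        cubed.orderBy("a", "b").show()
        perGroup.orderBy("a", "b").show()

    Whether the union form actually wins depends on the number of grouping sets versus the cost of re-scanning the input, which is why a cost-based splitting strategy is listed as future work.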
    
    ## How was this patch tested?
    
    Unit tests
    Manual tests in production environment


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DonnyZone/spark spark-21964

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19175.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19175
    
----
commit da36e37df9c31901975c29dfa77cb7d648e94f40
Author: donnyzone <[email protected]>
Date:   2017-09-09T13:46:48Z

    grouping with union

----


---
