GitHub user aray opened a pull request:

    https://github.com/apache/spark/pull/11583

    [SPARK-13749][SQL] Faster pivot implementation for many distinct values 
with two phase aggregation

    ## What changes were proposed in this pull request?
    
    The existing implementation of pivot translates into a single aggregation 
with one aggregate per distinct pivot value. When the number of distinct pivot 
values is large (say 1000+) this can get extremely slow since each input value 
gets evaluated on every aggregate even though it only affects the value of one 
of them.
    
    I'm proposing an alternate strategy for when there are 10+ (a somewhat 
arbitrary threshold) distinct pivot values. We do two phases of aggregation. In 
the first, we group by the grouping columns plus the pivot column and perform 
the specified aggregations (one or sometimes more). In the second, we group by 
the grouping columns and use the new (non-public) PivotFirst aggregate, which 
rearranges the outputs of the first aggregation into an array indexed by the 
pivot value. Finally, we do a projection to extract the array entries into the 
appropriate output columns.
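
The two-phase strategy described above can be sketched in plain Python, 
without Spark. This is only an illustration of the idea, not the code in this 
pull request: the function name `two_phase_pivot`, the `(group, pivot, measure)` 
row layout, and the sample data are assumptions made for the example, and a 
single aggregate (`sum` by default) stands in for the specified aggregations.

```python
from collections import defaultdict


def two_phase_pivot(rows, pivot_values, agg=sum):
    """Pivot (group, pivot, measure) rows using two aggregation phases.

    Hypothetical helper for illustration; not part of Spark.
    """
    # Position of each pivot value in the per-group output array.
    index = {value: i for i, value in enumerate(pivot_values)}

    # Phase 1: group by the grouping column plus the pivot column and
    # aggregate, so each input row is touched by exactly one aggregate.
    buckets = defaultdict(list)
    for group, pivot, measure in rows:
        buckets[(group, pivot)].append(measure)
    partial = {key: agg(values) for key, values in buckets.items()}

    # Phase 2: group by the grouping column only and place each phase-1
    # result into the array slot indexed by its pivot value.
    arrays = {}
    for (group, pivot), value in partial.items():
        slots = arrays.setdefault(group, [None] * len(pivot_values))
        if pivot in index:  # pivot values that were not requested are ignored
            slots[index[pivot]] = value

    # Final projection: turn each array entry into a named output column.
    return {group: dict(zip(pivot_values, slots))
            for group, slots in arrays.items()}


rows = [
    (2012, "dotNET", 10000),
    (2012, "Java", 20000),
    (2012, "dotNET", 5000),
    (2013, "Java", 30000),
]
result = two_phase_pivot(rows, ["dotNET", "Java"])
```

In this sketch the per-row work in the first phase does not depend on how many 
distinct pivot values there are, which is the property the proposal relies on. 
Groups with no data for a pivot value keep `None` in that column.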
    
    ## How was this patch tested?
    
    Additional unit tests in DataFramePivotSuite and manual larger scale 
testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aray/spark fast-pivot

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11583.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11583
    
----
commit 75a101a0c0d1ad646ec28fdb78ebc1087eda8aaa
Author: Andrew Ray <[email protected]>
Date:   2016-02-11T20:49:36Z

    sketch

commit e42cb36dfc398908dbb237bcda0736a5bc03f3c8
Author: Andrew Ray <[email protected]>
Date:   2016-02-12T16:13:08Z

    Merge branch 'master' of https://github.com/apache/spark into fast-pivot

commit b65cfb25c878ceb3084b244c6de586444cdd8d27
Author: Andrew Ray <[email protected]>
Date:   2016-02-19T21:14:06Z

    working version

commit adbcd1b8096f0b05676152cb9f9858f61668f78c
Author: Andrew Ray <[email protected]>
Date:   2016-02-19T22:13:14Z

    Support other datatypes

commit 4b33b473fb1035b86f9c9bd75b688a339ba89a49
Author: Andrew Ray <[email protected]>
Date:   2016-02-23T04:20:46Z

    working analyzer rule

commit d0b0b2f8a9f793364c3db72aba6dfde753377c69
Author: Andrew Ray <[email protected]>
Date:   2016-02-26T16:59:26Z

    Some cleanup and unit tests

commit 7a662ba865d9a4b639b57250ee630aebbd6fe111
Author: Andrew Ray <[email protected]>
Date:   2016-03-07T15:23:47Z

    disable partial agg since it may cause data expansion in some situations 
and at best doesn't do much data reduction.

commit 66e69dbe4b1b2f01927ed41f4160d9eb4615225d
Author: Andrew Ray <[email protected]>
Date:   2016-03-07T15:41:56Z

    Merge branch 'master' of https://github.com/apache/spark into fast-pivot
    
    # Conflicts:
    #   sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

commit b3ccf6110479cf1ca98874ab2165e36da86985e4
Author: Andrew Ray <[email protected]>
Date:   2016-03-08T15:26:08Z

    Unit tests and code restructuring

commit bc0571d539ef893cd8eb63a89c9221b6dff5a4c1
Author: Andrew Ray <[email protected]>
Date:   2016-03-08T16:14:08Z

    fix decimal unit test and remove map from case class constructor to speed 
things up

commit cc9f49fa1509548678aa829c7e1d74c40fa9f2c6
Author: Andrew Ray <[email protected]>
Date:   2016-03-08T17:34:54Z

    support for multiple aggregations and other cleanup

commit bffc7aa9c417b3d768118bd09e242f1a31d6bed2
Author: Andrew Ray <[email protected]>
Date:   2016-03-08T17:42:30Z

    cleanup import

----


