GitHub user aray opened a pull request:
https://github.com/apache/spark/pull/11583
[SPARK-13749][SQL] Faster pivot implementation for many distinct values with two-phase aggregation
## What changes were proposed in this pull request?
The existing implementation of pivot translates into a single aggregation
with one aggregate expression per distinct pivot value. When the number of
distinct pivot values is large (say 1000+), this becomes extremely slow,
since every input row is evaluated against every aggregate even though it
affects the value of only one of them.
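For illustration, here is roughly what that translation looks like when written against the public DataFrame API; the real rewrite happens inside the analyzer, and the column names and pivot values below are hypothetical. With 1000+ pivot values, the single `agg` call would contain 1000+ conditional aggregates, each evaluated for every input row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: "grp" is the grouping column, "piv" the pivot column.
val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("grp", "piv", "value")
val pivotValues = Seq("x", "y") // imagine 1000+ of these

// One aggregate expression per distinct pivot value; every input row is
// evaluated against all of them, although it matches exactly one.
val aggs = pivotValues.map(v => sum(when($"piv" === v, $"value")).as(v))
df.groupBy($"grp").agg(aggs.head, aggs.tail: _*).show()
```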
I'm proposing an alternative strategy for when there are 10+ distinct pivot
values (a somewhat arbitrary threshold). We do two phases of aggregation. In
the first, we group by the grouping columns plus the pivot column and perform
the specified aggregations (one or sometimes more). In the second, we group
by the grouping columns only and use the new (non-public) PivotFirst
aggregate, which rearranges the outputs of the first aggregation into an
array indexed by the pivot value. Finally, we do a projection to extract the
array entries into the appropriate output columns.
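Here is a minimal sketch of the proposed two-phase plan in the same public-API terms, with the same hypothetical columns as above. The patch itself performs this rewrite in the analyzer and uses the internal PivotFirst aggregate plus a projection, rather than the first/when emulation shown here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("grp", "piv", "value")
val pivotValues = Seq("x", "y")

// Phase 1: group by the grouping columns plus the pivot column. Each input
// row now contributes to exactly one group and one aggregate.
val phase1 = df.groupBy($"grp", $"piv").agg(sum($"value").as("agg"))

// Phase 2: regroup by the grouping columns only and rearrange the phase-1
// results into one column per pivot value. PivotFirst does this with an
// array indexed by pivot value; first(when(...)) emulates it here, and is
// exact because phase 1 leaves at most one row per (grp, piv) pair.
val cols = pivotValues.map(v => first(when($"piv" === v, $"agg"), ignoreNulls = true).as(v))
phase1.groupBy($"grp").agg(cols.head, cols.tail: _*).show()
```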
## How was this patch tested?
Additional unit tests in DataFramePivotSuite and manual larger-scale
testing.
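For a rough idea of what such manual testing looks like, one can time a pivot over a synthetic column with a couple thousand distinct values, the regime this patch targets; the sizes and column names below are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Synthetic data with ~2000 distinct pivot values.
val wide = spark.range(1000000L)
  .select(($"id" % 100).as("grp"), ($"id" % 2000).as("piv"), $"id".as("value"))

val start = System.nanoTime()
wide.groupBy("grp").pivot("piv").agg(sum("value")).count()
println(s"pivot took ${(System.nanoTime() - start) / 1e9} s")
```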
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aray/spark fast-pivot
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11583.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11583
----
commit 75a101a0c0d1ad646ec28fdb78ebc1087eda8aaa
Author: Andrew Ray <[email protected]>
Date: 2016-02-11T20:49:36Z
sketch
commit e42cb36dfc398908dbb237bcda0736a5bc03f3c8
Author: Andrew Ray <[email protected]>
Date: 2016-02-12T16:13:08Z
Merge branch 'master' of https://github.com/apache/spark into fast-pivot
commit b65cfb25c878ceb3084b244c6de586444cdd8d27
Author: Andrew Ray <[email protected]>
Date: 2016-02-19T21:14:06Z
working version
commit adbcd1b8096f0b05676152cb9f9858f61668f78c
Author: Andrew Ray <[email protected]>
Date: 2016-02-19T22:13:14Z
Support other datatypes
commit 4b33b473fb1035b86f9c9bd75b688a339ba89a49
Author: Andrew Ray <[email protected]>
Date: 2016-02-23T04:20:46Z
working analyzer rule
commit d0b0b2f8a9f793364c3db72aba6dfde753377c69
Author: Andrew Ray <[email protected]>
Date: 2016-02-26T16:59:26Z
Some cleanup and unit tests
commit 7a662ba865d9a4b639b57250ee630aebbd6fe111
Author: Andrew Ray <[email protected]>
Date: 2016-03-07T15:23:47Z
disable partial agg since it may cause data expansion in some situations
and at best doesn't do much data reduction.
commit 66e69dbe4b1b2f01927ed41f4160d9eb4615225d
Author: Andrew Ray <[email protected]>
Date: 2016-03-07T15:41:56Z
Merge branch 'master' of https://github.com/apache/spark into fast-pivot
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
commit b3ccf6110479cf1ca98874ab2165e36da86985e4
Author: Andrew Ray <[email protected]>
Date: 2016-03-08T15:26:08Z
Unit tests and code restructuring
commit bc0571d539ef893cd8eb63a89c9221b6dff5a4c1
Author: Andrew Ray <[email protected]>
Date: 2016-03-08T16:14:08Z
fix decimal unit test and remove map from case class constructor to speed
things up
commit cc9f49fa1509548678aa829c7e1d74c40fa9f2c6
Author: Andrew Ray <[email protected]>
Date: 2016-03-08T17:34:54Z
support for multiple aggregations and other cleanup
commit bffc7aa9c417b3d768118bd09e242f1a31d6bed2
Author: Andrew Ray <[email protected]>
Date: 2016-03-08T17:42:30Z
cleanup import
----