GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/9406
[SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule
This is the second PR for SPARK-9241. It adds support for multiple distinct
columns to the new aggregation code path.
This PR solves the multiple DISTINCT column problem by rewriting the affected
Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA
ticket](https://issues.apache.org/jira/browse/SPARK-9241) for background.
The advantages over the competing [first
PR](https://github.com/apache/spark/pull/9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an `OpenHashSet` allocating too much
memory.

The trade-off is that this rewrite multiplies the number of input rows by the
number of distinct clauses (plus one), which puts a lot more memory pressure on
the aggregation code path itself.
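
To make the Expand-Aggregate-Aggregate idea concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use Spark. It assumes a hypothetical query `SELECT key, COUNT(DISTINCT a), COUNT(DISTINCT b), SUM(c) FROM t GROUP BY key`, and the function names and the `gid` group-id column are illustrative only. It also simplifies SQL semantics: a group whose `c` values are all NULL yields 0 here rather than NULL.

```python
from collections import defaultdict


def expand(rows):
    """Emit one copy of each input row per distinct column, plus one copy
    for the regular (non-distinct) aggregates. gid tags the copy's purpose."""
    for key, a, b, c in rows:
        yield (key, 0, None, None, c)   # gid 0: feeds SUM(c)
        yield (key, 1, a, None, None)   # gid 1: feeds COUNT(DISTINCT a)
        yield (key, 2, None, b, None)   # gid 2: feeds COUNT(DISTINCT b)


def first_aggregate(expanded):
    """Group by (key, gid, a, b). Grouping on the distinct columns
    deduplicates their values; c is partially summed per group."""
    partial = defaultdict(int)
    for key, gid, a, b, c in expanded:
        partial[(key, gid, a, b)] += c if c is not None else 0
    return partial


def second_aggregate(partial):
    """Group by key only. Each surviving gid 1 or gid 2 row is one distinct
    non-null value, so a plain count is enough; gid 0 rows carry the sums."""
    result = defaultdict(lambda: [0, 0, 0])
    for (key, gid, a, b), sum_c in partial.items():
        acc = result[key]
        if gid == 1 and a is not None:
            acc[0] += 1
        elif gid == 2 and b is not None:
            acc[1] += 1
        elif gid == 0:
            acc[2] += sum_c
    return {key: tuple(acc) for key, acc in result.items()}


def rewrite_distinct(rows):
    """Run the three stages: Expand, then Aggregate, then Aggregate."""
    return second_aggregate(first_aggregate(expand(rows)))


if __name__ == "__main__":
    # Hypothetical input rows of the form (key, a, b, c).
    rows = [("x", 1, 10, 5), ("x", 1, 20, 7), ("x", 2, 10, 1), ("y", 3, None, 4)]
    print(rewrite_distinct(rows))
```

The `expand` step shows where the extra cost comes from: with two distinct clauses, every input row becomes three rows before the first aggregation runs.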
The location of this Rule is a bit awkward and should probably change when
the old aggregation path is changed.
cc @yhuai - Could you also tell me where to add tests for this?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-9241-rewriter
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9406.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9406
----
commit 9fd77c64165f79d17a12613afee89e51afaf5e00
Author: Herman van Hovell <[email protected]>
Date: 2015-11-02T07:15:55Z
rebase
commit 6139f473d4edec53bcbe47008832f67f3ef567fb
Author: Herman van Hovell <[email protected]>
Date: 2015-11-02T09:30:52Z
Fix a few small bugs.
----