GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/9406

    [SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule

    This is the second PR for SPARK-9241. It adds support for multiple distinct 
columns to the new aggregation code path.
    
    This PR solves the multiple DISTINCT column problem by rewriting Aggregates 
that contain multiple DISTINCT columns into an Expand-Aggregate-Aggregate 
combination. See the [JIRA 
ticket](https://issues.apache.org/jira/browse/SPARK-9241) for background. 
The advantages over the competing [first 
PR](https://github.com/apache/spark/pull/9280) are:
    - This can use the faster TungstenAggregate code path.
    - It cannot OOM because of an ```OpenHashSet``` allocating too much 
memory. However, the rewrite multiplies the number of input rows by the number 
of distinct clauses plus one, which puts a lot more memory pressure on the 
aggregation code path itself.
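    
    To make the Expand-Aggregate-Aggregate idea concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use Spark: the function names, the `gid` column, and the sample query (`SELECT key, COUNT(DISTINCT a), COUNT(DISTINCT b), SUM(c) ... GROUP BY key`) are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input rows: (key, a, b, c).
rows = [
    ("x", 1, "p", 10),
    ("x", 1, "q", 20),
    ("x", 2, "q", 30),
    ("y", 3, "r", 5),
]


def expand(rows):
    """Emit one copy of each row per distinct column, plus one for regular aggregates.

    The illustrative 'gid' column records which aggregate a copy feeds;
    columns that copy does not need are nulled out.
    """
    for key, a, b, c in rows:
        yield (key, None, None, 0, c)   # gid 0: feeds SUM(c)
        yield (key, a, None, 1, None)   # gid 1: feeds COUNT(DISTINCT a)
        yield (key, None, b, 2, None)   # gid 2: feeds COUNT(DISTINCT b)


def first_aggregate(expanded):
    """Group by (key, a, b, gid).

    Grouping on the distinct columns deduplicates their values, while the
    regular aggregate (SUM(c)) is partially computed in the gid 0 group.
    """
    partial = defaultdict(int)
    for key, a, b, gid, c in expanded:
        partial[(key, a, b, gid)] += c if (gid == 0 and c is not None) else 0
    return partial


def second_aggregate(partial):
    """Group by key only and combine the already-deduplicated groups."""
    result = defaultdict(lambda: [0, 0, 0])
    for (key, a, b, gid), sum_c in partial.items():
        out = result[key]
        if gid == 1 and a is not None:
            out[0] += 1        # one group per distinct non-null value of a
        elif gid == 2 and b is not None:
            out[1] += 1        # one group per distinct non-null value of b
        elif gid == 0:
            out[2] += sum_c    # merge the partial sums
    return {key: tuple(values) for key, values in result.items()}


expanded = list(expand(rows))
result = second_aggregate(first_aggregate(expanded))
```

    In this sketch neither aggregation step keeps a per-group set of distinct values, which mirrors the ```OpenHashSet``` point above. The cost is also visible: with two distinct columns, the four sample rows become twelve rows after the expand step.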
    
    The location of this Rule is a bit funny, and should probably change when 
the old aggregation path is changed.
    
    cc @yhuai - Could you also tell me where to add tests for this?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-9241-rewriter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9406.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9406
    
----
commit 9fd77c64165f79d17a12613afee89e51afaf5e00
Author: Herman van Hovell <[email protected]>
Date:   2015-11-02T07:15:55Z

    rebase

commit 6139f473d4edec53bcbe47008832f67f3ef567fb
Author: Herman van Hovell <[email protected]>
Date:   2015-11-02T09:30:52Z

    Fix a few small bugs.

----


