[GitHub] spark pull request: SPARK-13335 Use declarative aggregate for coll...

mccheah Sun, 13 Mar 2016 16:35:12 -0700

GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/11688


    SPARK-13335 Use declarative aggregate for collect_list

    The current implementation of collect_list uses the Hive UDAF, which is 
ill-performant especially considering the need for ImperativeAggregate to 
convert between Catalyst types and Scala types.
    
    In the case of collect_list, we can bypass converting the elements of the 
groups and just combine the encompassing arrays.
    
    I added a new unit test in DataFrameAggregateSuite to exercise this code 
path. I don't have a formal comparison between the Hive UDAF and this, 
unfortunately - I did write an imperative aggregate implementation of 
collect_list and found the declarative variant to be significantly faster. More 
specific numbers should be recorded here when we get the chance.
    
    Some things to think about:
    - Code generation for the expressions - I'm not too familiar with writing 
code generation pieces, so it would be good to fill this in.
    - I wonder if we can do better for the memory allocation in the 
expressions. Right now every updateList() call calls Array.:+ which isn't the 
most efficient.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark 
feature/spark-13335-collect-list

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11688.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11688
    
----
commit cb95887e38be02d8534614a07c300ed2bad475f2
Author: mcheah <[email protected]>
Date:   2016-03-13T23:10:22Z

    SPARK-13335 Use declarative aggregate for collect_list
    
    The current implementation of collect_list uses the Hive UDAF, which is
    ill-performant especially considering the need for ImperativeAggregate
    to convert between Catalyst types and Scala types.
    
    In the case of collect_list, we can bypass converting the elements of
    the groups and just combine the encompassing arrays.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: SPARK-13335 Use declarative aggregate for coll...

Reply via email to