GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/11688
SPARK-13335 Use declarative aggregate for collect_list
The current implementation of collect_list uses the Hive UDAF, which is
ill-performant especially considering the need for ImperativeAggregate to
convert between Catalyst types and Scala types.
In the case of collect_list, we can bypass converting the elements of the
groups and just combine the encompassing arrays.
I added a new unit test in DataFrameAggregateSuite to exercise this code
path. I don't have a formal comparison between the Hive UDAF and this,
unfortunately - I did write an imperative aggregate implementation of
collect_list and found the declarative variant to be significantly faster. More
specific numbers should be recorded here when we get the chance.
Some things to think about:
- Code generation for the expressions - I'm not too familiar with writing
code generation pieces, so it would be good to fill this in.
- I wonder if we can do better for the memory allocation in the
expressions. Right now every updateList() call calls Array.:+ which isn't the
most efficient.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/palantir/spark
feature/spark-13335-collect-list
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11688.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11688
----
commit cb95887e38be02d8534614a07c300ed2bad475f2
Author: mcheah <[email protected]>
Date: 2016-03-13T23:10:22Z
SPARK-13335 Use declarative aggregate for collect_list
The current implementation of collect_list uses the Hive UDAF, which is
ill-performant especially considering the need for ImperativeAggregate
to convert between Catalyst types and Scala types.
In the case of collect_list, we can bypass converting the elements of
the groups and just combine the encompassing arrays.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]