GitHub user chenghao-intel opened a pull request:
https://github.com/apache/spark/pull/3247
[SPARK-4233] [SQL] WIP:Simplify the UDAF API (Interface)
Simplifying the UDAF API is the first step toward optimizing aggregation
(see https://issues.apache.org/jira/browse/SPARK-4366).
Currently, UDAFs do not scale well as data volume grows, particularly for
`distinct` aggregation expressions. This PR does not aim to fix the
`distinct` performance; instead, it provides:
* A more straightforward API for UDAF implementation
Developers no longer need to write separate distinct expressions. For
example, `DistinctAverage` is unnecessary because `Average` is provided and
the framework handles `distinct` internally.
* Schema-ed Aggregation Buffer
The aggregation buffer is stored as a `MutableRow` with a schema, which means
UDAF developers can benefit from the Catalyst Expression framework when
writing UDAFs. Shuffling aggregation buffers across machine boundaries is
also transparent to UDAF developers, which gives us a chance to switch the
aggregation algorithm (sort-based / hash-based).
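To make the two points above concrete, here is a minimal, self-contained sketch in plain Scala. It is not the interface proposed in this PR and does not use Spark classes: `Buffer`, `SimpleUdaf`, `Avg`, and `Framework` are hypothetical names chosen for illustration, and `Buffer` is only a stand-in for a schema-ed `MutableRow`. It assumes the UDAF author writes a single `Avg`, and that the framework decides whether to de-duplicate the input and how to merge partial buffers.

```scala
// Hypothetical sketch: none of these names come from the PR or from Spark.

// A fixed-schema buffer with named, ordered slots (stand-in for a MutableRow).
final class Buffer(val schema: Vector[String]) {
  private val values = new Array[Any](schema.length)
  def apply(i: Int): Any = values(i)
  def update(i: Int, v: Any): Unit = { values(i) = v }
}

// The UDAF author only describes the buffer layout and how to update it.
trait SimpleUdaf {
  def bufferSchema: Vector[String]
  def init(buf: Buffer): Unit
  def iterate(buf: Buffer, input: Double): Unit
  def merge(buf: Buffer, other: Buffer): Unit
  def terminate(buf: Buffer): Option[Double]
}

// One Average implementation; there is no separate DistinctAverage.
object Avg extends SimpleUdaf {
  val bufferSchema: Vector[String] = Vector("sum", "count")

  def init(buf: Buffer): Unit = { buf(0) = 0.0; buf(1) = 0L }

  def iterate(buf: Buffer, input: Double): Unit = {
    buf(0) = buf(0).asInstanceOf[Double] + input
    buf(1) = buf(1).asInstanceOf[Long] + 1L
  }

  // Combines a partial buffer (e.g. from another partition) into this one.
  def merge(buf: Buffer, other: Buffer): Unit = {
    buf(0) = buf(0).asInstanceOf[Double] + other(0).asInstanceOf[Double]
    buf(1) = buf(1).asInstanceOf[Long] + other(1).asInstanceOf[Long]
  }

  def terminate(buf: Buffer): Option[Double] = {
    val n = buf(1).asInstanceOf[Long]
    if (n == 0L) None else Some(buf(0).asInstanceOf[Double] / n)
  }
}

// The "framework" side: it owns distinct handling and partial-buffer merging.
object Framework {
  def aggregate(udaf: SimpleUdaf,
                partitions: Seq[Seq[Double]],
                distinct: Boolean): Option[Double] = {
    // For distinct, de-duplicate across all partitions before the UDAF runs.
    val inputs = if (distinct) Seq(partitions.flatten.distinct) else partitions

    // Partial aggregation: one buffer per partition.
    val partials = inputs.map { part =>
      val buf = new Buffer(udaf.bufferSchema)
      udaf.init(buf)
      part.foreach(udaf.iterate(buf, _))
      buf
    }

    // Final aggregation: merge the partial buffers into one result buffer.
    val result = new Buffer(udaf.bufferSchema)
    udaf.init(result)
    partials.foreach(udaf.merge(result, _))
    udaf.terminate(result)
  }
}
```

With the partitions `Seq(Seq(1.0, 1.0), Seq(1.0, 3.0))`, calling `Framework.aggregate(Avg, ..., distinct = false)` yields `Some(1.5)`, while `distinct = true` yields `Some(2.0)`, using the same `Avg` object in both cases. Because the UDAF only touches the buffer through its schema, the driver in this sketch could in principle be replaced by a sort-based or hash-based one without changing `Avg`.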
This is a WIP PR. Hopefully we can discuss the UDAF interface design in
detail; a sample implementation for `max` and `avg` is also provided.
Once we reach general agreement on the new UDAF design, I will continue to
finish this PR:
* Integrate with Hive UDAF (Generic UDAF)
* Re-implement the existing UDAFs (e.g. `max`, `first`, `last`, etc.)
* Fix code style issues
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/chenghao-intel/spark aggr
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3247.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3247
----
commit bb79ea8086566cf0a64a47f9db6642ff68a1064e
Author: Cheng Hao <[email protected]>
Date: 2014-11-13T14:02:27Z
WIP:simplify the UDAF interface
----