GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/3247

    [SPARK-4233] [SQL] WIP:Simplify the UDAF API (Interface)

    Simplifying the UDAF API is the first step toward optimizing aggregation
    (see https://issues.apache.org/jira/browse/SPARK-4366).
    
    Currently, UDAFs do not scale as data volume grows, particularly for
    `distinct` aggregation expressions. This PR does not aim to fix `distinct`
    performance; instead, it provides:
    
    * A more straightforward API for UDAF implementation
    Developers no longer need to write a separate distinct expression. For
    example, `DistinctAverage` is unnecessary because `Average` is provided and
    the framework handles `distinct` internally.
    * A schema-ed aggregation buffer
    The aggregation buffer is stored as a `MutableRow` with a schema, so UDAF
    developers can use the Catalyst expression framework when writing a UDAF.
    Shuffling aggregation buffers across machine boundaries is also transparent
    to UDAF developers, which gives us the chance to switch the aggregation
    algorithm (sort-based / hash-based).
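
    To make the two points above concrete for discussion, here is a minimal,
    self-contained sketch. It is not the code in this PR: `BufferField`,
    `AggBuffer`, `SimpleUdaf`, `Aggregator`, and the method names `init`,
    `iterate`, `merge`, and `terminate` are all hypothetical, and a plain array
    stands in for `MutableRow`. It only illustrates a single `Average` whose
    buffer has a declared schema, with `distinct` applied by the framework
    rather than by the UDAF.

```scala
// Hypothetical sketch only: none of these names come from the PR or from Spark.

// One slot of the aggregation buffer: a name plus a type tag (assumed shape).
final case class BufferField(name: String, dataType: String)

// Stand-in for a schema-ed mutable row; a real framework would use the
// schema to serialize the buffer when partial results are shuffled.
final class AggBuffer(size: Int) {
  private val values = new Array[Any](size)
  def get[T](i: Int): T = values(i).asInstanceOf[T]
  def update(i: Int, v: Any): Unit = values(i) = v
}

// The only thing a UDAF author implements in this sketch.
trait SimpleUdaf {
  def bufferSchema: Seq[BufferField]
  def init(buf: AggBuffer): Unit
  def iterate(buf: AggBuffer, input: Double): Unit
  def merge(buf: AggBuffer, other: AggBuffer): Unit
  def terminate(buf: AggBuffer): Option[Double]
}

// A single Average; there is no DistinctAverage counterpart.
object Average extends SimpleUdaf {
  val bufferSchema: Seq[BufferField] =
    Seq(BufferField("sum", "double"), BufferField("count", "long"))

  def init(buf: AggBuffer): Unit = { buf(0) = 0.0; buf(1) = 0L }

  def iterate(buf: AggBuffer, input: Double): Unit = {
    buf(0) = buf.get[Double](0) + input
    buf(1) = buf.get[Long](1) + 1L
  }

  // Combine a partial buffer (e.g. from another partition) into this one.
  def merge(buf: AggBuffer, other: AggBuffer): Unit = {
    buf(0) = buf.get[Double](0) + other.get[Double](0)
    buf(1) = buf.get[Long](1) + other.get[Long](1)
  }

  def terminate(buf: AggBuffer): Option[Double] = {
    val n = buf.get[Long](1)
    if (n == 0L) None else Some(buf.get[Double](0) / n)
  }
}

// Framework side: `distinct` is handled here, outside the UDAF.
// In-memory de-duplication is a placeholder, not a scalable strategy.
object Aggregator {
  def run(udaf: SimpleUdaf, input: Seq[Double], distinct: Boolean): Option[Double] = {
    val buf = new AggBuffer(udaf.bufferSchema.size)
    udaf.init(buf)
    val rows = if (distinct) input.distinct else input
    rows.foreach(udaf.iterate(buf, _))
    udaf.terminate(buf)
  }
}
```

    Under these assumptions, `Aggregator.run(Average, rows, distinct = true)`
    and `distinct = false` share the same `Average` object, and because the
    buffer layout is declared up front, the framework rather than the UDAF can
    decide how partial buffers are moved and merged.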
    
    This is a WIP PR. Hopefully we can discuss the UDAF interface design in
    detail; a sample implementation of `max` and `avg` is also provided.
    
    Once we reach general agreement on the new UDAF design, I will continue to
    finish this PR:
    * Integrate with Hive UDAF (Generic UDAF)
    * Re-implement the existing UDAFs (e.g. `max`, `first`, `last`, etc.)
    * Fix code style issues

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark aggr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3247
    
----
commit bb79ea8086566cf0a64a47f9db6642ff68a1064e
Author: Cheng Hao <[email protected]>
Date:   2014-11-13T14:02:27Z

    WIP:simplify the UDAF interface

----


