[ 
https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557060#comment-14557060
 ] 

Cheng Hao commented on SPARK-4233:
----------------------------------

The interface changes are motivated by scalability and performance concerns.
For scalability:
1) Can we write a single class that supports both `distinct` and 
`non-distinct` aggregation from the user's point of view?
2) Is there any way to simplify partial aggregation so that EVERY 
aggregate function supports it?

For performance:
1) In a partial aggregation, particularly for a `distinct` aggregate, 
how can we make `seen` faster in ser/deser/shuffling?
2) How can we optimize performance for data-skew cases?

Solution:
Essentially, we refactor the UDAF interface to take the `AggregateBuffer` 
out of the `AggregateFunction`, so that `AggregateFunction` only needs to 
address the following issues:
1) How to apply a row to a given `AggregateBuffer` and `Seens`
2) How to merge two `AggregateBuffer`s
3) How to export the final result from a given `AggregateBuffer`
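
A minimal sketch of what such an interface could look like, written in Python for brevity rather than in Spark's Scala. All names here are hypothetical and are not Spark's actual API; in particular `new_buffer` and the convention that `seen` is `None` for the non-distinct case are assumptions made for illustration. The point is that the function holds no state of its own: the caller owns the buffer and the `seen` set, so one class can serve both the `distinct` and `non-distinct` cases.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, Optional, Set

# Hypothetical stand-in for `AggregateBuffer`; owned by the caller (operator).
Buffer = List[Any]


class AggregateFunction(ABC):
    """Stateless: all running state lives in the buffer passed in."""

    @abstractmethod
    def new_buffer(self) -> Buffer:
        """Assumed helper: create an empty buffer for this function."""

    @abstractmethod
    def update(self, buf: Buffer, value: Any, seen: Optional[Set[Any]]) -> None:
        """Apply one input value; `seen` is a set for DISTINCT, else None."""

    @abstractmethod
    def merge(self, left: Buffer, right: Buffer) -> None:
        """Fold the partial result `right` into `left`."""

    @abstractmethod
    def evaluate(self, buf: Buffer) -> Any:
        """Export the final result from a buffer."""


class Average(AggregateFunction):
    def new_buffer(self) -> Buffer:
        return [0.0, 0]  # [running sum, running count]

    def update(self, buf, value, seen):
        if seen is not None:      # DISTINCT: skip values already applied
            if value in seen:
                return
            seen.add(value)
        buf[0] += value
        buf[1] += 1

    def merge(self, left, right):
        left[0] += right[0]
        left[1] += right[1]

    def evaluate(self, buf):
        return buf[0] / buf[1] if buf[1] else None


def aggregate(fn: AggregateFunction, values: Iterable[Any], distinct: bool = False):
    """What an operator might do for a single group (illustrative only)."""
    buf = fn.new_buffer()
    seen = set() if distinct else None
    for v in values:
        fn.update(buf, v, seen)
    return fn.evaluate(buf)
```

With this shape, `aggregate(Average(), rows)` and `aggregate(Average(), rows, distinct=True)` use the same `Average` class; only the caller decides whether a `seen` set is supplied.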

After that, the execution operator can handle the rest of the 
work and optimizations (probably in a couple of later PRs), such as:
1) Handling `Seen` better with a faster algorithm / implementation
2) Making partial aggregation the default behavior
3) Handling data skew by involving more shuffle stages
4) A better shuffling / SerDe implementation for `AggregateBuffer` (codegen? as 
we have the schema for the `AggregateBuffer`)
...
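
To illustrate why a `merge` step lets partial aggregation become the default, here is a rough, self-contained sketch in Python (not Spark code) of a two-phase, non-distinct average. The function names and the `[sum, count]` buffer layout are assumptions for illustration only. Each partition builds its own per-key buffers, and only those small buffers need to be shuffled and merged.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical buffer layout for an average: [sum, count].
def new_buffer() -> List[float]:
    return [0.0, 0]

def update(buf: List[float], value: float) -> None:
    buf[0] += value
    buf[1] += 1

def merge(left: List[float], right: List[float]) -> None:
    left[0] += right[0]
    left[1] += right[1]

def evaluate(buf: List[float]) -> Optional[float]:
    return buf[0] / buf[1] if buf[1] else None

def partial_aggregate(partition: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Map side: one buffer per key within a single partition."""
    buffers = defaultdict(new_buffer)
    for key, value in partition:
        update(buffers[key], value)
    return dict(buffers)

def final_aggregate(partials: Iterable[Dict[str, List[float]]]) -> Dict[str, Optional[float]]:
    """Reduce side: merge the per-partition buffers, then evaluate."""
    merged = defaultdict(new_buffer)
    for part in partials:
        for key, buf in part.items():
            merge(merged[key], buf)
    return {key: evaluate(buf) for key, buf in merged.items()}

# Two partitions of (key, value) rows.
partitions = [[("a", 1.0), ("b", 2.0)], [("a", 3.0)]]
result = final_aggregate(partial_aggregate(p) for p in partitions)
```

The `distinct` case is deliberately left out here: merging partial sums is not enough when duplicates can appear in different partitions, which is why the handling of `Seen` is listed as a separate item above.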

> Simplify the Aggregation Function implementation
> ------------------------------------------------
>
>                 Key: SPARK-4233
>                 URL: https://issues.apache.org/jira/browse/SPARK-4233
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Cheng Hao
>
> Currently, the UDAF implementation is quite complicated, and we have to 
> provide distinct & non-distinct versions.



