[
https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557060#comment-14557060
]
Cheng Hao commented on SPARK-4233:
----------------------------------
The interface changes address scalability and performance concerns.
For scalability:
1) Can we write a single class that supports both `distinct` and
`non-distinct` aggregation from the user's point of view?
2) Is there any way to simplify partial aggregation, so that EVERY
aggregate function supports it?
For performance:
1) In a partial aggregation, particularly for a `distinct` aggregate,
how can we make handling of the `seen` faster during ser/deser/shuffling?
2) How can we optimize performance for data skew cases?
Solution:
Essentially, we refactor the UDAF interface to separate the `AggregateBuffer`
from the `AggregateFunction`, so that an `AggregateFunction` only needs to
address the following issues:
1) How to apply a row to a given `AggregateBuffer` and `Seens`
2) How to merge two `AggregateBuffer`s
3) How to export the final result from a given `AggregateBuffer`
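To illustrate the three responsibilities above, here is a minimal sketch in plain Python (not Spark code). All names in it (`init_buffer`, `update`, `merge`, `result`, `Sum`) are hypothetical, and passing `seen=None` to mean "non-distinct" is just one assumed way to let a single class serve both cases.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Set

# Hypothetical stand-in for `AggregateBuffer`: a plain mutable list.
Buffer = List[Any]


class AggregateFunction(ABC):
    """Holds no state itself; all state lives in the buffer passed in."""

    @abstractmethod
    def init_buffer(self) -> Buffer:
        """Create an empty buffer."""

    @abstractmethod
    def update(self, buf: Buffer, value: Any,
               seen: Optional[Set[Any]] = None) -> None:
        """1) Apply one input value to the buffer (and the seen set)."""

    @abstractmethod
    def merge(self, left: Buffer, right: Buffer) -> Buffer:
        """2) Merge two buffers."""

    @abstractmethod
    def result(self, buf: Buffer) -> Any:
        """3) Export the final result from a buffer."""


class Sum(AggregateFunction):
    def init_buffer(self) -> Buffer:
        return [0]

    def update(self, buf, value, seen=None):
        # Assumption: the operator passes a set only for `distinct`.
        if seen is not None:
            if value in seen:
                return  # value already counted, skip it
            seen.add(value)
        buf[0] += value

    def merge(self, left, right):
        return [left[0] + right[0]]

    def result(self, buf):
        return buf[0]
```

With this shape, the same `Sum` class gives the non-distinct sum when no seen set is supplied and the distinct sum when the caller supplies one.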
After that, the execution operator can handle the rest of the work and
optimizations, such as the following (probably in a couple of later PRs):
1) Handle the `Seen` better, with a faster algorithm / implementation
2) Make partial aggregation support the default behavior
3) Handle data skew by involving more shuffle stages
4) Implement shuffling / SerDe for the `AggregateBuffer` better (codegen? as
we have the schema for the `AggregateBuffer`)
...
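As a rough sketch of why partial aggregation can become the default once functions expose only update/merge/result operations, the plain-Python example below (not Spark code) has a hypothetical operator build per-partition buffers and then merge them. The names `Avg`, `partial_aggregate`, and `final_aggregate` are assumptions made for illustration, and only the non-distinct case is covered.

```python
class Avg:
    """Hypothetical aggregate function; its buffer is [sum, count]."""

    def init_buffer(self):
        return [0.0, 0]

    def update(self, buf, value):
        buf[0] += value
        buf[1] += 1

    def merge(self, left, right):
        return [left[0] + right[0], left[1] + right[1]]

    def result(self, buf):
        return buf[0] / buf[1] if buf[1] else None


def partial_aggregate(func, partition):
    """Map side: build one buffer per group key within a single partition."""
    buffers = {}
    for key, value in partition:
        if key not in buffers:
            buffers[key] = func.init_buffer()
        func.update(buffers[key], value)
    return buffers


def final_aggregate(func, partials):
    """Reduce side: merge the per-partition buffers, then export results."""
    merged = {}
    for buffers in partials:
        for key, buf in buffers.items():
            if key in merged:
                merged[key] = func.merge(merged[key], buf)
            else:
                merged[key] = buf
    return {key: func.result(buf) for key, buf in merged.items()}


# Two partitions of (group key, value) rows.
partitions = [[("a", 1), ("b", 4)], [("a", 3)]]
partials = [partial_aggregate(Avg(), p) for p in partitions]
averages = final_aggregate(Avg(), partials)
```

The operator never inspects the buffer contents, so it works for any function that implements the same operations. A `distinct` aggregate would also need the seen values carried across the shuffle, which this sketch leaves out.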
> Simplify the Aggregation Function implementation
> ------------------------------------------------
>
> Key: SPARK-4233
> URL: https://issues.apache.org/jira/browse/SPARK-4233
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Cheng Hao
>
> Currently, the UDAF implementation is quite complicated, and we have to
> provide both distinct and non-distinct versions.