alamb opened a new issue, #13548: URL: https://github.com/apache/datafusion/issues/13548
### Is your feature request related to a problem or challenge?

The basic aggregate functions like `COUNT` and `SUM` in DataFusion are *very* fast (see [Apache DataFusion is now the fastest single node engine for querying Apache Parquet files](https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/)).

However, many of the other aggregate functions are not particularly fast, and this shows up specifically on some of the h2o benchmarks.

We saw this in the results in the [2024 DataFusion SIGMOD paper](https://dl.acm.org/doi/10.1145/3626246.3653368) (BTW we have made median faster).

@MrPowers has also observed similar results:

> DataFusion was added to the h2o benchmarks (which are now maintained by duckdb) and DataFusion performs quite well for most of the "basic" groupby queries. It performs poorly for some of the advanced questions on the 50GB dataset. Here are the results:
> https://duckdblabs.github.io/db-benchmark/

See his version of the benchmarks here: https://github.com/MrPowers/mrpowers-benchmarks

### Describe the solution you'd like

DataFusion has two ways to implement aggregate functions like `SUM` and `COUNT`:
- Easy (but slow) way: `Accumulator` ([api docs](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html))
- Fast (but complicated) way: `GroupsAccumulator` ([api docs](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html))

The basic aggregates are implemented using `GroupsAccumulator`, which is part of the reason DataFusion performs well.

This ticket tracks the effort to improve the performance of the "more advanced" aggregate functions, likely by implementing `GroupsAccumulator` for them.

### Describe alternatives you've considered

For each of these functions, ideally we would:
1. Add a new benchmark to the ClickBench extended benchmark ([documentation here](https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries)) in one PR
2. Implement `GroupsAccumulator` for the relevant aggregate function in a second PR (along with tests for correctness). We would use the benchmark to verify the performance improvement

Here is a pretty good example of how @eejbyfeldt did this for `STDDEV`:
- https://github.com/apache/datafusion/pull/12095

### Additional context

_No response_
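For readers less familiar with the `Accumulator` versus `GroupsAccumulator` distinction mentioned above, here is a rough, std-only sketch of why the second style tends to be faster. It does not use DataFusion's real traits: `RowAccumulator`, `BatchGroupsAccumulator`, and both `Sum*` structs are hypothetical, simplified stand-ins, and the real APIs operate on Arrow arrays and return `Result`s. The sketch only assumes that the "easy" style keeps one accumulator object per group, while the "fast" style keeps the state for all groups in one flat vector and updates it a batch at a time.

```rust
/// Hypothetical stand-in for the "easy" style: one accumulator per group.
trait RowAccumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl RowAccumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value;
    }
    fn evaluate(&self) -> f64 {
        self.sum
    }
}

/// Sums `values` per group using one boxed accumulator per group.
/// Every row costs a dynamic dispatch and a pointer chase to a separate heap object.
fn sum_with_per_group_accumulators(
    values: &[f64],
    group_indices: &[usize],
    num_groups: usize,
) -> Vec<f64> {
    let mut accs: Vec<Box<dyn RowAccumulator>> = (0..num_groups)
        .map(|_| Box::new(SumAccumulator::default()) as Box<dyn RowAccumulator>)
        .collect();
    for (&v, &g) in values.iter().zip(group_indices) {
        accs[g].update(v);
    }
    accs.iter().map(|a| a.evaluate()).collect()
}

/// Hypothetical stand-in for the "fast" style: one object owns the state of all groups.
trait BatchGroupsAccumulator {
    /// `group_indices[i]` is the group that `values[i]` belongs to.
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize);
    /// Returns one result per group and resets the internal state.
    fn evaluate(&mut self) -> Vec<f64>;
}

#[derive(Default)]
struct SumGroupsAccumulator {
    sums: Vec<f64>, // sums[g] is the running sum for group g
}

impl BatchGroupsAccumulator for SumGroupsAccumulator {
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize) {
        // Grow the flat state vector if new groups have appeared since the last batch.
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0.0);
        }
        // One tight loop over the whole batch, no per-row dynamic dispatch.
        for (&v, &g) in values.iter().zip(group_indices) {
            self.sums[g] += v;
        }
    }

    fn evaluate(&mut self) -> Vec<f64> {
        std::mem::take(&mut self.sums)
    }
}
```

The extra complexity in the second style comes from having to manage state for every group by hand (growing the vector, and in a real implementation also handling nulls, filters, and partial emission), which is presumably why only the basic aggregates have it so far.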
