[I] [EPIC] Improved aggregate function performance [datafusion]

via GitHub Sun, 24 Nov 2024 05:51:46 -0800


alamb opened a new issue, #13548:
URL: https://github.com/apache/datafusion/issues/13548


   ### Is your feature request related to a problem or challenge?
   
   The basic aggregate functions like `COUNT` and `SUM` in DataFusion are 
*very* fast (see [Apache DataFusion is now the fastest single node engine for 
querying Apache Parquet 
files](https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/))
   
   However, many of the other aggregate functions are not particularly fast, 
and this shows up specifically on some of the h2o benchmarks.
   
   We saw this in the results in the [2024 DataFusion SIGMOD 
paper](https://dl.acm.org/doi/10.1145/3626246.3653368)
   ![Screenshot 2024-11-24 at 8 34 35 
AM](https://github.com/user-attachments/assets/72338cd9-3b1d-4feb-ae65-29b9c53ac3da)
   
   (BTW we have made median faster)
   
   @MrPowers has also observed similar results:
   > DataFusion was added to the h2o benchmarks (which are now maintained by 
duckdb) and DataFusion performs quite well for most of the "basic" groupby 
queries.  It performs poorly for some of the advanced questions on the 50GB 
dataset.  Here are the results: 
   > https://duckdblabs.github.io/db-benchmark/
   
   See his version of the benchmarks here
   https://github.com/MrPowers/mrpowers-benchmarks
   
   
   
   ### Describe the solution you'd like
   
   DataFusion has two APIs for implementing aggregate functions like `SUM` and 
`COUNT`:
   - Easy (but slow) way: `Accumulator` ([api 
docs](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html))
   - Fast (but complicated) way: `GroupsAccumulator` ([api 
docs](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html))
   
   The basic aggregates are implemented using `GroupsAccumulator`, which is part 
of the reason for DataFusion's performance.
   
   This ticket tracks the effort to improve the performance of these 
"more advanced" aggregate functions, likely by implementing `GroupsAccumulator`.
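
To make the difference between the two styles concrete, here is a minimal, self-contained sketch. The traits `SimpleAccumulator` and `SimpleGroupsAccumulator` are hypothetical, heavily simplified stand-ins written for this illustration; they are not the real DataFusion traits, which operate on Arrow arrays and have more methods (see the linked API docs). The sketch only shows the shape of the idea: one state object per group versus one object holding the state of all groups in a flat vector indexed by group.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the "easy" style: one accumulator per group.
trait SimpleAccumulator {
    fn update(&mut self, values: &[i64]);
    fn evaluate(&self) -> i64;
}

#[derive(Default)]
struct SumAccumulator {
    sum: i64,
}

impl SimpleAccumulator for SumAccumulator {
    fn update(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
    }

    fn evaluate(&self) -> i64 {
        self.sum
    }
}

/// Hypothetical stand-in for the "fast" style: a single object stores the
/// state of every group, addressed by a dense group index.
trait SimpleGroupsAccumulator {
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize);
    fn evaluate(&self) -> Vec<i64>;
}

#[derive(Default)]
struct SumGroupsAccumulator {
    sums: Vec<i64>,
}

impl SimpleGroupsAccumulator for SumGroupsAccumulator {
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize) {
        // Grow the flat state vector when new groups have appeared.
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0);
        }
        // One tight loop over the whole batch, with no per-group objects.
        for (&value, &group) in values.iter().zip(group_indices) {
            self.sums[group] += value;
        }
    }

    fn evaluate(&self) -> Vec<i64> {
        self.sums.clone()
    }
}

/// Per-group objects: each row is routed to its group's own accumulator.
fn sum_per_group_objects(values: &[i64], group_indices: &[usize]) -> HashMap<usize, i64> {
    let mut accumulators: HashMap<usize, SumAccumulator> = HashMap::new();
    for (&value, &group) in values.iter().zip(group_indices) {
        accumulators.entry(group).or_default().update(&[value]);
    }
    accumulators
        .into_iter()
        .map(|(group, acc)| (group, acc.evaluate()))
        .collect()
}

/// Grouped style: the whole batch is handed to one accumulator at once.
fn sum_grouped(values: &[i64], group_indices: &[usize], total_num_groups: usize) -> Vec<i64> {
    let mut acc = SumGroupsAccumulator::default();
    acc.update_batch(values, group_indices, total_num_groups);
    acc.evaluate()
}
```

Both functions compute the same sums. The grouped version avoids a separate state object and a separate call per group, which is the kind of overhead that tends to matter when there are many distinct groups.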
   
   
   
   ### Describe alternatives you've considered
   
   For each such function, ideally we would:
   1. Add a new benchmark to the ClickBench extended benchmark [Documentation 
Here](https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries)
 in one PR
   2. Implement `GroupsAccumulator` for the relevant aggregate function in a 
second PR (along with tests for correctness). We would use the benchmark to 
verify the performance
   
   
   Here is a pretty good example of how @eejbyfeldt did this for `STDDEV`:
   - https://github.com/apache/datafusion/pull/12095
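
The PR above is the reference for how this was actually done. As a rough, standalone illustration of what flat per-group state for a variance-style aggregate can look like, the sketch below uses Welford's online algorithm and checks it against a naive two-pass computation. The choice of Welford's algorithm, the `StddevGroups` type, and the `reference_stddev` helper are all assumptions made for this example and are not taken from the PR or from the DataFusion API.

```rust
/// Hypothetical grouped state for a sample standard deviation:
/// three flat vectors, each indexed by group.
struct StddevGroups {
    counts: Vec<u64>,
    means: Vec<f64>,
    m2s: Vec<f64>, // running sum of squared deviations from the mean
}

impl StddevGroups {
    fn new() -> Self {
        Self { counts: Vec::new(), means: Vec::new(), m2s: Vec::new() }
    }

    fn update_batch(&mut self, values: &[f64], group_indices: &[usize], total_num_groups: usize) {
        // Grow the state vectors when new groups have appeared.
        if self.counts.len() < total_num_groups {
            self.counts.resize(total_num_groups, 0);
            self.means.resize(total_num_groups, 0.0);
            self.m2s.resize(total_num_groups, 0.0);
        }
        // Welford update, applied to the state of each row's group.
        for (&x, &g) in values.iter().zip(group_indices) {
            self.counts[g] += 1;
            let delta = x - self.means[g];
            self.means[g] += delta / self.counts[g] as f64;
            self.m2s[g] += delta * (x - self.means[g]);
        }
    }

    /// Sample standard deviation per group; `None` for fewer than two values.
    fn evaluate(&self) -> Vec<Option<f64>> {
        self.counts
            .iter()
            .zip(&self.m2s)
            .map(|(&n, &m2)| {
                if n > 1 {
                    Some((m2 / (n - 1) as f64).sqrt())
                } else {
                    None
                }
            })
            .collect()
    }
}

/// Naive two-pass reference used only to check the grouped version.
fn reference_stddev(values: &[f64]) -> Option<f64> {
    let n = values.len();
    if n < 2 {
        return None;
    }
    let mean = values.iter().sum::<f64>() / n as f64;
    let sum_sq: f64 = values.iter().map(|x| (x - mean).powi(2)).sum();
    Some((sum_sq / (n - 1) as f64).sqrt())
}
```

Comparing the grouped result against a simple reference like this is one way to cover the "tests for correctness" part of step 2; the performance part still needs a real benchmark query.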
   
   ### Additional context
   
   _No response_


