ctsk opened a new issue, #17897: URL: https://github.com/apache/datafusion/issues/17897
### Describe the bug MinMaxBytesAccumulator's update_batch function has runtime that quadratic in the number of groups accumulated: On each update_batch call, the implementation allocates a new vector that is sized in the number of total groups accumulated. As more and more groups are added, this allocation grows and eventually dominates runtime. https://github.com/apache/datafusion/blob/10a437b826568c27b81d7d16a02b938a13d1a4ad/datafusion/functions-aggregate/src/min_max/min_max_bytes.rs#L438-L456 This completely kills performance for high-cardinality aggregations that use this accumulator. One such query would be: `select l_orderkey, l_partkey, min(l_comment) from lineitem group by l_orderkey, l_partkey` ... ran on a TPC-H dataset. ### To Reproduce _No response_ ### Expected behavior _No response_ ### Additional context Possible fixes: - Use a hash table (sized for the number of groups in the batch) instead of a vector Possible follow-ups: - Reuse the hash table between update_batch calls -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
