ctsk opened a new issue, #17897:
URL: https://github.com/apache/datafusion/issues/17897

   ### Describe the bug
   
   MinMaxBytesAccumulator's update_batch function has runtime that quadratic in 
the number of groups accumulated: On each update_batch call, the implementation 
allocates a new vector that is sized in the number of total groups accumulated. 
As more and more groups are added, this allocation grows and eventually 
dominates runtime.
   
   
https://github.com/apache/datafusion/blob/10a437b826568c27b81d7d16a02b938a13d1a4ad/datafusion/functions-aggregate/src/min_max/min_max_bytes.rs#L438-L456
   
   This completely kills performance for high-cardinality aggregations that use 
this accumulator. One such query would be:
   
   `select l_orderkey, l_partkey, min(l_comment) from lineitem group by 
l_orderkey, l_partkey`
   
   ... ran on a TPC-H dataset.
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   Possible fixes:
   
   - Use a hash table (sized for the number of groups in the batch) instead of 
a vector
   
   Possible follow-ups:
   
   - Reuse the hash table between update_batch calls


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to