[I] Support grouped aggregates with known min/max statistics [datafusion]

via GitHub Thu, 22 Jan 2026 07:59:07 -0800


Dandandan opened a new issue, #19938:
URL: https://github.com/apache/datafusion/issues/19938


   ### Is your feature request related to a problem or challenge?
   
   Currently, grouped aggregates follow this path (simplified)
   
   * create hashes for columns
   * group by hash using a hash table / check equality
   
   The approach is well optimized, but we can avoid a lot of work if we don't 
have to hash and use a hashtable.
   
   ### Describe the solution you'd like
   
   When column statistics, including the range (min/max), are known for a 
group by column and the range is not too large, we can store the groups in a 
`Vec` where the element at index `i` represents the group `min + i`, using 
direct indexing.
   This could save a lot of overhead.
   This is very similar to what is implemented in 
https://github.com/apache/datafusion/pull/19411 for joins.
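
A minimal sketch of the direct-indexing idea, assuming a single `i64` group by column and a `SUM` aggregate. The function name `sum_by_direct_index` and the `max_range` threshold are hypothetical and only for illustration; they are not DataFusion APIs, and the real implementation would operate on Arrow arrays and group accumulators.

```rust
/// Hypothetical helper (not a DataFusion API): sums `values` per group key
/// using direct indexing instead of hashing.
///
/// Returns `None` when the range is wider than `max_range` or when a key
/// falls outside `[min, max]`, so the caller can fall back to the hash path.
fn sum_by_direct_index(
    keys: &[i64],
    values: &[i64],
    min: i64,
    max: i64,
    max_range: usize,
) -> Option<Vec<(i64, i64)>> {
    if keys.len() != values.len() {
        return None;
    }
    // Width of the key range; checked arithmetic guards against overflow.
    let diff = max.checked_sub(min)?;
    if diff < 0 {
        return None;
    }
    let range = (diff as u64).checked_add(1)?;
    if range > max_range as u64 {
        return None;
    }

    // slots[i] holds the running sum for group `min + i`.
    // `None` means the group has not been seen yet.
    let mut slots: Vec<Option<i64>> = vec![None; range as usize];
    for (&key, &value) in keys.iter().zip(values) {
        if key < min || key > max {
            // The statistics were not exact bounds for this input.
            return None;
        }
        let slot = slots[(key - min) as usize].get_or_insert(0);
        *slot += value;
    }

    // Emit only the groups that actually occurred, in key order.
    let mut out = Vec::new();
    for (i, slot) in slots.into_iter().enumerate() {
        if let Some(sum) = slot {
            out.push((min + i as i64, sum));
        }
    }
    Some(out)
}
```

No hash is computed and no equality check is needed: the group index is `key - min`. The sketch spends one slot per possible key, so it only pays off when the range is small relative to the number of rows.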
   
   
   ### Describe alternatives you've considered
   
   We could also consider computing the statistics on the fly and switching 
dynamically from direct indexing to a hash table (i.e. copy all entries to a 
hash table once the range exceeds the maximum).
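
One way this alternative could look, sketched for a `COUNT` aggregate over `i64` keys. The `AdaptiveCounts` type, its methods, and the `max_range` parameter are all hypothetical names chosen for illustration, and the sketch tracks the observed min/max itself rather than using DataFusion statistics.

```rust
use std::collections::HashMap;

/// Internal storage: starts dense, falls back to a hash map.
enum Groups {
    Empty,
    /// counts[i] is the count for key `min + i`.
    Dense { min: i64, counts: Vec<u64> },
    Hashed(HashMap<i64, u64>),
}

/// Hypothetical adaptive group counter (not a DataFusion type).
pub struct AdaptiveCounts {
    max_range: usize,
    groups: Groups,
}

impl AdaptiveCounts {
    pub fn new(max_range: usize) -> Self {
        // At least one slot is needed for the dense representation.
        AdaptiveCounts { max_range: max_range.max(1), groups: Groups::Empty }
    }

    pub fn add(&mut self, key: i64) {
        let groups = std::mem::replace(&mut self.groups, Groups::Empty);
        self.groups = match groups {
            Groups::Empty => Groups::Dense { min: key, counts: vec![1] },
            Groups::Dense { min, mut counts } => {
                let max = min + (counts.len() as i64 - 1);
                let new_min = min.min(key);
                let new_max = max.max(key);
                // i128 avoids overflow for extreme keys.
                let width = (new_max as i128 - new_min as i128 + 1) as u128;
                if width <= self.max_range as u128 {
                    if new_min < min {
                        // Shift existing slots right to make room for smaller keys.
                        let mut grown = vec![0u64; (min - new_min) as usize];
                        grown.extend(counts);
                        counts = grown;
                    }
                    counts.resize(width as usize, 0);
                    counts[(key - new_min) as usize] += 1;
                    Groups::Dense { min: new_min, counts }
                } else {
                    // Range exceeded the maximum: copy all entries to a hash map.
                    let mut map = HashMap::new();
                    for (i, &c) in counts.iter().enumerate() {
                        if c > 0 {
                            map.insert(min + i as i64, c);
                        }
                    }
                    *map.entry(key).or_insert(0) += 1;
                    Groups::Hashed(map)
                }
            }
            Groups::Hashed(mut map) => {
                *map.entry(key).or_insert(0) += 1;
                Groups::Hashed(map)
            }
        };
    }

    pub fn count(&self, key: i64) -> u64 {
        match &self.groups {
            Groups::Empty => 0,
            Groups::Dense { min, counts } => {
                let off = key as i128 - *min as i128;
                if off < 0 || off >= counts.len() as i128 {
                    0
                } else {
                    counts[off as usize]
                }
            }
            Groups::Hashed(map) => map.get(&key).copied().unwrap_or(0),
        }
    }

    pub fn is_dense(&self) -> bool {
        !matches!(self.groups, Groups::Hashed(_))
    }
}
```

The switch is one-way in this sketch: once the keys have spread beyond `max_range`, the counter stays on the hash map. The cost of the fallback is a single pass over the dense slots.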
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

