Re: [PR] refactor: Split hash aggregation logic into separated streams [datafusion]

via GitHub Wed, 03 Jun 2026 04:04:08 -0700


2010YOUY01 commented on PR #22729:
URL: https://github.com/apache/datafusion/pull/22729#issuecomment-4611653471


   > I'm curious about the high-level vision: is the plan to close #15591 in 
favor of this new approach?
   
   Yes, the goal is to support blocked state management.
   
   The existing challenge is that the current implementation is hard to extend 
and review. I want to clean things up through this refactor first, and then 
apply the actual change.
   
   > I would like the redesign of hash aggregation to take into account the 
memory constraints imposed by the finite memory pool, i.e. how does the 
implementation perform under OOM conditions.
   > 
   > * how do we improve memory accounting (see [Hash aggregation produces 
batches reporting huge memory size 
#22526](https://github.com/apache/datafusion/issues/22526)).
   > * how do we avoid excessive memory allocations during OOM condition (see 
[fix: reduce memory allocation overhead during partial aggregation ear… 
#22165](https://github.com/apache/datafusion/pull/22165))
   > * other issues such as [[EPIC] Eliminate Long Polls in HashAggregate via 
Chunked Storage and Incremental Emission 
#19906](https://github.com/apache/datafusion/issues/19906)
   > 
   > Otherwise we'll end up with the same issues that exist now. E.g. 
EmitTo::First(n) wasn't designed for emitting a large portion of the existing 
groups, so it over-allocated when used for emitting early in partial 
aggregation OOM case.
   
   All of these issues are symptoms of managing state in a large contiguous 
`Vec`. Blocked memory allocation should address them naturally.
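
   As a rough illustration of the difference (a minimal sketch, not DataFusion's actual types: `BlockedStore`, `block_size`, and `emit_first_block` are hypothetical names used only here), state kept in fixed-size blocks can grow without reallocating existing data and can hand off a prefix without copying or shifting the rest:

```rust
use std::collections::VecDeque;

/// Hypothetical block-based store, for illustration only.
/// Values live in fixed-capacity blocks instead of one contiguous `Vec`.
pub struct BlockedStore<T> {
    block_size: usize,
    blocks: VecDeque<Vec<T>>,
}

impl<T> BlockedStore<T> {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self {
            block_size,
            blocks: VecDeque::new(),
        }
    }

    /// Appends a value. Only a new block of `block_size` capacity is ever
    /// allocated; values already stored are never moved to a larger buffer.
    pub fn push(&mut self, value: T) {
        let need_new_block = self
            .blocks
            .back()
            .map_or(true, |block| block.len() == self.block_size);
        if need_new_block {
            self.blocks.push_back(Vec::with_capacity(self.block_size));
        }
        self.blocks.back_mut().unwrap().push(value);
    }

    /// Total number of stored values.
    pub fn len(&self) -> usize {
        self.blocks.iter().map(Vec::len).sum()
    }

    /// Looks up a value by its position among the values currently stored.
    /// This works because every block except the last one is full.
    pub fn get(&self, index: usize) -> Option<&T> {
        self.blocks
            .get(index / self.block_size)?
            .get(index % self.block_size)
    }

    /// Hands the oldest block to the caller without copying its values
    /// and without shifting the remaining blocks' contents.
    pub fn emit_first_block(&mut self) -> Option<Vec<T>> {
        self.blocks.pop_front()
    }
}
```

   With a single contiguous `Vec`, emitting the first `n` values (for example via `drain(..n).collect()`) has to allocate a new buffer for the emitted values and shift the remainder, and growing the `Vec` may reallocate and copy everything stored so far. In the sketch above, both growth and emission work one block at a time, so each allocation is bounded by `block_size`.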


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


