Re: [PR] fix: count shared buffers once in hash join build-side memory accounting [datafusion]

via GitHub Wed, 10 Jun 2026 02:52:48 -0700


Samyak2 commented on PR #22862:
URL: https://github.com/apache/datafusion/pull/22862#issuecomment-4668820557


   > 1. I don't like this approach because it's a workaround for the hash 
aggregate issue, which is going to be reworked; additionally, it's not accurate 
memory accounting: when the first batch arrives, it reserves the memory for all 
the subsequent batches; this reservation could fail, meaning that if the 
operator supports spilling, it would spill on every batch (because each batch 
carries the entire memory reservation with it)
   
   Continuing from 
https://github.com/apache/datafusion/issues/22526#issuecomment-4668727017, this 
happens with the current memory tracking too. I see this PR as a strict 
improvement over the current `get_record_batch_memory_size` usage.
   
   >  2. Since GroupedHashAggregateStream could have multiple downstream 
operators, is the plan to change the memory accouting for those as well?
   
   I would imagine so. As an example, the same problem currently exists with 
Repartition over Aggregate. The same fix would be needed in any operator that 
stores a sequence of `RecordBatch`es.
   
   For Repartition specifically, we would need to extend this API to be able to 
remove a record batch too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: count shared buffers once in hash join build-side memory accounting [datafusion]

Reply via email to