Re: [I] Hash aggregation produces batches reporting huge memory size [datafusion]

via GitHub Wed, 10 Jun 2026 02:42:05 -0700


Samyak2 commented on issue #22526:
URL: https://github.com/apache/datafusion/issues/22526#issuecomment-4668727017


   > 1. the memory is reserved before the large record batch is created, so
   >    there's no guarantee that the released memory is sufficient for
   >    reserving the large record batch (well, a slice of this large record
   >    batch, but for memory accounting purposes it doesn't matter)
   >
   > 3. since there's no way to transfer a memory reservation from one
   >    operator to another, other memory pool operations could happen
   >    in-between, so there's no guarantee that if you free N bytes from a
   >    reservation in an operator you could reserve the same N bytes in
   >    another operator
   
   Very valid points, but all of them also apply to the current memory
   tracking, which uses `get_record_batch_memory_size` (in HashJoin,
   Repartition, etc.). Currently, the downstream operator tries to reserve
   far more memory than the aggregation released. What I'm suggesting is
   strictly an improvement over the current behavior. Do you see any of these
   problems being made worse by a solution like
   https://github.com/apache/datafusion/pull/22862?
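
   To make the comparison concrete, here is a minimal sketch in plain Rust. It is only a toy model: `ToyPool`, `ToySlice`, `downstream_reserve` and `interleaved_reserve_fails` are hypothetical names, not DataFusion or arrow-rs APIs, and the byte counts are made up. It assumes the aggregation releases only the bytes of the slice it emits, while the downstream operator reserves either the size of the whole shared buffer or just the sliced size.

```rust
use std::sync::Arc;

/// Toy memory pool with a hard byte limit (hypothetical, not DataFusion's pool).
struct ToyPool {
    limit: usize,
    used: usize,
}

impl ToyPool {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Reserve `bytes`, failing if that would exceed the limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!(
                "cannot reserve {} bytes: {} of {} already in use",
                bytes, self.used, self.limit
            ));
        }
        self.used += bytes;
        Ok(())
    }

    /// Give `bytes` back to the pool.
    fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

/// Toy zero-copy slice: it keeps the whole parent allocation alive.
struct ToySlice {
    buffer: Arc<Vec<u8>>,
    len: usize,
}

impl ToySlice {
    /// Buffer-based accounting: the full shared allocation is reported.
    fn buffer_size(&self) -> usize {
        self.buffer.len()
    }

    /// Slice-aware accounting: only the bytes the slice covers are reported.
    fn sliced_size(&self) -> usize {
        self.len
    }
}

/// The aggregation releases one slice worth of bytes, then the downstream
/// operator reserves the size it believes the slice occupies.
fn downstream_reserve(use_buffer_size: bool) -> Result<usize, String> {
    const BIG: usize = 1_000_000; // bytes in the large output batch (made up)
    const SLICE: usize = 8_192; // bytes in one emitted slice (made up)

    let mut pool = ToyPool::new(BIG + 100_000);
    pool.try_grow(BIG)?; // the aggregation's reservation

    let big = Arc::new(vec![0u8; BIG]);
    let slice = ToySlice { buffer: Arc::clone(&big), len: SLICE };

    pool.shrink(SLICE); // the aggregation releases what it handed out

    let wanted = if use_buffer_size {
        slice.buffer_size()
    } else {
        slice.sliced_size()
    };
    pool.try_grow(wanted)?; // the downstream operator's reservation
    Ok(wanted)
}

/// Point 3 from the quote: bytes freed by one operator can be taken by an
/// unrelated consumer before the next operator reserves them.
fn interleaved_reserve_fails() -> bool {
    let mut pool = ToyPool::new(100);
    pool.try_grow(100).unwrap(); // operator A fills the pool
    pool.shrink(40); // A frees 40 bytes
    pool.try_grow(40).unwrap(); // an unrelated consumer takes them first
    pool.try_grow(40).is_err() // operator B can no longer reserve 40 bytes
}
```

   In this model `downstream_reserve(true)` fails because the downstream operator asks for the full 1,000,000 bytes, while `downstream_reserve(false)` succeeds with 8,192 bytes. `interleaved_reserve_fails()` returns `true` under either accounting scheme, which is the sense in which the race in point 3 exists today as well.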


