LiaCastaneda commented on issue #19386:
URL: https://github.com/apache/datafusion/issues/19386#issuecomment-3696025104

> I don't think it's a memory leak. The record batch size in GroupHashExecStream is 4196909056 bytes, and since we slice the record batch and send it to consumers, each sliced batch's size is still calculated as 4196909056 bytes.
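
For reference, this zero-copy slicing behavior is easy to reproduce with the arrow crate alone. A minimal sketch (the column shape and row count below are illustrative, not taken from this issue):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    // One Int64 column with 1M rows (~8 MB of buffer data).
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from_iter_values(0i64..1_000_000));
    let batch = RecordBatch::try_new(schema, vec![column]).unwrap();

    // Slicing is zero-copy: the 10-row slice still references the full buffers.
    let small = batch.slice(0, 10);

    // get_array_memory_size() walks the underlying buffers, so the tiny
    // slice reports (roughly) the same size as the full batch.
    println!("full batch:   {} bytes", batch.get_array_memory_size());
    println!("10-row slice: {} bytes", small.get_array_memory_size());
}
```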
   
I think compacting would be a straightforward solution; however, in some situations we've observed that compacting has a performance impact (e.g., https://github.com/apache/datafusion/pull/16519). We've seen a similar issue when DataFusion calculates the size of accumulators, and I found that using the Arrow memory pool API solves it (the pool correctly counts unique Arrow buffers). However, I'm not yet sure how to integrate it properly with each operator. I have an experimental draft that addresses the over-accounting issue for accumulators using the memory pool (https://github.com/apache/datafusion/pull/19501), but it's probably not the cleanest solution, and it's isolated to aggregations only, so TopK would need another Arrow memory pool.
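
The draft wires this through Arrow's memory pool, but the underlying idea (count each distinct buffer once, no matter how many slices point at it) can be sketched independently. The helper names below (`unique_buffer_bytes`, `count_unique`) are hypothetical, not DataFusion or arrow-rs APIs, and the sketch ignores validity bitmaps for brevity:

```rust
use std::collections::HashSet;

use arrow::array::{Array, ArrayData};
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: sum the capacities of the distinct Arrow buffers
/// reachable from `batches`, keyed by allocation pointer, so a buffer
/// shared by many slices is only counted once.
fn unique_buffer_bytes(batches: &[RecordBatch]) -> usize {
    let mut seen: HashSet<usize> = HashSet::new();
    let mut total = 0;
    for batch in batches {
        for column in batch.columns() {
            total += count_unique(&column.to_data(), &mut seen);
        }
    }
    total
}

/// Visit an ArrayData node and its children, adding each not-yet-seen
/// buffer's capacity to the running total.
fn count_unique(data: &ArrayData, seen: &mut HashSet<usize>) -> usize {
    let mut total = 0;
    for buffer in data.buffers() {
        // Slices share the parent's allocation, so the data pointer works
        // as an identity key for the underlying buffer.
        if seen.insert(buffer.as_ptr() as usize) {
            total += buffer.capacity();
        }
    }
    for child in data.child_data() {
        total += count_unique(child, seen);
    }
    total
}
```

Getting something like this to plug cleanly into each operator's memory accounting is the integration question above.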
   
   

