LiaCastaneda commented on issue #19386: URL: https://github.com/apache/datafusion/issues/19386#issuecomment-3696025104
> I don't think it's a memory leak. The record batch size in GroupHashExecStream is 4196909056, and since we slice the record batch and send it to consumers, each sliced batch's size is still calculated as 4196909056 bytes.

I think compacting would be a straightforward solution; however, in some situations we've observed that compacting has a performance impact (e.g., https://github.com/apache/datafusion/pull/16519).

We've seen a similar issue when DataFusion calculates the size of accumulators, and I found that using the Arrow memory pool API solves it (it correctly counts unique Arrow Buffers). However, I'm not yet sure how to integrate it properly with each operator. I have an experimental draft that addresses the over-accounting issue for accumulators using the memory pool (https://github.com/apache/datafusion/pull/19501), but it's probably not the cleanest solution, and it's isolated to aggregations only, so TopK would need another Arrow memory pool.
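For concreteness, here is a minimal sketch of the over-accounting (illustrative only, not code from the linked PRs; the schema, column name, and sizes are made up). Slicing a `RecordBatch` in arrow-rs is zero-copy, so `get_array_memory_size` reports the full underlying buffer capacity for every slice, while compacting the slices (here via `concat_batches`) copies each one into right-sized buffers:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from_iter_values(0i64..1_000_000)) as ArrayRef],
    )?;

    // Zero-copy slices: every slice keeps a reference to the full ~8 MB buffer.
    let slices: Vec<RecordBatch> =
        (0..10).map(|i| batch.slice(i * 100_000, 100_000)).collect();

    // Naive accounting charges the shared buffer once per slice,
    // so the sum is roughly 10x the real allocation.
    let naive: usize = slices.iter().map(|b| b.get_array_memory_size()).sum();
    println!("naive total:     {naive} bytes");

    // Compacting copies each slice into its own right-sized buffers, which
    // fixes the accounting but pays for the copy (the performance impact
    // observed in the PR linked above).
    let mut compacted = 0usize;
    for b in &slices {
        compacted += concat_batches(&schema, [b])?.get_array_memory_size();
    }
    println!("compacted total: {compacted} bytes");
    Ok(())
}
```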
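And on the "counts unique Arrow Buffers" point, a rough sketch of the deduplication idea (a hypothetical helper, not the actual memory pool API used in the draft PR): key each buffer by its data pointer and count its capacity only once, so N slices of one allocation are charged once.

```rust
use std::collections::HashSet;

use arrow::array::Array;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: sums buffer capacities across batches, charging each
/// distinct allocation once by deduplicating on the buffer's data pointer.
/// (Simplification: ignores null buffers and nested `child_data()`, and
/// buffers that are themselves re-sliced views could still be double counted;
/// real accounting would need to be more careful.)
fn deduped_memory_size(batches: &[RecordBatch]) -> usize {
    let mut seen: HashSet<*const u8> = HashSet::new();
    let mut total = 0usize;
    for batch in batches {
        for array in batch.columns() {
            let data = array.to_data();
            for buffer in data.buffers() {
                // Sliced arrays share buffers, so the pointer identifies the
                // underlying allocation across all slices that reference it.
                if seen.insert(buffer.as_ptr()) {
                    total += buffer.capacity();
                }
            }
        }
    }
    total
}
```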
