bharath-techie commented on issue #19386: URL: https://github.com/apache/datafusion/issues/19386#issuecomment-3675113488
I don't think it's a memory leak. The record batch size in `GroupHashExecStream` is `4196909056` bytes, and since we slice the record batch and send the slices to consumers, each sliced batch's size is still calculated as `4196909056` bytes:

```
Record batch size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 4196909056
size during get : 4196909284
Record batch size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 8393818112
size during get : 8393818340
Record batch size during insert : 4196909056
self size : 48 , capacity : 3 , heap size : 60, batch size : 12590727168
size during get : 12590727396
```

See https://github.com/apache/datafusion/issues/9562 for more background:

> As we see in https://github.com/apache/datafusion/issues/9417, if there are upstream operators like TopK that hold references to any of these sliced RecordBatches, those slices are treated as though they were an additional allocation that needs to be tracked ([source](https://github.com/apache/arrow-datafusion/blob/e642cc2a94f38518d765d25c8113523aedc29198/datafusion/physical-plan/src/topk/mod.rs#L576))
