LiaCastaneda commented on issue #16841: URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3563643947
👋 I was looking into this issue and thinking around ways to integrate the new arrow memory tracking API to the operators, instead of doing `compact()` or `get_slice_memory_size()` to avoid overaccounting. Scoping it down to `GroupedHashAggregateStream` for simplicity (and since it is where people have seen this issue the most), I imagine we could include the Arrow MemoryPool `arrow_pool: Arc<TrackingMemoryPool>,` in the struct in order to track the actual `Buffer` size in this operator overall, then we wouldn't need to call `size()` of each accumulator in [update_memory_reservation](https://github.com/apache/datafusion/blob/78a4bbab6f329187cdc6e417be1e48bb22cf29fe/datafusion/physical-plan/src/aggregates/row_hash.rs#L972): `let acc = self.accumulators.iter().map(|x| x.size()).sum::<usize>(); `<- where some accumulators might actually share the same underlying `Buffer`. I would even argue that accumulators used by `GroupedHashAggregateStream` don't need a `size()` function, since at the end what holds the accumulators is `GroupedHashAggregateStream` and acumulators within that operator won't handle more memory than batches processed while this stream polls. So I imagine a flow like: 1. a RecordBatch arrives in group_aggregate_batch() 2. Before processing: Claim all buffers from the batch into arrow_pool (using bytes.claim(&arrow_pool)) Process the batch normally, creating the ScalarValues and updating accumulators (the ScalarValues reference the already-claimed buffers) 2. Everytime we call update_memory_reservation(): Use `arrow_pool.used()` to get the actual buffer memory, I guess here we should also include anyother overhead memory I think this is more or less what `compact()` does but without copying data and from the operator level, not from the accumulator level. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
