2010YOUY01 commented on issue #16244: URL: https://github.com/apache/datafusion/issues/16244#issuecomment-2990112035
> > A quick reminder for someone who is willing to implement it: it's possible that multiple `Array`s share the same underlying buffer -- those `Array`s can be within the same batch or across batches. The current memory accounting utilities only detect buffer reuse within a batch (not across batches), so an extension might be needed.
> >
> > [`datafusion/physical-plan/src/spill/mod.rs`, line 239 at d68111d](https://github.com/apache/datafusion/blob/d68111df51980a691dfa8746e34566126858558f/datafusion/physical-plan/src/spill/mod.rs#L239):
> >
> > ```rust
> > pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {
> > ```
>
> I'm wondering about a good approach here. `get_record_batch_memory_size` uses the data pointers of the buffers to deduplicate. Within a single record batch, that doesn't seem problematic, but I'm not sure we can use the same approach across multiple record batches. If I understand correctly, those record batches do not have the same lifetime, so the same address could be allocated twice and filled with different data. (Can `Buffer` objects themselves be reused with different data?) In that case, we would want to count the pointer/buffer twice. That said, this scenario might happen infrequently enough for us not to care.
>
> There's also some discussion around this going on over at [apache/arrow-rs#6439](https://github.com/apache/arrow-rs/issues/6439), which makes it sound like _accurately measuring bytes_ in this scenario is a bigger issue. [@2010YOUY01](https://github.com/2010YOUY01), do you have thoughts on this?
>
> An alternative approach could be to calculate `output_bytes` based on the actual number of bytes occupied on the buffer by the array. This should be accurate but might be expensive.

I agree with the concerns about the buffer lifetime across batches.
Another problem might be: if the input is very large, this hashset-based solution might cause a noticeable slowdown. I think we need an arrow-side solution that solves this issue better. The refcount solution seems like a good idea to me, but it requires more effort to move forward: https://github.com/apache/arrow-rs/pull/6590

For DataFusion, maybe the best thing to do now is simply `sum(get_record_batch_memory_size())`. This should work for common cases; we can open a GitHub issue to explain the limitation and track future TODOs to make it accurate.