hendrikmakait commented on issue #16244: URL: https://github.com/apache/datafusion/issues/16244#issuecomment-2988800750
> A quick reminder for someone who is willing to implement it: It's possible that multiple `Array`s share the same underlying buffer -- those `Array`s can be within the same batch or across batches. The current memory accounting utilities only support detect buffer reuse within the batch (not across batches), so a extension to this might be needed. > > [datafusion/datafusion/physical-plan/src/spill/mod.rs](https://github.com/apache/datafusion/blob/d68111df51980a691dfa8746e34566126858558f/datafusion/physical-plan/src/spill/mod.rs#L239) > > Line 239 in [d68111d](/apache/datafusion/commit/d68111df51980a691dfa8746e34566126858558f) > > pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize { I'm wondering about a good approach here. `get_record_batch_memory_size` uses the data pointers of the buffers to deduplicate. Within a single record batch, that doesn't seem problematic. I'm not sure if we can use the same approach across multiple record batches. If I understand correctly, those record batches do not have the same lifetime, so the same pointer could be allocated twice and filled with different data. (Can Buffer objects themselves be reused with different data?) In this case, we want to count the pointer twice. That being said, this scenario might happen infrequently enough for us to not care. There's also some discussion around this going on over at https://github.com/apache/arrow-rs/issues/6439 which makes it sound like _accurately measuring bytes_ in this scenario is a bigger issue. @2010YOUY01, do you have thoughts on this? An alternative approach could be to calculate `output_bytes` based on the actual number of bytes occupied on the buffer by the array. This should be accurate but might be expensive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org