hendrikmakait commented on issue #16244:
URL: https://github.com/apache/datafusion/issues/16244#issuecomment-2988800750

   > A quick reminder for someone who is willing to implement it: It's possible 
that multiple `Array`s share the same underlying buffer -- those `Array`s can 
be within the same batch or across batches. The current memory accounting 
utilities only support detect buffer reuse within the batch (not across 
batches), so a extension to this might be needed.
   > 
   > 
[datafusion/datafusion/physical-plan/src/spill/mod.rs](https://github.com/apache/datafusion/blob/d68111df51980a691dfa8746e34566126858558f/datafusion/physical-plan/src/spill/mod.rs#L239)
   > 
   > Line 239 in 
[d68111d](/apache/datafusion/commit/d68111df51980a691dfa8746e34566126858558f)
   > 
   >  pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {
   
   I'm wondering about a good approach here. `get_record_batch_memory_size` 
uses the data pointers of the buffers to deduplicate. Within a single record 
batch, that doesn't seem problematic. I'm not sure if we can use the same 
approach across multiple record batches. If I understand correctly, those 
record batches do not have the same lifetime, so the same pointer could be 
allocated twice and filled with different data. (Can Buffer objects themselves 
be reused with different data?) In this case, we want to count the pointer 
twice. That being said, this scenario might happen infrequently enough for us 
to not care.
   
   There's also some discussion around this going on over at 
https://github.com/apache/arrow-rs/issues/6439 which makes it sound like 
_accurately measuring bytes_ in this scenario is a bigger issue. @2010YOUY01, 
do you have thoughts on this?
   
   An alternative approach could be to calculate `output_bytes` based on the 
actual number of bytes occupied on the buffer by the array. This should be 
accurate but might be expensive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to