2010YOUY01 commented on issue #16244:
URL: https://github.com/apache/datafusion/issues/16244#issuecomment-2990112035

   > > A quick reminder for someone who is willing to implement it: it's possible that multiple `Array`s share the same underlying buffer -- those `Array`s can be within the same batch or across batches. The current memory accounting utilities only support detecting buffer reuse within a batch (not across batches), so an extension to this might be needed.
   > > 
[datafusion/datafusion/physical-plan/src/spill/mod.rs](https://github.com/apache/datafusion/blob/d68111df51980a691dfa8746e34566126858558f/datafusion/physical-plan/src/spill/mod.rs#L239)
   > > Line 239 in 
[d68111d](/apache/datafusion/commit/d68111df51980a691dfa8746e34566126858558f)
   > > pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize {
   > 
   > I'm wondering about a good approach here. `get_record_batch_memory_size` 
uses the data pointers of the buffers to deduplicate. Within a single record 
batch, that doesn't seem problematic. I'm not sure if we can use the same 
approach across multiple record batches. If I understand correctly, those 
record batches do not have the same lifetime, so the same pointer could be 
allocated twice and filled with different data. (Can Buffer objects themselves 
be reused with different data?) In this case, we want to count the 
pointer/buffer twice. That being said, this scenario might happen infrequently 
enough for us to not care.
   > 
   > There's also some discussion around this going on over at 
[apache/arrow-rs#6439](https://github.com/apache/arrow-rs/issues/6439) which 
makes it sound like _accurately measuring bytes_ in this scenario is a bigger 
issue. [@2010YOUY01](https://github.com/2010YOUY01), do you have thoughts on 
this?
   > 
   > An alternative approach could be to calculate `output_bytes` based on the 
actual number of bytes occupied on the buffer by the array. This should be 
accurate but might be expensive.
   
   I agree with the concerns about buffer lifetimes across batches. Another problem: if the input is very large, this hashset-based solution might cause a noticeable slowdown.
   
   I think we need an arrow-side solution that better solves this issue. The refcount approach in https://github.com/apache/arrow-rs/pull/6590 seems like a good idea to me, but it requires more effort to move forward.
   
   For now in DataFusion, maybe the best thing to do is simply `sum(get_record_batch_memory_size())`; this should work for common cases. We can open a GitHub issue to explain the limitation and track future TODOs for making it accurate.
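   
   Using the same hypothetical `Arc<Vec<u8>>` buffer model as above (the real code would call `get_record_batch_memory_size` on each `RecordBatch`), the interim per-batch sum and its known limitation look like:
   
   ```rust
   use std::collections::HashSet;
   use std::sync::Arc;
   
   // Hypothetical stand-in for an Arrow buffer (Arc-shared allocation).
   type Buffer = Arc<Vec<u8>>;
   
   // Per-batch dedup by data pointer, mirroring what
   // `get_record_batch_memory_size` does within one batch.
   fn batch_memory_size(buffers: &[Buffer]) -> usize {
       let mut seen: HashSet<*const u8> = HashSet::new();
       buffers
           .iter()
           .filter(|b| seen.insert(b.as_ptr()))
           .map(|b| b.len())
           .sum()
   }
   
   // The suggested interim approach: sum per-batch sizes. A buffer shared
   // *across* batches is counted once per batch, so this over-reports in
   // that (presumably rare) case -- the limitation to document in the issue.
   fn total_memory_size(batches: &[Vec<Buffer>]) -> usize {
       batches.iter().map(|b| batch_memory_size(b)).sum()
   }
   
   fn main() {
       let shared = Arc::new(vec![0u8; 100]);
       let batch_a = vec![shared.clone(), shared.clone()]; // intra-batch reuse: 100
       let batch_b = vec![shared.clone()];                 // counted again: +100
       assert_eq!(total_memory_size(&[batch_a, batch_b]), 200);
   }
   ```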


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
