2010YOUY01 commented on code in PR #13377: URL: https://github.com/apache/datafusion/pull/13377#discussion_r1841758359
########## datafusion/physical-plan/src/spill.rs: ########## @@ -109,10 +111,80 @@ pub fn spill_record_batch_by_size( Ok(()) } +/// Calculate total used memory of this batch. +/// +/// This function is used to estimate the physical memory usage of the `RecordBatch`. The implementation will add up all unique `Buffer`'s memory +/// size, due to: +/// - The data pointer inside `Buffer` are memory regions returned by global memory +/// allocator, those regions can't have overlap. +/// - The actual used range of `ArrayRef`s inside `RecordBatch` can have overlap +/// or reuse the same `Buffer`. For example: taking a slice from `Array`. +/// +/// Example: +/// For a `RecordBatch` with two columns: `col1` and `col2`, two columns are pointing +/// to a sub-region of the same buffer. +/// +/// {xxxxxxxxxxxxxxxxxxx} <--- buffer +/// ^ ^ ^ ^ +/// | | | | +/// col1->{ } | | +/// col2--------->{ } +/// +/// In the above case, `get_record_batch_memory_size` will return the size of +/// the buffer, instead of the sum of `col1` and `col2`'s actual memory size. +/// +/// Note: Current `RecordBatch`.get_array_memory_size()` will double count the +/// buffer memory size if multiple arrays within the batch are sharing the same +/// `Buffer`. This method provides temporary fix until the issue is resolved: +/// <https://github.com/apache/arrow-rs/issues/6439> +pub fn get_record_batch_memory_size(batch: &RecordBatch) -> usize { Review Comment: I think they should all be changed, however after changing them in `TopK`, some existing test cases might be tricky to fix, and more end-to-end tests should be added. So I plan to do it incrementally. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org