ctsk commented on code in PR #17315:
URL: https://github.com/apache/datafusion/pull/17315#discussion_r2306281608


##########
datafusion/physical-plan/src/spill/spill_manager.rs:
##########
@@ -194,7 +195,84 @@ impl GetSlicedSize for RecordBatch {
         for array in self.columns() {
             let data = array.to_data();
             total += data.get_slice_memory_size()?;
+
+            // While StringViewArray holds large data buffer for non inlined 
string, the Arrow layout (BufferSpec)
+            // does not include any data buffers. Currently, 
ArrayData::get_slice_memory_size()
+            // under-counts memory size by accounting only views buffer 
although data buffer is cloned during slice()
+            //
+            // Therefore, we manually add the sum of the lengths used by all 
non inlined views
+            // on top of the sliced size for views buffer. This matches the 
intended semantics of
+            // "bytes needed if we materialized exactly this slice into fresh 
buffers".
+            // Note: if multiple arrays share the same data buffers, we may 
double count each StringViewArray.

Review Comment:
   It is not guaranteed that the buffers inside a StringViewArray are distinct 
:/
   
   Or to be more precise: It does *definitely* occur that a StringViewArray 
holds multiple references to the same buffer due, because the concat kernel 
does not ensure that buffers are unique. One common case is when datafusion 
processes a join, and then coalesces the batches that are emitted from the join.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to