Re: [PR] Fix massive spill files for StringView/BinaryView columns II [datafusion]

via GitHub Thu, 16 Apr 2026 11:53:15 -0700


alamb commented on code in PR #21633:
URL: https://github.com/apache/datafusion/pull/21633#discussion_r3095575364



##########
datafusion/physical-plan/src/spill/mod.rs:
##########
@@ -344,6 +349,174 @@ fn get_max_alignment_for_schema(schema: &Schema) -> usize 
{
     max_alignment
 }
 
+/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
+/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer 
ID/offset.
+const VIEW_SIZE_BYTES: usize = 16;

Review Comment:
   Related constant here: 
https://docs.rs/arrow-data/58.1.0/arrow_data/constant.MAX_INLINE_VIEW_LEN.html
   
   
   
   I think arrow uses  `std::mem::size_of<u128>` for this value as each view is 
a u128
   



##########
datafusion/physical-plan/src/spill/in_progress_spill_file.rs:
##########
@@ -51,16 +54,25 @@ impl InProgressSpillFile {
 
     /// Appends a `RecordBatch` to the spill file, initializing the writer if 
necessary.
     ///
+    /// Before writing, performs GC on StringView/BinaryView arrays to compact 
backing

Review Comment:
   FWIW I think the same general approach might be useful / needed for other 
"view" type arrays -- I am thinking sliced LIstView for example as well as 
sliced ListArray and sliced Utf8 🤔 
   
   Maybe as a follow on PR



##########
datafusion/physical-plan/src/spill/mod.rs:
##########
@@ -344,6 +349,174 @@ fn get_max_alignment_for_schema(schema: &Schema) -> usize 
{
     max_alignment
 }
 
+/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
+/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer 
ID/offset.
+const VIEW_SIZE_BYTES: usize = 16;
+
+/// Performs garbage collection on StringView and BinaryView arrays before 
spilling to reduce memory usage.
+///
+/// # Why GC is needed
+///
+/// StringView and BinaryView arrays can accumulate significant memory waste 
when sliced.
+/// When a large array is sliced (e.g., taking first 100 rows of 1000), the 
view array
+/// still references the original data buffers containing all 1000 rows of 
data.
+///
+/// For example, in the ClickBench benchmark (issue #19414), repeated slicing 
of StringView
+/// arrays resulted in 820MB of spill files that could be reduced to just 33MB 
after GC -
+/// a 96% reduction in size.
+///
+/// # How it works
+///
+/// The GC process:
+/// 1. Identifies view arrays (StringView/BinaryView) in the batch
+/// 2. Checks if their data buffers exceed a memory threshold
+/// 3. If exceeded, calls the Arrow `gc()` method which creates new compact 
buffers
+///    containing only the data referenced by the current views
+/// 4. Returns a new batch with GC'd arrays (or original arrays if GC not 
needed)
+///
+/// # When GC is triggered
+///
+/// GC is only performed when data buffers exceed a threshold (currently 10KB).
+/// This balances memory savings against the CPU overhead of garbage 
collection.
+/// Small arrays are passed through unchanged since the GC overhead would 
exceed
+/// any memory savings.
+///
+/// # Performance considerations
+///
+/// - If no view arrays need compaction, the original batch is cloned cheaply
+/// - GC is skipped for small buffers to avoid unnecessary CPU overhead
+/// - Nested container types are traversed recursively so view arrays inside
+///   `List`, `Map`, `Union`, `Dictionary`, and other child-bearing arrays are 
compacted too
+/// - The Arrow `gc()` method itself is optimized and only copies referenced 
data
+pub(crate) fn gc_view_arrays(batch: &RecordBatch) -> Result<RecordBatch> {
+    let mut mutated = false;
+    let mut new_columns: Vec<Arc<dyn Array>> = 
Vec::with_capacity(batch.num_columns());
+
+    for array in batch.columns() {
+        let (gc_array, array_mutated) = gc_array(array)?;
+        mutated |= array_mutated;
+        new_columns.push(gc_array);
+    }
+
+    if mutated {
+        Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
+    } else {
+        Ok(batch.clone())
+    }
+}
+
+fn gc_array(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
+    match array.data_type() {
+        DataType::Utf8View => {
+            let string_view = array
+                .as_any()
+                .downcast_ref::<StringViewArray>()
+                .expect("Utf8View array should downcast to StringViewArray");
+            if should_gc_view_array(string_view) {
+                Ok((Arc::new(string_view.gc()) as ArrayRef, true))
+            } else {
+                Ok((Arc::clone(array), false))
+            }
+        }
+        DataType::BinaryView => {
+            let binary_view = array
+                .as_any()
+                .downcast_ref::<BinaryViewArray>()
+                .expect("BinaryView array should downcast to BinaryViewArray");
+            if should_gc_view_array(binary_view) {
+                Ok((Arc::new(binary_view.gc()) as ArrayRef, true))
+            } else {
+                Ok((Arc::clone(array), false))
+            }
+        }
+        _ => gc_array_children(array),
+    }
+}
+
+fn gc_array_children(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
+    let data = array.to_data();

Review Comment:
   FWIW to_data is not free (it allocates a vec, etc)
   
   But I see the need to traverse a nested array and gc the whole thing. Short 
of adding specific code for each array type I think using ArrayData is the best 
we can do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix massive spill files for StringView/BinaryView columns II [datafusion]

Reply via email to