Re: [PR] Fix massive spill files for StringView/BinaryView columns II [datafusion]

via GitHub Wed, 15 Apr 2026 01:54:10 -0700


martin-g commented on code in PR #21633:
URL: https://github.com/apache/datafusion/pull/21633#discussion_r3085218785



##########
datafusion/physical-plan/src/spill/mod.rs:
##########
@@ -344,6 +346,153 @@ fn get_max_alignment_for_schema(schema: &Schema) -> usize 
{
     max_alignment
 }
 
+/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
+/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer 
ID/offset.
+const VIEW_SIZE_BYTES: usize = 16;
+
+/// Performs garbage collection on StringView and BinaryView arrays before 
spilling to reduce memory usage.
+///
+/// # Why GC is needed
+///
+/// StringView and BinaryView arrays can accumulate significant memory waste 
when sliced.
+/// When a large array is sliced (e.g., taking first 100 rows of 1000), the 
view array
+/// still references the original data buffers containing all 1000 rows of 
data.
+///
+/// For example, in the ClickBench benchmark (issue #19414), repeated slicing 
of StringView
+/// arrays resulted in 820MB of spill files that could be reduced to just 33MB 
after GC -
+/// a 96% reduction in size.
+///
+/// # How it works
+///
+/// The GC process:
+/// 1. Identifies view arrays (StringView/BinaryView) in the batch
+/// 2. Checks if their data buffers exceed a memory threshold
+/// 3. If exceeded, calls the Arrow `gc()` method which creates new compact 
buffers
+///    containing only the data referenced by the current views
+/// 4. Returns a new batch with GC'd arrays (or original arrays if GC not 
needed)
+///
+/// # When GC is triggered
+///
+/// GC is only performed when data buffers exceed a threshold (currently 10KB).
+/// This balances memory savings against the CPU overhead of garbage 
collection.
+/// Small arrays are passed through unchanged since the GC overhead would 
exceed
+/// any memory savings.
+///
+/// # Performance considerations
+///
+/// The function always returns a new RecordBatch for API consistency, but:
+/// - If no view arrays are present, it's a cheap clone (just Arc increments)
+/// - GC is skipped for small buffers to avoid unnecessary CPU overhead
+/// - The Arrow `gc()` method itself is optimized and only copies referenced 
data
+pub(crate) fn gc_view_arrays(batch: &RecordBatch) -> Result<RecordBatch> {
+    // Early return optimization: Skip GC entirely if the batch contains no 
view arrays.
+    // This avoids unnecessary processing for batches with only primitive 
types.
+    let has_view_arrays = batch.columns().iter().any(|array| {
+        matches!(
+            array.data_type(),
+            arrow::datatypes::DataType::Utf8View | 
arrow::datatypes::DataType::BinaryView

Review Comment:
   What about nested types (List, Map, Union, Dictionary) which contain these 
views ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix massive spill files for StringView/BinaryView columns II [datafusion]

Reply via email to