martin-g commented on code in PR #21633:
URL: https://github.com/apache/datafusion/pull/21633#discussion_r3085218785
##########
datafusion/physical-plan/src/spill/mod.rs:
##########
@@ -344,6 +346,153 @@ fn get_max_alignment_for_schema(schema: &Schema) -> usize
{
max_alignment
}
+/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
+/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer
ID/offset.
+const VIEW_SIZE_BYTES: usize = 16;
+
+/// Performs garbage collection on StringView and BinaryView arrays before
spilling to reduce memory usage.
+///
+/// # Why GC is needed
+///
+/// StringView and BinaryView arrays can accumulate significant memory waste
when sliced.
+/// When a large array is sliced (e.g., taking first 100 rows of 1000), the
view array
+/// still references the original data buffers containing all 1000 rows of
data.
+///
+/// For example, in the ClickBench benchmark (issue #19414), repeated slicing
of StringView
+/// arrays resulted in 820MB of spill files that could be reduced to just 33MB
after GC -
+/// a 96% reduction in size.
+///
+/// # How it works
+///
+/// The GC process:
+/// 1. Identifies view arrays (StringView/BinaryView) in the batch
+/// 2. Checks if their data buffers exceed a memory threshold
+/// 3. If exceeded, calls the Arrow `gc()` method which creates new compact
buffers
+/// containing only the data referenced by the current views
+/// 4. Returns a new batch with GC'd arrays (or original arrays if GC not
needed)
+///
+/// # When GC is triggered
+///
+/// GC is only performed when data buffers exceed a threshold (currently 10KB).
+/// This balances memory savings against the CPU overhead of garbage
collection.
+/// Small arrays are passed through unchanged since the GC overhead would
exceed
+/// any memory savings.
+///
+/// # Performance considerations
+///
+/// The function always returns a new RecordBatch for API consistency, but:
+/// - If no view arrays are present, it's a cheap clone (just Arc increments)
+/// - GC is skipped for small buffers to avoid unnecessary CPU overhead
+/// - The Arrow `gc()` method itself is optimized and only copies referenced
data
+pub(crate) fn gc_view_arrays(batch: &RecordBatch) -> Result<RecordBatch> {
+ // Early return optimization: Skip GC entirely if the batch contains no
view arrays.
+ // This avoids unnecessary processing for batches with only primitive
types.
+ let has_view_arrays = batch.columns().iter().any(|array| {
+ matches!(
+ array.data_type(),
+ arrow::datatypes::DataType::Utf8View |
arrow::datatypes::DataType::BinaryView
Review Comment:
What about nested types (List, Map, Union, Dictionary) which contain these
views ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]