alamb commented on code in PR #21633:
URL: https://github.com/apache/datafusion/pull/21633#discussion_r3095575364
##########
datafusion/physical-plan/src/spill/mod.rs:
##########
@@ -344,6 +349,174 @@ fn get_max_alignment_for_schema(schema: &Schema) -> usize
{
max_alignment
}
+/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
+/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer
ID/offset.
+const VIEW_SIZE_BYTES: usize = 16;
Review Comment:
Related constant here:
https://docs.rs/arrow-data/58.1.0/arrow_data/constant.MAX_INLINE_VIEW_LEN.html
I think arrow uses `std::mem::size_of<u128>` for this value as each view is
a u128
##########
datafusion/physical-plan/src/spill/in_progress_spill_file.rs:
##########
@@ -51,16 +54,25 @@ impl InProgressSpillFile {
/// Appends a `RecordBatch` to the spill file, initializing the writer if
necessary.
///
+ /// Before writing, performs GC on StringView/BinaryView arrays to compact
backing
Review Comment:
FWIW I think the same general approach might be useful / needed for other
"view" type arrays -- I am thinking sliced LIstView for example as well as
sliced ListArray and sliced Utf8 🤔
Maybe as a follow on PR
##########
datafusion/physical-plan/src/spill/mod.rs:
##########
@@ -344,6 +349,174 @@ fn get_max_alignment_for_schema(schema: &Schema) -> usize
{
max_alignment
}
+/// Size of a single view structure in StringView/BinaryView arrays (in bytes).
+/// Each view is 16 bytes: 4 bytes length + 4 bytes prefix + 8 bytes buffer
ID/offset.
+const VIEW_SIZE_BYTES: usize = 16;
+
+/// Performs garbage collection on StringView and BinaryView arrays before
spilling to reduce memory usage.
+///
+/// # Why GC is needed
+///
+/// StringView and BinaryView arrays can accumulate significant memory waste
when sliced.
+/// When a large array is sliced (e.g., taking first 100 rows of 1000), the
view array
+/// still references the original data buffers containing all 1000 rows of
data.
+///
+/// For example, in the ClickBench benchmark (issue #19414), repeated slicing
of StringView
+/// arrays resulted in 820MB of spill files that could be reduced to just 33MB
after GC -
+/// a 96% reduction in size.
+///
+/// # How it works
+///
+/// The GC process:
+/// 1. Identifies view arrays (StringView/BinaryView) in the batch
+/// 2. Checks if their data buffers exceed a memory threshold
+/// 3. If exceeded, calls the Arrow `gc()` method which creates new compact
buffers
+/// containing only the data referenced by the current views
+/// 4. Returns a new batch with GC'd arrays (or original arrays if GC not
needed)
+///
+/// # When GC is triggered
+///
+/// GC is only performed when data buffers exceed a threshold (currently 10KB).
+/// This balances memory savings against the CPU overhead of garbage
collection.
+/// Small arrays are passed through unchanged since the GC overhead would
exceed
+/// any memory savings.
+///
+/// # Performance considerations
+///
+/// - If no view arrays need compaction, the original batch is cloned cheaply
+/// - GC is skipped for small buffers to avoid unnecessary CPU overhead
+/// - Nested container types are traversed recursively so view arrays inside
+/// `List`, `Map`, `Union`, `Dictionary`, and other child-bearing arrays are
compacted too
+/// - The Arrow `gc()` method itself is optimized and only copies referenced
data
+pub(crate) fn gc_view_arrays(batch: &RecordBatch) -> Result<RecordBatch> {
+ let mut mutated = false;
+ let mut new_columns: Vec<Arc<dyn Array>> =
Vec::with_capacity(batch.num_columns());
+
+ for array in batch.columns() {
+ let (gc_array, array_mutated) = gc_array(array)?;
+ mutated |= array_mutated;
+ new_columns.push(gc_array);
+ }
+
+ if mutated {
+ Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
+ } else {
+ Ok(batch.clone())
+ }
+}
+
+fn gc_array(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
+ match array.data_type() {
+ DataType::Utf8View => {
+ let string_view = array
+ .as_any()
+ .downcast_ref::<StringViewArray>()
+ .expect("Utf8View array should downcast to StringViewArray");
+ if should_gc_view_array(string_view) {
+ Ok((Arc::new(string_view.gc()) as ArrayRef, true))
+ } else {
+ Ok((Arc::clone(array), false))
+ }
+ }
+ DataType::BinaryView => {
+ let binary_view = array
+ .as_any()
+ .downcast_ref::<BinaryViewArray>()
+ .expect("BinaryView array should downcast to BinaryViewArray");
+ if should_gc_view_array(binary_view) {
+ Ok((Arc::new(binary_view.gc()) as ArrayRef, true))
+ } else {
+ Ok((Arc::clone(array), false))
+ }
+ }
+ _ => gc_array_children(array),
+ }
+}
+
+fn gc_array_children(array: &ArrayRef) -> Result<(ArrayRef, bool)> {
+ let data = array.to_data();
Review Comment:
FWIW to_data is not free (it allocates a vec, etc)
But I see the need to traverse a nested array and gc the whole thing. Short
of adding specific code for each array type I think using ArrayData is the best
we can do
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]