alamb commented on code in PR #11587: URL: https://github.com/apache/datafusion/pull/11587#discussion_r1687100275
########## datafusion/physical-plan/src/coalesce_batches.rs: ########## @@ -290,6 +294,46 @@ pub fn concat_batches( arrow::compute::concat_batches(schema, batches) } +/// `StringViewArray` reference to the raw parquet decoded buffer, which reduces copy but prevents those buffer from being released. +/// When `StringViewArray`'s cardinality significantly drops (e.g., after `FilterExec` or `HashJoinExec` or many others), +/// we should consider consolidating it so that we can release the buffer to reduce memory usage and improve string locality for better performance. +fn gc_string_view_batch(batch: &RecordBatch) -> RecordBatch { + let new_columns: Vec<ArrayRef> = batch + .columns() + .iter() + .map(|c| { + // Try to re-create the `StringViewArray` to prevent holding the underlying buffer too long. + if let Some(s) = c.as_string_view_opt() { Review Comment: A minor comment here is that I think you can reduce the level of nesting by using syntax like ```rust let Some(s) = c.as_string_view_opt() else { return Arc::clone(c) } ``` ########## datafusion/physical-plan/src/coalesce_batches.rs: ########## @@ -216,6 +218,8 @@ impl CoalesceBatchesStream { match input_batch { Poll::Ready(x) => match x { Some(Ok(batch)) => { + let batch = gc_string_view_batch(&batch); Review Comment: This is an excellent point I think given how concat is implemented for `StringView` it will only copy the fixed parts (not the actual string data) Perhaps what we could do is implement a wrapper around arrow::concat_batches that has the datafusion specific GC trigger for sparse arrays, and falls back to concat for other types: https://docs.rs/arrow-select/52.1.0/src/arrow_select/concat.rs.html#150 ```rust /// wrapper around [`arrow::compute::concat`] that pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> { // loop over columns here and handle StringView specially, // or fallback to concat } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org