alamb commented on code in PR #11587:
URL: https://github.com/apache/datafusion/pull/11587#discussion_r1687100275


##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -290,6 +294,46 @@ pub fn concat_batches(
     arrow::compute::concat_batches(schema, batches)
 }
 
+/// `StringViewArray` reference to the raw parquet decoded buffer, which 
reduces copy but prevents those buffer from being released.
+/// When `StringViewArray`'s cardinality significantly drops (e.g., after 
`FilterExec` or `HashJoinExec` or many others),
+/// we should consider consolidating it so that we can release the buffer to 
reduce memory usage and improve string locality for better performance.
+fn gc_string_view_batch(batch: &RecordBatch) -> RecordBatch {
+    let new_columns: Vec<ArrayRef> = batch
+        .columns()
+        .iter()
+        .map(|c| {
+            // Try to re-create the `StringViewArray` to prevent holding the 
underlying buffer too long.
+            if let Some(s) = c.as_string_view_opt() {

Review Comment:
   A minor comment here is that I think you can reduce the level of nesting by 
using syntax like
   
   ```rust
   let Some(s) = c.as_string_view_opt() else {
     return Arc::clone(c)
   }
   ```
     



##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -216,6 +218,8 @@ impl CoalesceBatchesStream {
             match input_batch {
                 Poll::Ready(x) => match x {
                     Some(Ok(batch)) => {
+                        let batch = gc_string_view_batch(&batch);

Review Comment:
   This is an excellent point
   
   I think given how concat is implemented for `StringView` it will only copy 
the fixed parts (not the actual string data)
   
   Perhaps what we could do is implement a wrapper around arrow::concat_batches 
that has the datafusion specific GC trigger for sparse arrays, and falls back 
to concat for other types: 
https://docs.rs/arrow-select/52.1.0/src/arrow_select/concat.rs.html#150
   
   ```rust
   /// wrapper around [`arrow::compute::concat`] that 
   pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {
    // loop over columns here and handle StringView specially, 
    // or fallback to concat
    }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to