XiangpengHao commented on code in PR #11587:
URL: https://github.com/apache/datafusion/pull/11587#discussion_r1686531840


##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -216,6 +218,41 @@ impl CoalesceBatchesStream {
             match input_batch {
                 Poll::Ready(x) => match x {
                     Some(Ok(batch)) => {
+                        let new_columns: Vec<Arc<dyn Array>> = batch
+                            .columns()
+                            .iter()
+                            .map(|c| {
+                                // Try to re-create the `StringViewArray` to 
prevent holding the underlying buffer too long.
+                                if let Some(s) = c.as_string_view_opt() {
+                                    let view_cnt = s.views().len();
+                                    let buffer_size = 
s.get_buffer_memory_size();
+
+                                    // Re-creating the array copies data and 
can be time consuming.
+                                    // We only do it if the array is sparse, 
below is a heuristic to determine if the array is sparse.
+                                    if buffer_size > (view_cnt * 32) {

Review Comment:
   Yes, this is a rough heuristic to test whether the buffer is dense.
   In fact, there's no way for us to know if anything has been filtered in 
`CoalesceBatchesExec` -- we only know this in downstream operators like 
`FilterExec`. The fact that we have a `CoalesceBatchesExec` here means that it 
is likely that the batch has been sparse and thus requires gc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to