XiangpengHao commented on code in PR #11587: URL: https://github.com/apache/datafusion/pull/11587#discussion_r1686531840
########## datafusion/physical-plan/src/coalesce_batches.rs: ########## @@ -216,6 +218,41 @@ impl CoalesceBatchesStream { match input_batch { Poll::Ready(x) => match x { Some(Ok(batch)) => { + let new_columns: Vec<Arc<dyn Array>> = batch + .columns() + .iter() + .map(|c| { + // Try to re-create the `StringViewArray` to prevent holding the underlying buffer too long. + if let Some(s) = c.as_string_view_opt() { + let view_cnt = s.views().len(); + let buffer_size = s.get_buffer_memory_size(); + + // Re-creating the array copies data and can be time consuming. + // We only do it if the array is sparse, below is a heuristic to determine if the array is sparse. + if buffer_size > (view_cnt * 32) { Review Comment: Yes, this is a rough heuristic to test whether the buffer is dense. In fact, there's no way for us to know if anything has been filtered in `CoalesceBatchesExec` -- we only know this in downstream operators like `FilterExec`. The fact that we have a `CoalesceBatchesExec` here means that it is likely that the batch has been sparse and thus requires gc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org