2010YOUY01 commented on code in PR #11587: URL: https://github.com/apache/datafusion/pull/11587#discussion_r1685926607
########## datafusion/physical-plan/src/coalesce_batches.rs: ########## @@ -216,6 +218,41 @@ impl CoalesceBatchesStream { match input_batch { Poll::Ready(x) => match x { Some(Ok(batch)) => { + let new_columns: Vec<Arc<dyn Array>> = batch + .columns() + .iter() + .map(|c| { + // Try to re-create the `StringViewArray` to prevent holding the underlying buffer too long. + if let Some(s) = c.as_string_view_opt() { + let view_cnt = s.views().len(); + let buffer_size = s.get_buffer_memory_size(); + + // Re-creating the array copies data and can be time consuming. + // We only do it if the array is sparse, below is a heuristic to determine if the array is sparse. + if buffer_size > (view_cnt * 32) { Review Comment: If there are lots of long strings in the buffer, will it be triggered every time even if nothing has been filtered? ########## datafusion/physical-plan/src/coalesce_batches.rs: ########## @@ -216,6 +218,41 @@ impl CoalesceBatchesStream { match input_batch { Poll::Ready(x) => match x { Some(Ok(batch)) => { + let new_columns: Vec<Arc<dyn Array>> = batch + .columns() + .iter() + .map(|c| { + // Try to re-create the `StringViewArray` to prevent holding the underlying buffer too long. + if let Some(s) = c.as_string_view_opt() { + let view_cnt = s.views().len(); + let buffer_size = s.get_buffer_memory_size(); + + // Re-creating the array copies data and can be time consuming. + // We only do it if the array is sparse, below is a heuristic to determine if the array is sparse. + if buffer_size > (view_cnt * 32) { + // We use a block size of 2MB (instead of 8KB) to reduce the number of buffers to track. + // See https://github.com/apache/arrow-rs/issues/6094 for more details. + let mut builder = Review Comment: Here builder didn't use the deduplication mechanism, is that the case we can assume that `StringViewArray` is already deduplicated before coalescing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org