XiangpengHao commented on code in PR #11587:
URL: https://github.com/apache/datafusion/pull/11587#discussion_r1686536860


##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -216,6 +218,41 @@ impl CoalesceBatchesStream {
             match input_batch {
                 Poll::Ready(x) => match x {
                     Some(Ok(batch)) => {
+                        let new_columns: Vec<Arc<dyn Array>> = batch
+                            .columns()
+                            .iter()
+                            .map(|c| {
+                                // Try to re-create the `StringViewArray` to prevent holding the underlying buffer too long.
+                                if let Some(s) = c.as_string_view_opt() {
+                                    let view_cnt = s.views().len();
+                                    let buffer_size = s.get_buffer_memory_size();
+
+                                    // Re-creating the array copies data and can be time consuming.
+                                    // We only do it if the array is sparse, below is a heuristic to determine if the array is sparse.
+                                    if buffer_size > (view_cnt * 32) {
+                                        // We use a block size of 2MB (instead of 8KB) to reduce the number of buffers to track.
+                                        // See https://github.com/apache/arrow-rs/issues/6094 for more details.
+                                        let mut builder =

Review Comment:
   Deduplication hashes every string value, which adds significant overhead.
Here we process small batches (default size 8192 rows) and then concatenate
them into a larger batch, so deduplicating within each small batch yields
little benefit.
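
A minimal sketch of the trade-off, using only the standard library rather than the arrow-rs API. The names `is_sparse` and `compact_without_dedup` are hypothetical, and views are modeled as plain `(offset, len)` pairs, which is an assumption for illustration and not arrow's actual view layout. The first function mirrors the `buffer_size > (view_cnt * 32)` heuristic from the diff; the second copies values into one contiguous buffer without hashing, so duplicate strings are simply copied again.

```rust
/// Hypothetical helper mirroring the heuristic in the diff: the array is
/// treated as sparse when its buffers hold more than 32 bytes per view.
fn is_sparse(view_cnt: usize, buffer_size: usize) -> bool {
    buffer_size > view_cnt * 32
}

/// Hypothetical compaction without deduplication: every value is appended
/// to a fresh contiguous buffer and recorded as an (offset, len) pair.
/// No hashing is performed, so repeated strings are stored repeatedly.
fn compact_without_dedup(values: &[&str]) -> (Vec<u8>, Vec<(usize, usize)>) {
    let total: usize = values.iter().map(|v| v.len()).sum();
    let mut buffer: Vec<u8> = Vec::with_capacity(total);
    let mut views: Vec<(usize, usize)> = Vec::with_capacity(values.len());
    for v in values {
        views.push((buffer.len(), v.len()));
        buffer.extend_from_slice(v.as_bytes());
    }
    (buffer, views)
}
```

With these definitions, `is_sparse(8192, 1 << 20)` is true for a batch of 8192 views backed by 1 MiB of buffers, since 8192 * 32 is 256 KiB. A deduplicating variant would need a hash lookup per value on top of the copy, which is the overhead the comment above refers to.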


