2010YOUY01 commented on code in PR #11587:
URL: https://github.com/apache/datafusion/pull/11587#discussion_r1685926607


##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -216,6 +218,41 @@ impl CoalesceBatchesStream {
             match input_batch {
                 Poll::Ready(x) => match x {
                     Some(Ok(batch)) => {
+                        let new_columns: Vec<Arc<dyn Array>> = batch
+                            .columns()
+                            .iter()
+                            .map(|c| {
+                                // Try to re-create the `StringViewArray` to 
prevent holding the underlying buffer too long.
+                                if let Some(s) = c.as_string_view_opt() {
+                                    let view_cnt = s.views().len();
+                                    let buffer_size = 
s.get_buffer_memory_size();
+
+                                    // Re-creating the array copies data and 
can be time consuming.
+                                    // We only do it if the array is sparse, 
below is a heuristic to determine if the array is sparse.
+                                    if buffer_size > (view_cnt * 32) {

Review Comment:
   If there are lots of long strings in the buffer, will it be triggered every 
time even if nothing has been filtered?



##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -216,6 +218,41 @@ impl CoalesceBatchesStream {
             match input_batch {
                 Poll::Ready(x) => match x {
                     Some(Ok(batch)) => {
+                        let new_columns: Vec<Arc<dyn Array>> = batch
+                            .columns()
+                            .iter()
+                            .map(|c| {
+                                // Try to re-create the `StringViewArray` to 
prevent holding the underlying buffer too long.
+                                if let Some(s) = c.as_string_view_opt() {
+                                    let view_cnt = s.views().len();
+                                    let buffer_size = 
s.get_buffer_memory_size();
+
+                                    // Re-creating the array copies data and 
can be time consuming.
+                                    // We only do it if the array is sparse, 
below is a heuristic to determine if the array is sparse.
+                                    if buffer_size > (view_cnt * 32) {
+                                        // We use a block size of 2MB (instead 
of 8KB) to reduce the number of buffers to track.
+                                        // See 
https://github.com/apache/arrow-rs/issues/6094 for more details.
+                                        let mut builder =

Review Comment:
   Here builder didn't use the deduplication mechanism, is that the case we can 
assume that `StringViewArray` is already deduplicated before coalescing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to