alamb opened a new issue, #11628:
URL: https://github.com/apache/datafusion/issues/11628

   ### Is your feature request related to a problem or challenge?
   
   In pictures, what  https://github.com/apache/datafusion/pull/11587 does is 
like
   this (to ensure lots of unreachable "garbage" does not accumulate in the 
output batch)
   
   ```
   ┌────────────────────┐
   │    RecordBatch     │                ┌────────────────────┐
   │   num_rows = 23    │                │    RecordBatch     │
   └────────────────────┘                │   num_rows = 23    │              
┌────────────────────┐
                                         └────────────────────┘              │  
                  │
   ┌────────────────────┐                                        Coalesce    │  
                  │
   │                    │ StringView::gc ┌────────────────────┐   Batches    │  
                  │
   │    RecordBatch     │                │    RecordBatch     │              │  
                  │
   │   num_rows = 50    │                │   num_rows = 50    │  ─ ─ ─ ─ ─▶  │  
                  │
   │                    │    ─ ─ ─ ─ ─▶  │                    │              │  
  RecordBatch     │
   │                    │                └────────────────────┘              │  
 num_rows = 106   │
   └────────────────────┘                                                    │  
                  │
                                                                             │  
                  │
   ┌────────────────────┐                ┌────────────────────┐              │  
                  │
   │                    │                │    RecordBatch     │              │  
                  │
   │    RecordBatch     │                │   num_rows = 33    │              │  
                  │
   │   num_rows = 33    │                │                    │              
└────────────────────┘
   │                    │                └────────────────────┘
   └────────────────────┘
   ```
   
   However, as @2010YOUY01 pointed out in  
https://github.com/apache/datafusion/pull/11587/files#r1686678665
   
   > So here inside gc string buffer will be copied once, (below) in
   > concat_batches() string buffer will be copied again, it seems possible to 
copy
   > only once by changing the internal implementation of concat_batches()
   
   This implementation will effectively copy the data twice -- once for the 
call to
   `gc` and once for the call coalsece batches.
   
   Due to the nature of `StringView` the actual strings vaules are only copied 
once, but the `u128` view value will be copied twice
   
   
   ### Describe the solution you'd like
   
   Somehow structure the code to avoid copying the views again. Like this
   
   ```
   ┌────────────────────┐                                       
   │    RecordBatch     │                                       
   │   num_rows = 23    │                 ┌────────────────────┐
   └────────────────────┘                 │                    │
                          StringView::gc  │                    │
   ┌────────────────────┐  and Coalesce   │                    │
   │                    │ Batches in same │                    │
   │    RecordBatch     │    operation    │                    │
   │   num_rows = 50    │                 │    RecordBatch     │
   │                    │    ─ ─ ─ ─ ─▶   │   num_rows = 106   │
   │                    │                 │                    │
   └────────────────────┘                 │                    │
                                          │                    │
   ┌────────────────────┐                 │                    │
   │                    │                 │                    │
   │    RecordBatch     │                 └────────────────────┘
   │   num_rows = 33    │                                       
   │                    │                                       
   └────────────────────┘                                       
   ```
   
   ### Describe alternatives you've considered
   
   https://github.com/apache/datafusion/pull/11587/files#r1687099239
   
   > I think given how concat is implemented for `StringView` it will only copy 
the fixed parts (not the actual string data)
   > 
   > Perhaps what we could do is implement a wrapper around 
arrow::concat_batches that has the datafusion specific GC trigger for sparse 
arrays, and falls back to concat for other types: 
https://docs.rs/arrow-select/52.1.0/src/arrow_select/concat.rs.html#150
   > 
   > ```rust
   > /// wrapper around [`arrow::compute::concat`] that 
   > pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {
   >  // loop over columns here and handle StringView specially, 
   >  // or fallback to concat
   >  }
   > ```
   
   
   
   ### Additional context
   
   https://github.com/apache/datafusion/issues/7957 is another related idea for 
avoding copies


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to