XiangpengHao commented on issue #11628:
URL: https://github.com/apache/datafusion/issues/11628#issuecomment-2246432690
Got some time to think about this and want to share my thoughts here:
#### Implementation
The goal is to reduce string copying: specifically, to copy the string data only once and construct the view array only once.
I implemented the GC in `concat_batches` on my local branch; the code looks
like this:
```rust
for i in 0..field_num {
    let data_type = schema.field(i).data_type();
    match data_type {
        // For string views, rebuild the array through a builder so the
        // string data is compacted (GC'd) while concatenating.
        &arrow_schema::DataType::Utf8View => {
            let mut string_view_builder = StringViewBuilder::with_capacity(row_count)
                .with_block_size(1024 * 1024 * 2);
            for b in batches.iter() {
                let array = b.column(i).as_string_view();
                for v in array.iter() {
                    string_view_builder.append_option(v);
                }
            }
            let array = string_view_builder.finish();
            arrays.push(Arc::new(array) as ArrayRef);
        }
        // All other types go through the regular concat kernel.
        _ => {
            let array = arrow::compute::concat(
                &batches
                    .iter()
                    .map(|batch| batch.column(i).as_ref())
                    .collect::<Vec<_>>(),
            )?;
            arrays.push(array);
        }
    }
}
```
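For intuition, here is a minimal, self-contained sketch of what the `Utf8View` branch above is doing, using a toy view representation rather than arrow-rs types (`ViewArray` and `concat_with_gc` are illustrative names, not real APIs): the surviving string bytes are copied once into a fresh buffer, and the view list is built once.

```rust
// A toy model of the Utf8View idea, NOT arrow-rs types: each view records
// an (offset, len) into a shared byte buffer.
struct ViewArray {
    buffer: Vec<u8>,            // compacted string data
    views: Vec<(usize, usize)>, // one (offset, len) view per string
}

impl ViewArray {
    fn get(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.buffer[off..off + len]).unwrap()
    }
}

/// "GC while concatenating": copy each string's bytes exactly once into a
/// single fresh buffer, and build the view list exactly once.
fn concat_with_gc(batches: &[Vec<&str>]) -> ViewArray {
    let mut buffer = Vec::new();
    let mut views = Vec::new();
    for batch in batches {
        for s in batch {
            let off = buffer.len();
            buffer.extend_from_slice(s.as_bytes()); // the single copy
            views.push((off, s.len()));
        }
    }
    ViewArray { buffer, views }
}

fn main() {
    let out = concat_with_gc(&[vec!["foo", "bar"], vec!["baz"]]);
    assert_eq!(out.get(2), "baz");
    println!("{} strings, {} bytes", out.views.len(), out.buffer.len());
}
```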
I benchmarked this implementation on ClickBench Q20:
```sql
SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
```
Performance is about 20% slower.
#### Profiling
I checked the flamegraph and found that the new implementation spends
significantly more time on page faults.
<img width="792" alt="image"
src="https://github.com/user-attachments/assets/ac4cae52-d521-49fb-aedd-781edd636ea2">
Then I ran heaptrack (which requires disabling `mimalloc`) and found that the
peak RSS (resident set size) increased from 1.3 GB to 5.2 GB.
I believe the performance regression is due to **late GC**. Previously, we
called GC immediately after running the filter. Now, GC runs only after enough
values have accumulated in the buffer, which can keep the underlying buffers
alive for an excessively long time.
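To make the late-GC effect concrete, here is a hedged, self-contained sketch (plain `Rc<Vec<u8>>` standing in for shared Arrow buffers; `late_gc_peak` and `eager_gc_kept` are hypothetical names): when every source buffer is held until the final build, peak memory is the sum of all backing buffers, even though only a few bytes per batch survive the filter.

```rust
use std::rc::Rc;

/// Bytes resident at the peak if every source buffer is held until the
/// final build ("late GC", as in the benchmarked implementation).
fn late_gc_peak(sources: &[Rc<Vec<u8>>]) -> usize {
    // Clones of every batch's backing buffer are pinned simultaneously.
    let pending: Vec<Rc<Vec<u8>>> = sources.iter().cloned().collect();
    pending.iter().map(|b| b.len()).sum()
}

/// Bytes kept if each batch's survivors are copied out immediately and
/// the source buffer is released before the next batch ("eager GC").
fn eager_gc_kept(sources: Vec<Rc<Vec<u8>>>) -> usize {
    let mut survivors = Vec::new();
    for src in sources {
        survivors.push(src[..8].to_vec()); // only 8 bytes survive the filter
        // `src` is dropped here: its backing buffer is freed right away
    }
    survivors.iter().map(|v| v.len()).sum()
}

fn main() {
    // Each batch pins a 1 KiB buffer although only 8 bytes survive.
    let sources: Vec<Rc<Vec<u8>>> =
        (0u8..4).map(|i| Rc::new(vec![i; 1024])).collect();
    println!("late GC pins {} bytes", late_gc_peak(&sources));
    println!("eager GC keeps {} bytes", eager_gc_kept(sources));
}
```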
#### Solution?
I think the discussion around reducing copies can be divided into two
sub-questions:
1. What is the current overhead of StringView in `CoalesceBatchesExec`? The
current implementation does not copy string data. The overhead (extra steps)
comes from constructing the views three times (once in the filter step, once in
the coalesce GC, and once in `concat_batches`). The implementation above gets
rid of the second one, but at an improper time.
2. Should we refactor filter-then-coalesce into a single operator? That way,
we would not produce intermediate small batches, thus reducing copies. I
believe this is a bigger project and could solve the first problem along the
way.
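A rough sketch of the fused-operator idea in sub-question 2, under the assumption that survivors can be appended straight into one output builder (`filter_coalesce` is a hypothetical name, and a plain `Vec<String>` stands in for a `StringViewBuilder`): no small intermediate batches are materialized, and each batch's backing memory can be released as soon as it is scanned.

```rust
/// Hypothetical fused filter+coalesce: survivors are appended directly
/// into one output builder, so string bytes are copied exactly once and
/// no intermediate small batch pins its source buffer.
fn filter_coalesce<F>(batches: &[Vec<&str>], pred: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let mut out = Vec::new(); // stands in for a single StringViewBuilder
    for batch in batches {
        for s in batch {
            if pred(s) {
                out.push((*s).to_string()); // the only copy of the bytes
            }
        }
        // the batch's backing memory can be released here, immediately
    }
    out
}

fn main() {
    let batches = vec![
        vec!["google.com", "example.org"],
        vec!["mail.google.com"],
    ];
    let hits = filter_coalesce(&batches, |s| s.contains("google"));
    assert_eq!(hits, vec!["google.com", "mail.google.com"]);
}
```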
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]