[I] Improve performance of constructing `ByteView`s for small strings [arrow-rs]

via GitHub Wed, 10 Jul 2024 03:34:54 -0700


alamb opened a new issue, #6034:
URL: https://github.com/apache/arrow-rs/issues/6034


   
   Part of https://github.com/apache/arrow-rs/issues/5374
   
   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   NoteL This is copied from the wonderful description on  
https://github.com/apache/arrow-rs/pull/6031 from @XiangpengHao  and @aoil-al
   
   In our benchmark, [every string is larger than 12 
bytes](https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/parquet/benches/arrow_reader.rs#L288),
 this means that when making the views, we always fall into [this 
branch](https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-array/src/builder/generic_bytes_view_builder.rs#L419-L424).
 The specialty about this branch is that it only reads the first 4 bytes of the 
string, and LLVM is smart enough to inline loading the values (i.e., prevent 
calling into `copy_non_overlapping`). 
   
   #### Is that a big deal?
   Unfortunately, yes. If you change the benchmark string to very small, e.g., 
`"test"`, you should be able to see (using the same benchmark script above) the 
performance of loading `ViewArray` drops significantly, and it doesn't make 
sense -- how could building `StringViewArray`s wit shorter strings be so much 
slower than building longer strings?
   
   The root cause is that when string is short, we need to `memcpy` the bytes 
to the `view`, as the compiler has no idea how long the string is, it can not 
inline the bytes and has to call to `memcpy`, which is slow.
   
   **Describe the solution you'd like**
   
   I would like to improve the performance of StringView creation more
   
   **Describe alternatives you've considered**
   
   here's a special variant of `make_view`, you can replace [this 
function](https://github.com/apache/arrow-rs/blob/master/arrow-array/src/builder/generic_bytes_view_builder.rs#L411)
 with the following new implementation, and the performance improves. 
   
   This new implementation makes 13 copies of make_view_inner; each copy is a 
specialized length known to the compiler (`LEN` is const), and then the 
compiler can optimize each of the implementations as needed. Looking at the 
assembly code, there is indeed no call to `memcpy`.
   
   ```rust
   /// Creates a view from a fixed length input (the compiler can generate
   /// specialized code for this)
   fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
       let mut view_buffer = [0; 16];
       view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
       view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
       u128::from_le_bytes(view_buffer)
   }
   
   /// Create a view based on the given data, block id and offset
   #[inline(always)]
   pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
       let len = data.len() as u32;
       // Generate specialized code for each potential small string length
       // to improve performance
       match len {
           0 => make_view_inner::<0>(data),
           1 => make_view_inner::<1>(data),
           2 => make_view_inner::<2>(data),
           3 => make_view_inner::<3>(data),
           4 => make_view_inner::<4>(data),
           5 => make_view_inner::<5>(data),
           6 => make_view_inner::<6>(data),
           7 => make_view_inner::<7>(data),
           8 => make_view_inner::<8>(data),
           9 => make_view_inner::<9>(data),
           10 => make_view_inner::<10>(data),
           11 => make_view_inner::<11>(data),
           12 => make_view_inner::<12>(data),
           _ => {
               let view = ByteView {
                   length: len,
                   prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                   buffer_index: block_id,
                   offset,
               };
               view.into()
           }
       }
   }
   
   ```
    (special thanks to @aoli-al for triangulating the root cause and prototype 
the fix)
   
   
   While the above is somewhat non intuitive, if it improves performance and is 
sufficiently commented, it would be a good thing to add
   
   
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Improve performance of constructing `ByteView`s for small strings [arrow-rs]

Reply via email to