XiangpengHao opened a new pull request, #6031:
URL: https://github.com/apache/arrow-rs/pull/6031

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
    I made a mistake in a previous pr which writes:
   ```rust
   let block_id = output.append_block(self.buf.clone().into());
   ``` 
   I thought the `into()` will convert the `Bytes` into 
`array_buffer::Buffer()` without copying the data.
   
   But `self.buf` is `bytes::Bytes`, not `arrow_buffer::Bytes` (they 
confusingly having the same name). The consequence is that the code above will 
run into this procedure: 
https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-buffer/src/buffer/immutable.rs#L361-L370,
 which implicitly copies data.
   
   (I think we should do something to prevent future similar mistakes, but I 
don't know how).
   
   
   <!--
   Why are you proposing this change? If this is already explained clearly in 
the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your 
changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   This PR explicitly converts `bytes::Bytes` into `arrow_buffer::Bytes` and 
then converts to `arrow_buffer::Buffer`; I have manually verified that each of 
the conversions requires no copy.
   
   This further improves the benchmark performance:
   ```bash
   cargo bench --bench arrow_reader --features="arrow test_common experimental" 
"arrow_array_reader/BinaryViewArray/"
   ```
   
   ```
   arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs
                           time:   [129.51 µs 129.57 µs 129.63 µs]
                           change: [-35.172% -35.134% -35.094%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs
                           time:   [131.07 µs 131.11 µs 131.15 µs]
                           change: [-59.798% -59.758% -59.723%] (p = 0.00 < 
0.05)
                           Performance has improved.
   
   arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs
                           time:   [125.08 µs 125.11 µs 125.15 µs]
                           change: [-32.699% -32.660% -32.621%] (p = 0.00 < 
0.05)
                           Performance has improved.
   ```
   
   ### Bonus (not related to this PR)
   You may wonder how it is possible to be [faster than 
StringArray](https://github.com/apache/arrow-rs/pull/5972) if the current 
implementation still copies data. TLDR: bc `memcpy` inlining.
   
   In our benchmark, [every string is larger than 12 
bytes](https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/parquet/benches/arrow_reader.rs#L288),
 this means that when making the views, we always fall into [this 
branch](https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-array/src/builder/generic_bytes_view_builder.rs#L419-L424).
 The specialty about this branch is that it only reads the first 4 bytes of the 
string, and LLVM is smart enough to inline loading the values (i.e., prevent 
calling into `copy_non_overlapping`). 
   
   #### Is that a big deal?
   Unfortunately, yes. If you change the benchmark string to very small, e.g., 
`"test"`, you should be able to see (using the same benchmark script above) the 
performance of loading `ViewArray` drops significantly, and it doesn't make 
sense -- how could build shorter string much slower than building longer string?
   
   #### How to fix the short string regression?
   The root cause is that when string is short, we need to `memcpy` the bytes 
to the `view`, as the compiler has no idea how long the string is, it can not 
inline the bytes and has to call to `memcpy`, which is slow.
   To convince you more, here's a special variant of `make_view`, you can 
replace [this 
function](https://github.com/apache/arrow-rs/blob/master/arrow-array/src/builder/generic_bytes_view_builder.rs#L411)
 with the following new implementation, and the performance will back to 
normal. This new implementation makes 13 copies of make_view_inner; each copy 
is a specialized length known to the compiler (`LEN` is const), and then the 
compiler can optimize each of the implementations as needed. Looking at the 
assembly code, there is indeed no call to `memcpy`.
   
   ```rust
   fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
       let mut view_buffer = [0; 16];
       view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
       view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
       u128::from_le_bytes(view_buffer)
   }
   
   /// Create a view based on the given data, block id and offset
   #[inline(always)]
   pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
       let len = data.len() as u32;
       match len {
           0 => make_view_inner::<0>(data),
           1 => make_view_inner::<1>(data),
           2 => make_view_inner::<2>(data),
           3 => make_view_inner::<3>(data),
           4 => make_view_inner::<4>(data),
           5 => make_view_inner::<5>(data),
           6 => make_view_inner::<6>(data),
           7 => make_view_inner::<7>(data),
           8 => make_view_inner::<8>(data),
           9 => make_view_inner::<9>(data),
           10 => make_view_inner::<10>(data),
           11 => make_view_inner::<11>(data),
           12 => make_view_inner::<12>(data),
           _ => {
               let view = ByteView {
                   length: len,
                   prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                   buffer_index: block_id,
                   offset,
               };
               view.into()
           }
       }
   }
   
   ```
    (special thanks to @aoli-al for triangulating the root cause and prototype 
the fix)
   
   What should we do?
   I don't know, I'm not sure if we want to merge that special `make_view`, as 
it is very unintuitive. I'll keep a local copy of this `make_view` and think 
about it in the background, maybe I'll have a better idea in a month.
   
   <!--
   There is no need to duplicate the description in the issue here but it is 
sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be 
updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking 
change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to