etseidl commented on PR #6159:
URL: https://github.com/apache/arrow-rs/pull/6159#issuecomment-2269758557

   I've done some more performance tweaking. By reworking `VariableWidthByteStreamSplitEncoder::put()` I've managed to get some pretty good speedups on the encoding side. The comparison below is against a baseline of the current state of my bss branch. I've left in the float benches for reference, followed by results for `FixedLenByteArray(n)` where `n = 2, 4-8, 16`.
   ```
   encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT
                           time:   [43.710 µs 43.941 µs 44.221 µs]
                           change: [-1.6776% -0.6826% +0.2648%] (p = 0.18 > 0.05)
                           No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT
                           time:   [111.19 µs 111.97 µs 112.79 µs]
                           change: [-2.5753% -1.3409% -0.1116%] (p = 0.04 < 0.05)
                           Change within noise threshold.
   Found 5 outliers among 100 measurements (5.00%)
     4 (4.00%) high mild
     1 (1.00%) high severe
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(2), encoding=BYTE_STREAM_SPLIT
                           time:   [49.573 µs 50.004 µs 50.432 µs]
                           change: [-53.988% -53.597% -53.183%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(4), encoding=BYTE_STREAM_SPLIT #2
                           time:   [84.666 µs 85.319 µs 86.056 µs]
                           change: [-44.200% -43.653% -43.183%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(5), encoding=BYTE_STREAM_SPLIT #3
                           time:   [108.97 µs 109.44 µs 110.03 µs]
                           change: [-38.164% -37.665% -37.185%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(6), encoding=BYTE_STREAM_SPLIT #4
                           time:   [128.91 µs 129.86 µs 130.99 µs]
                           change: [-32.994% -32.088% -31.191%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(7), encoding=BYTE_STREAM_SPLIT #5
                           time:   [157.03 µs 158.05 µs 159.18 µs]
                           change: [-29.519% -28.944% -28.346%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(8), encoding=BYTE_STREAM_SPLIT #6
                           time:   [168.02 µs 171.47 µs 176.56 µs]
                           change: [-6.5555% -5.5390% -4.2909%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     5 (5.00%) high mild
     2 (2.00%) high severe
   
   encoding: dtype=parquet::data_type::FixedLenByteArray(16), encoding=BYTE_STREAM_SPLIT #7
                           time:   [898.95 µs 900.20 µs 901.59 µs]
                           change: [-0.7839% -0.2549% +0.2553%] (p = 0.36 > 0.05)
                           No change in performance detected.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) high mild
     4 (4.00%) high severe
   ``` 
   
   The new code replaces the current `put()` logic
   ```rust
   values.iter().for_each(|x| {
       let bytes = x.as_bytes();
       ...
       self.buffer.extend(bytes)
   });
   ```
   with a parameterized function
   ```rust
   fn put_fixed<T: DataType, const TYPE_SIZE: usize>(dst: &mut [u8], values: &[T::T]) {
       let mut idx = 0;
       values.iter().for_each(|x| {
           let bytes = x.as_bytes();
           ...
           // copy with a trip count that is constant at compile time
           for i in 0..TYPE_SIZE {
               dst[idx + i] = bytes[i];
           }
           idx += TYPE_SIZE;
       });
   }
   ```
   for `n <= 8`. Over 8 bytes it seems better not to use the fixed-size loop (although the `extend()` is still replaced with `copy_from_slice()`).
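
   To illustrate how a const-generic helper like this can be selected at runtime, here is a minimal, self-contained sketch of the dispatch plus the over-8-bytes fallback. It is not the PR's actual code: the PR's version is generic over the parquet `DataType`, while this sketch just takes raw byte slices, and the function names and exact set of match arms are illustrative.
   ```rust
   // Sketch only: small widths get a const-size copy loop the compiler can
   // unroll; wider values fall back to copy_from_slice(). Names are made up.
   fn put_fixed<const TYPE_SIZE: usize>(dst: &mut [u8], values: &[&[u8]]) {
       let mut idx = 0;
       for bytes in values {
           for i in 0..TYPE_SIZE {
               dst[idx + i] = bytes[i];
           }
           idx += TYPE_SIZE;
       }
   }

   fn put_wide(dst: &mut [u8], values: &[&[u8]], type_size: usize) {
       // Over 8 bytes a plain per-value slice copy is used instead of the loop.
       for (chunk, bytes) in dst.chunks_exact_mut(type_size).zip(values) {
           chunk.copy_from_slice(bytes);
       }
   }

   fn put(dst: &mut [u8], values: &[&[u8]], type_size: usize) {
       match type_size {
           2 => put_fixed::<2>(dst, values),
           4 => put_fixed::<4>(dst, values),
           5 => put_fixed::<5>(dst, values),
           6 => put_fixed::<6>(dst, values),
           7 => put_fixed::<7>(dst, values),
           8 => put_fixed::<8>(dst, values),
           // other small widths would get their own arms
           _ => put_wide(dst, values, type_size),
       }
   }
   ```
   The idea behind the const generic is that the inner copy loop has a trip count known at compile time, so the optimizer can turn it into a fixed-size copy rather than a length-dependent `memcpy` call.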
   
   I'll push the new code once I have a roundtrip test to make sure it's 
working correctly. I also want to benchmark on a faster machine.
   
   In a subsequent PR I think I'll try tackling a more cache-friendly transpose for `type_size > 8` to see if I can get the `FLBA(16)` numbers down some.
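
   As a rough illustration of what a more cache-friendly transpose could look like, here is a minimal blocked sketch. This is not the PR's implementation; the block size, function name, and layout assumptions (value-major input transposed into the per-byte streams that BYTE_STREAM_SPLIT writes) are all illustrative.
   ```rust
   // Blocked byte-stream-split transpose sketch (illustrative only).
   // Processing a small block of values at a time keeps the read window and
   // the `type_size` write cursors within a cache-friendly range.
   fn split_streams_blocked(src: &[u8], dst: &mut [u8], type_size: usize) {
       const BLOCK: usize = 128; // values per block; purely a tuning guess
       let num_values = src.len() / type_size;
       for block_start in (0..num_values).step_by(BLOCK) {
           let block_end = (block_start + BLOCK).min(num_values);
           for stream in 0..type_size {
               for v in block_start..block_end {
                   // Byte `stream` of value `v` goes into output stream `stream`.
                   dst[stream * num_values + v] = src[v * type_size + stream];
               }
           }
       }
   }
   ```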

