etseidl commented on PR #6159:
URL: https://github.com/apache/arrow-rs/pull/6159#issuecomment-2269758557
I've done some more performance tweaking. By reworking
`VariableWidthByteStreamSplitEncoder::put()` I've managed to get some pretty
good speedups on the encoding side. The comparison is against a baseline of the
current state of my bss branch. I've left in the float benches for comparison,
and then have results for `FixedLenByteArray(n)` where `n = 2, 4-8, 16`.
```
encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT
    time:   [43.710 µs 43.941 µs 44.221 µs]
    change: [-1.6776% -0.6826% +0.2648%] (p = 0.18 > 0.05)
    No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT
    time:   [111.19 µs 111.97 µs 112.79 µs]
    change: [-2.5753% -1.3409% -0.1116%] (p = 0.04 < 0.05)
    Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

encoding: dtype=parquet::data_type::FixedLenByteArray(2), encoding=BYTE_STREAM_SPLIT
    time:   [49.573 µs 50.004 µs 50.432 µs]
    change: [-53.988% -53.597% -53.183%] (p = 0.00 < 0.05)
    Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(4), encoding=BYTE_STREAM_SPLIT #2
    time:   [84.666 µs 85.319 µs 86.056 µs]
    change: [-44.200% -43.653% -43.183%] (p = 0.00 < 0.05)
    Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(5), encoding=BYTE_STREAM_SPLIT #3
    time:   [108.97 µs 109.44 µs 110.03 µs]
    change: [-38.164% -37.665% -37.185%] (p = 0.00 < 0.05)
    Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(6), encoding=BYTE_STREAM_SPLIT #4
    time:   [128.91 µs 129.86 µs 130.99 µs]
    change: [-32.994% -32.088% -31.191%] (p = 0.00 < 0.05)
    Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(7), encoding=BYTE_STREAM_SPLIT #5
    time:   [157.03 µs 158.05 µs 159.18 µs]
    change: [-29.519% -28.944% -28.346%] (p = 0.00 < 0.05)
    Performance has improved.

encoding: dtype=parquet::data_type::FixedLenByteArray(8), encoding=BYTE_STREAM_SPLIT #6
    time:   [168.02 µs 171.47 µs 176.56 µs]
    change: [-6.5555% -5.5390% -4.2909%] (p = 0.00 < 0.05)
    Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

encoding: dtype=parquet::data_type::FixedLenByteArray(16), encoding=BYTE_STREAM_SPLIT #7
    time:   [898.95 µs 900.20 µs 901.59 µs]
    change: [-0.7839% -0.2549% +0.2553%] (p = 0.36 > 0.05)
    No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
```
The new code replaces the current `put()` logic
```rust
values.iter().for_each(|x| {
    let bytes = x.as_bytes();
    ...
    self.buffer.extend(bytes)
});
```
with a parameterized function
```rust
fn put_fixed<T: DataType, const TYPE_SIZE: usize>(dst: &mut [u8], values: &[T::T]) {
    let mut idx = 0;
    values.iter().for_each(|x| {
        let bytes = x.as_bytes();
        ...
        for i in 0..TYPE_SIZE {
            dst[idx + i] = bytes[i];
        }
        idx += TYPE_SIZE;
    });
}
```
for `n <= 8`. Above 8 bytes it seems better not to use a loop (although
`extend()` is still replaced with `copy_from_slice()`).
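To make the idea above concrete, here is a minimal standalone sketch of byte-stream-split encoding with a const-generic width. This is illustrative only, not the PR's actual code: the real encoder works through `put_fixed` on `T::T` values inside `VariableWidthByteStreamSplitEncoder`, while this sketch just takes raw byte arrays.

```rust
// Hedged sketch: byte j of every value is written to output "stream" j.
// With TYPE_SIZE known at compile time, the compiler can fully unroll
// the inner loop, which is the point of the const-generic parameter.
fn bss_encode<const TYPE_SIZE: usize>(values: &[[u8; TYPE_SIZE]]) -> Vec<u8> {
    let n = values.len();
    let mut dst = vec![0u8; n * TYPE_SIZE];
    for (i, bytes) in values.iter().enumerate() {
        for j in 0..TYPE_SIZE {
            // Stream j occupies dst[j * n .. (j + 1) * n].
            dst[j * n + i] = bytes[j];
        }
    }
    dst
}

fn main() {
    // Two 4-byte values: byte k of each value lands in stream k.
    let encoded = bss_encode::<4>(&[[1u8, 2, 3, 4], [5, 6, 7, 8]]);
    assert_eq!(encoded, vec![1, 5, 2, 6, 3, 7, 4, 8]);
}
```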
I'll push the new code once I have a roundtrip test to make sure it's
working correctly. I also want to benchmark on a faster machine.
In a subsequent PR I think I'll try tackling a more cache-friendly transpose
for `type_size > 8` to see if I can get the `FLBA(16)` numbers down some.
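For context, byte-stream-split encoding is effectively a transpose of an `n x type_size` byte matrix, and one common cache-friendly shape is a blocked transpose that processes the rows in tiles. The sketch below is hypothetical, not code from this PR; the `BLOCK` size and the raw-bytes interface are assumptions for illustration.

```rust
// Hypothetical blocked transpose for wide types (e.g. type_size = 16).
// src is an n x type_size byte matrix in row-major order; dst receives
// type_size streams of n bytes each.
fn bss_encode_blocked(src: &[u8], type_size: usize, dst: &mut [u8]) {
    let n = src.len() / type_size;
    const BLOCK: usize = 64; // rows per tile; tuned so a tile stays in L1
    for row0 in (0..n).step_by(BLOCK) {
        let row_end = (row0 + BLOCK).min(n);
        // Within one tile, only BLOCK * type_size source bytes and the
        // matching slice of each destination stream are touched, so the
        // scattered writes stay cache-resident.
        for j in 0..type_size {
            for i in row0..row_end {
                dst[j * n + i] = src[i * type_size + j];
            }
        }
    }
}

fn main() {
    // Four 16-byte "values" holding the bytes 0..64.
    let src: Vec<u8> = (0u8..64).collect();
    let mut dst = vec![0u8; 64];
    bss_encode_blocked(&src, 16, &mut dst);
    // Stream 0 holds byte 0 of each value.
    assert_eq!(&dst[0..4], &[0, 16, 32, 48]);
}
```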
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]