jhorstmann opened a new pull request, #8102:
URL: https://github.com/apache/arrow-rs/pull/8102

   # Which issue does this PR close?
   
   This is a proof of concept for what may be the fastest way of decoding 
bit-packed data, and also a showcase of why writing short RLE runs can be 
detrimental to performance when bit-packed decoding is fast.
   
   - Related to #7739.
   
   # Rationale for this change
   
   At the moment I'm not proposing to integrate this into the arrow codebase: 
the code would need further changes to support arbitrary batch sizes, and it 
currently supports only a single bit width and only `u8` as the target data 
type.
   
   # What changes are included in this PR?
   
   The benchmark includes a custom RLE/bit-packing hybrid encoder that supports 
only a bit width of 1 and writes only bit-packed runs. On mostly random input 
data, the size of the encoded buffer is comparable to the size produced by the 
standard `RleEncoder`. Decoding with the standard `RleDecoder` also shows that 
decoding bit-packed data is slightly faster.
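   To illustrate why bit-packed decoding can be so fast for a bit width of 1, 
here is a portable scalar sketch of the bit-to-byte expansion using a 
branch-free SWAR trick (this is my own illustration, not the decoder in this 
PR, and the function name is hypothetical):
   
   ```rust
   /// Expand the 8 bits of `b` (LSB first, as in Parquet bit-packing)
   /// into 8 bytes containing 0 or 1, without any branches.
   fn unpack_bits_lsb(b: u8) -> [u8; 8] {
       // Replicate the byte into all 8 lanes of a u64.
       let spread = (b as u64) * 0x0101_0101_0101_0101;
       // Keep only bit k in byte k.
       let isolated = spread & 0x8040_2010_0804_0201;
       // Turn each non-zero byte into 0x01: adding 0x7f sets bit 7 of a
       // byte exactly when the byte was non-zero.
       let ones = (isolated.wrapping_add(0x7f7f_7f7f_7f7f_7f7f)
           & 0x8080_8080_8080_8080)
           >> 7;
       ones.to_le_bytes()
   }
   ```
   
   An AVX-512 decoder can apply the same idea to 64 input bytes at a time, 
which is where the large speedups come from.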
   
   To run the benchmarks, you will need Rust 1.89, which stabilized AVX-512 
support, and an AVX-512-capable machine (at least Intel Ice Lake or AMD Zen 4).
   
   ```
   $ RUSTFLAGS="-Ctarget-cpu=native" cargo bench --features experimental --bench rle
   ```
   ```
   rle_decoder/decode_bitpacked
                           time:   [398.06 µs 398.66 µs 399.32 µs]
                           thrpt:  [2.6259 Gelem/s 2.6302 Gelem/s 2.6342 Gelem/s]
   rle_decoder/decode_hybrid
                           time:   [540.05 µs 542.22 µs 544.84 µs]
                           thrpt:  [1.9246 Gelem/s 1.9339 Gelem/s 1.9416 Gelem/s]
   ```
   
   The results are more interesting when decoding with a custom 
AVX-512-VBMI-optimized decoder:
   
   ```
   custom/decode_bitpacked time:   [17.642 µs 17.661 µs 17.683 µs]
                           thrpt:  [59.297 Gelem/s 59.372 Gelem/s 59.435 Gelem/s]
   custom/decode_hybrid    time:   [87.593 µs 87.866 µs 88.184 µs]
                           thrpt:  [11.891 Gelem/s 11.934 Gelem/s 11.971 Gelem/s]
   ```
   
   Decoding bit-packed data gets a speedup of about 22x, while decoding hybrid 
RLE data gets *only* about 6x faster. My guess is that this is caused by branch 
misprediction on the per-run dispatch, or by the call to a `memset`-like 
function that is not optimized for short runs.
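   To make that per-run overhead concrete, here is a simplified scalar sketch 
of a hybrid decoder loop for a bit width of 1 (my own illustration, not the 
decoder in this PR; real Parquet run headers are ULEB128 varints, assumed to be 
a single byte here for brevity):
   
   ```rust
   /// Decode a simplified RLE/bit-packing hybrid stream with bit width 1.
   /// Header low bit 0 = RLE run (count = header >> 1, one value byte follows),
   /// low bit 1 = bit-packed run ((header >> 1) groups of 8 values, one byte each).
   fn decode_hybrid_bw1(mut input: &[u8], out: &mut Vec<u8>) {
       while !input.is_empty() {
           let header = input[0] as usize;
           input = &input[1..];
           // This branch is taken once per run; with many short runs it is
           // hard to predict and dominates the decode time.
           if header & 1 == 0 {
               // RLE run: a memset-like fill, expensive relative to short runs.
               let count = header >> 1;
               let value = input[0] & 1;
               input = &input[1..];
               out.resize(out.len() + count, value);
           } else {
               // Bit-packed run: expand each byte into 8 output values.
               let groups = header >> 1;
               for &byte in &input[..groups] {
                   for bit in 0..8 {
                       out.push((byte >> bit) & 1);
                   }
               }
               input = &input[groups..];
           }
       }
   }
   ```
   
   A stream of many short runs executes the header branch and the fill setup 
over and over, while a single long bit-packed run spends nearly all its time in 
the inner expansion loop, which is exactly the part that vectorizes well.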
   
   
   # Are these changes tested?
   
   
   # Are there any user-facing changes?
   
   

