HadrienG2 opened a new issue, #5771:
URL: https://github.com/apache/arrow-rs/issues/5771
**Describe the bug**
The bitmap access and manipulation functions in
`arrow_buffer::util::bit_util` do not follow the most obvious implementation
strategy...
```rust
pub fn get_bit(data: &[u8], i: usize) -> bool {
    data[i / 8] & (1 << (i % 8)) != 0
}
```
Instead, they use table lookup techniques like...
```rust
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
pub fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}
```
According to [the discussion within PR
5760](https://github.com/apache/arrow-rs/pull/5760/files/f14bfbe3eca72bcf1f92d8d2e5ba6623c0060092#diff-8435cfc9abe9dd33cabed382049204a64acc49ffd2b7ad84dfe88339e864151a),
this was demonstrated to be beneficial to performance during the development
of older versions of arrow.
However, in my testing with rustc 1.78 on a Zen 3 CPU (AMD Ryzen 5 5600G),
I have found that modern rustc is perfectly capable of generating optimal
code from the simple version, and that the table lookup version can easily
degrade into worse performance when the compiler does not fully succeed at
optimizing away its overhead.
Therefore, I think arrow should switch back to the obvious code, and will
submit a PR accordingly.
**To Reproduce**
The repository https://github.com/HadrienG2/bitmap-bench contains
microbenchmarks of three implementations of bit readout, setting and clearing:
- The simplest implementation, called `naive`.
- The current arrow implementation, called `const_table`.
- A variation of the current arrow implementation where the table is stored
inside of a static variable instead of a const, called `static_table` (the
rationale for this variant will be explained below).
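A sketch of the `static_table` variant, assuming the same masks as the current arrow code (the exact implementation lives in the benchmark repository):
```rust
// Sketch of the `static_table` variant: same lookup logic as the current
// arrow implementation, but the mask table is a `static` with a fixed
// address in the binary instead of a `const` value.
static BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

pub fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}
```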
These implementations are exercised by the benchmark on a bitmap that fits
into the L1 cache (the situation where the performance of bit access functions
has the largest impact). Three access patterns are evaluated:
- The `hidden_constant` pattern repeatedly accesses the bitmap at widely
spaced indices that are hidden from the compiler's optimizer. This models a
scenario where the optimizer cannot figure out the user's data access pattern
and must generate maximally pessimistic random bitmap access code.
- The `linear_all` pattern accesses all bits of the bitmap from start to
finish and exposes this fact to the compiler's optimizer. The optimizer should
thus be able to exploit this regular pattern: reading all bits of a bitmap
should compile into a simple shift-and-test loop, setting all bits in order
should compile into writing `0xff` to each byte of the bitmap, and clearing
all bits in order should compile into writing `0x00` to each byte.
- The `linear_strided` pattern is similar to `linear_all`, but accesses
every other bit instead of every bit. This is a little harder on the
optimizer, but writes are still expected to compile into byte-wise operations
(`byte |= 0b01010101` for setting, `byte &= 0b10101010` for clearing).
Sketches of these access patterns follow this list.
You can run the benchmark workloads with `cargo bench`, and on my machine
(rustc 1.78, AMD Ryzen 5 5600G CPU), this yields the following results:
- The simple implementation is almost always the fastest (within measurement
error), except in one scenario (clearing all bits) where it was measured to be
a bit slower. I investigated this scenario further, and it turns out that the
compiler generated assembly strictly identical to that of the measured-fastest
implementation, so my understanding is that the speed difference comes from a
benchmark artifact such as a code alignment issue that is not relevant to
real-world usage.
- The current table lookup implementation can be as fast as the simple
implementation when the compiler manages to optimize out the table and compile
it into in-register bit-testing code. But because the table is a `const`,
which does not have a fixed memory location in the program's address space
(see the `const` vs `static` sketch after this list), whenever this
optimization does not go perfectly smoothly the compiler must emit an
instruction that writes the table to the program stack, which is quite costly,
followed by table lookups that are themselves costlier than in-register bit
testing.
- To evaluate the impact of these table writes, I wrote an alternate version
of the table lookup implementation where the table is a `static` variable
instead of a `const`. As expected, this implementation never suffers from
table write overhead; however, it still suffers from table lookup overhead
(which, as mentioned earlier, is costlier than in-register bit testing), and
it can also fall victim to a different kind of optimization mishap where the
optimizer "forgets" that the table only contains bytes with a single bit set,
resulting in worse code for simple patterns like linear iteration.
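For readers unfamiliar with the distinction exploited here, a minimal illustration of the two table declarations (my summary of Rust semantics, not code taken from the benchmark):
```rust
// A `const` is a value, not a place in memory: it is conceptually copied
// into every use site, so when the optimizer fails to fold the lookup away,
// the table may first have to be materialized (e.g. written to the stack).
const BIT_MASK_CONST: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

// A `static` is a single object with a fixed address, so lookups can read it
// in place without any prior table write (at the cost of a memory load).
static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
```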
**Expected behavior**
The bit operations provided by arrow should be the ones that perform best,
which in this case are also the simplest.