HadrienG2 opened a new issue, #5771:
URL: https://github.com/apache/arrow-rs/issues/5771

   **Describe the bug**
   The bitmap access and manipulation functions in 
`arrow_buffer::util::bit_util` do not follow the most obvious implementation 
strategy...
   
   ```rust
   pub fn get_bit(data: &[u8], i: usize) -> bool {
       data[i / 8] & (1 << (i % 8)) != 0
   }
   ```
   
   Instead, they use table lookup techniques like...
   
   ```rust
   const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
   pub fn get_bit(data: &[u8], i: usize) -> bool {
       (data[i >> 3] & BIT_MASK[i & 7]) != 0
   }
   ```
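   The setting and clearing helpers follow the same table lookup pattern. A rough sketch (the exact code in `bit_util` may differ slightly; `UNSET_BIT_MASK` stands for the complementary table with one bit cleared per byte):

   ```rust
   const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
   // Complement of BIT_MASK: each entry has exactly one bit cleared.
   const UNSET_BIT_MASK: [u8; 8] = [254, 253, 251, 247, 239, 223, 191, 127];

   pub fn set_bit(data: &mut [u8], i: usize) {
       data[i >> 3] |= BIT_MASK[i & 7];
   }

   pub fn unset_bit(data: &mut [u8], i: usize) {
       data[i >> 3] &= UNSET_BIT_MASK[i & 7];
   }
   ```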
   
   According to [the discussion within PR 
5760](https://github.com/apache/arrow-rs/pull/5760/files/f14bfbe3eca72bcf1f92d8d2e5ba6623c0060092#diff-8435cfc9abe9dd33cabed382049204a64acc49ffd2b7ad84dfe88339e864151a),
 this was demonstrated to be beneficial to performance during the development 
of older versions of arrow.
   
   However, in my testing with rustc 1.78 and a Zen 3 CPU (AMD Ryzen 5 5600G), I have found that modern rustc is perfectly capable of generating optimal code from the simple version, and that the table lookup version can easily degrade into worse performance when the compiler's optimization process does not fully succeed at eliminating its overhead.
   
   Therefore, I think arrow should switch back to the obvious code, and will 
submit a PR accordingly.
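   For reference, the setting and clearing counterparts of the obvious version would look roughly like this (a sketch of what such a change could contain, not necessarily the exact code of the upcoming PR):

   ```rust
   pub fn set_bit(data: &mut [u8], i: usize) {
       data[i / 8] |= 1 << (i % 8);
   }

   pub fn unset_bit(data: &mut [u8], i: usize) {
       data[i / 8] &= !(1 << (i % 8));
   }
   ```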
   
   **To Reproduce**
   The repository https://github.com/HadrienG2/bitmap-bench contains 
microbenchmarks of three implementations of bit readout, setting and clearing:
   
   - The simplest implementation, called `naive`.
   - The current arrow implementation, called `const_table`.
   - A variation of the current arrow implementation where the table is stored in a `static` variable instead of a `const`, called `static_table` (a sketch is shown below; the rationale for this variant is explained further down).
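   A minimal sketch of the `static_table` variant (the exact benchmark code lives in the repository above):

   ```rust
   // Same mask table as the current arrow code, but stored in a `static` so it
   // has a fixed address in the binary instead of being materialized by the
   // compiler wherever it is needed.
   static BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

   pub fn get_bit(data: &[u8], i: usize) -> bool {
       (data[i >> 3] & BIT_MASK[i & 7]) != 0
   }
   ```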
   
   The benchmark exercises these implementations on a bitmap that fits in the L1 cache (the regime where the performance of the bit access functions has the largest impact). Three access patterns are evaluated:
   
   - The `hidden_constant` pattern repeatedly accesses the bitmap at widely spaced indices which are hidden from the compiler's optimizer (see the sketch after this list). This models a scenario where the optimizer does not manage to figure out the user's data access pattern and must generate maximally pessimistic random bitmap access code.
   - The `linear_all` pattern accesses all bits of the bitmap from start to finish, and exposes this fact to the compiler's optimizer. The optimizer should thus be able to leverage this regular pattern: reading all bits should compile into a simple shift-and-test loop, setting all bits in order should compile into writing `0xff` to each byte of the bitmap, and clearing all bits in order should compile into writing `0x00` to each byte.
   - The `linear_strided` pattern is similar to `linear_all`, but accesses every other bit instead of every bit. This is a little harder on the optimizer, but writes are still expected to compile into byte-wise operations (`byte |= 0b01010101` for setting, `byte &= 0b10101010` for clearing).
   
   You can run the benchmark workloads with `cargo bench`, and on my machine 
(rustc 1.78, AMD Ryzen 5 5600G CPU), this yields the following results:
   
   - The simple implementation is almost always the fastest (within the margin of measurement error), except in one scenario (clearing all bits) where it was measured to be a bit slower. I investigated this scenario further: the compiler generates assembly that is strictly identical to that of the measured-fastest implementation, so my understanding is that the speed difference comes from a benchmark artifact, such as a code alignment issue, that is not relevant to real-world usage.
   - The current table lookup implementation can be as fast as the simple implementation when the compiler manages to optimize out the table and compile the lookup into in-register bit-testing code. But because the table is a `const`, which does not have a fixed memory location in the program's address space, whenever this optimization does not go perfectly smoothly the compiler must emit code that materializes the table on the stack, which is quite costly, followed by table lookups that are themselves costlier than in-register bit testing.
   - To evaluate the impact of these table writes, I wrote an alternate version of the table lookup implementation where the table is a `static` variable instead of a `const`. As expected, this implementation never suffers from table write overhead; however, it still pays for the table lookups (which, as mentioned earlier, are costlier than in-register bit testing), and it can also suffer from a different kind of optimization mishap where the compiler "forgets" that the table only contains bytes with a single bit set, resulting in worse code for simple patterns like linear iteration.
   
   **Expected behavior**
   The bit operations provided by arrow should be the ones that perform best, which in this case are also the simplest.

