HadrienG2 commented on code in PR #5760:
URL: https://github.com/apache/arrow-rs/pull/5760#discussion_r1599540493
##########
arrow-array/src/builder/boolean_builder.rs:
##########
@@ -166,6 +166,14 @@ impl BooleanBuilder {
BooleanArray::from(array_data)
}
+ /// Returns the current values buffer as a slice
+ ///
+ /// Boolean values are bit-packed into bytes, so to read the i-th boolean,
+ /// you must compute `values_slice[i / 8] & (1 << (i % 8)) != 0`.
Review Comment:
Just to be clear, overall I agree that the simple fact that we're having this
conversation is a very good argument in favor of adding a note about the
encapsulated `get_bit` to the docs, instead of suggesting a manual
implementation. So I changed the doc in the latest commit of this branch, which
I believe should be enough to make it mergeable. If you want, we can move the
remainder of this discussion about `get_bit()` to a dedicated issue.
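Concretely, the note now points readers at the existing helper instead of
spelling out a manual formula. It reads along these lines (a sketch from
memory; the exact wording is in the commit):
```rust
/// Returns the current values buffer as a slice
///
/// Boolean values are bit-packed into bytes. To read the i-th boolean,
/// you can use `arrow_buffer::bit_util::get_bit(values_slice, i)`.
```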
But those benchmark results surprised me, and I couldn't find the associated
benchmark in the arrow repository. Given how tricky it is to correctly
benchmark such small functions (you need to be very careful about which
optimization barriers you use and where you place them in order not to
unrealistically pessimize the code you're benchmarking), I preferred to [write
another benchmark myself](https://github.com/HadrienG2/bitmap-bench.git).
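For reference, the three contenders look roughly like this. This is a sketch:
the names match my benchmark, `BIT_MASK` matches the table that arrow's
`bit_util::get_bit` uses today, and the exact bodies may differ in detail from
the repository.
```rust
/// Naive version: compute the mask with a shift
#[inline]
pub fn bit_naive(bytes: &[u8], i: usize) -> bool {
    bytes[i / 8] & (1 << (i % 8)) != 0
}

/// `const` table: it has no stable address, so the compiler may
/// materialize a fresh copy of the table near each use site
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

#[inline]
pub fn bit_const_table(bytes: &[u8], i: usize) -> bool {
    bytes[i / 8] & BIT_MASK[i % 8] != 0
}

/// `static` table: a single copy at a fixed address in static memory
static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

#[inline]
pub fn bit_static_table(bytes: &[u8], i: usize) -> bool {
    bytes[i / 8] & BIT_MASK_STATIC[i % 8] != 0
}
```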
---
Here's how the implementations compare on a random indexing scenario (each
implementation is asked to probe N bit indices which are initially resident in
CPU registers, but whose values are hidden from the compiler's optimizer):
```text
hidden_constant/bit_naive
                        time:   [1.3571 ns 1.3582 ns 1.3593 ns]
                        thrpt:  [2.9426 Gelem/s 2.9452 Gelem/s 2.9475 Gelem/s]
hidden_constant/bit_const_table
                        time:   [1.5597 ns 1.5729 ns 1.5898 ns]
                        thrpt:  [2.5161 Gelem/s 2.5431 Gelem/s 2.5646 Gelem/s]
hidden_constant/bit_static_table
                        time:   [1.4149 ns 1.4161 ns 1.4174 ns]
                        thrpt:  [2.8221 Gelem/s 2.8248 Gelem/s 2.8271 Gelem/s]
```
<details>
<summary>Assembly analysis</summary>
At the top, we find the `bit_naive` implementation, whose assembly is very
unsurprising: it is just the godbolt code above, duplicated 4 times by the
4-way unrolling that I added to reduce benchmark harness overhead, with a bit
of unroll-and-jam going on.

Then comes `bit_static_table`, which as expected is faster than the
`bit_const_table` implementation, but is hampered by the extra overhead of
loading data from memory and then doing a `test`/`setne` pair. The assembly is
again unsurprising given what we've seen previously.

And as expected, `bit_const_table` is slowest, though it surprises me how much
of a performance difference that single table store (which was sadly not
completely hoisted out of the benchmark loop, but was at least somewhat
amortized by the 4-way unrolling that I added) makes relative to the
`bit_static_table` version.

</details>
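As an aside, here is roughly what "hidden from the compiler's optimizer" means
above. This is a minimal single-probe sketch of the barrier placement using
`std::hint::black_box`, not the exact 4-way-unrolled harness from my
repository:
```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn hidden_constant(c: &mut Criterion) {
    let bytes = vec![0b1010_1010u8; 4096];
    c.bench_function("hidden_constant/bit_naive", |b| {
        b.iter(|| {
            // black_box() hides the index from the optimizer so that the
            // probe cannot be constant-folded, while the index itself can
            // still live in a CPU register (no memory load involved).
            let i = black_box(1234usize);
            black_box(bytes[i / 8] & (1 << (i % 8)) != 0)
        })
    });
}

criterion_group!(benches, hidden_constant);
criterion_main!(benches);
```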
---
And here's how the implementations compare on a linear iteration scenario (the
compiler knows that we're iterating linearly over the bits of a bitmap and is
free to optimize accordingly):
```text
linear/bit_naive        time:   [38.233 µs 38.382 µs 38.558 µs]
                        thrpt:  [6.7988 Gelem/s 6.8298 Gelem/s 6.8565 Gelem/s]
linear/bit_const_table  time:   [38.010 µs 38.100 µs 38.195 µs]
                        thrpt:  [6.8634 Gelem/s 6.8803 Gelem/s 6.8967 Gelem/s]
linear/bit_static_table time:   [52.366 µs 52.398 µs 52.432 µs]
                        thrpt:  [4.9997 Gelem/s 5.0030 Gelem/s 5.0060 Gelem/s]
```
<details>
<summary>Assembly analysis</summary>
At the top, we find the `bit_naive` and `bit_const_table` implementations.
They both generate the exact same, fairly optimal assembly in this easier
scenario, as this time the compiler's optimizer managed to optimize out the
table store of `bit_const_table` and just generated a linear bit-probing loop.

Sadly, `bit_static_table` did not get so lucky. To my surprise, it looks like
the compiler did not manage to prove that the `BIT_MASK_STATIC` array never
gets modified, and so it could not generate the same code as for the const
`BIT_MASK`; instead, there are some surprising SSE instructions in there that
I do not have time to look into right now.

</details>
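For completeness, the linear scenario boils down to a loop along these lines
(again a simplified sketch of my benchmark, with the `criterion` harness
omitted). Because the index pattern is visible, the optimizer is free to hoist
table accesses, vectorize, or strength-reduce the probes:
```rust
use std::hint::black_box;

/// Probe every bit of the bitmap in order, counting set bits so that
/// the probes have an observable result and cannot be optimized away.
fn count_ones_linear(bytes: &[u8]) -> usize {
    let mut count = 0;
    for i in 0..bytes.len() * 8 {
        if bytes[i / 8] & (1 << (i % 8)) != 0 {
            count += 1;
        }
    }
    count
}

fn main() {
    let bytes = vec![0b0110_1001u8; 32 * 1024];
    // black_box() keeps the input opaque and the result live
    black_box(count_ones_linear(black_box(&bytes)));
}
```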
---
All numbers are from an AMD Zen 3 CPU (AMD Ryzen 5 5600G), compiled with
rustc v1.78 on mostly-default `cargo bench` settings (I use the LLD linker
instead of BFD, but that should not affect this particular benchmark).
Overall, I would argue that this shows that the naive version is always at
least tied for fastest, has fewer performance edge cases in the face of
compiler optimizer mishaps, and thus should become the new implementation of
`get_bit`.
But if you disagree with my conclusion, please feel free to re-run my
benchmark with `cargo bench` on any machine whose performance is of interest to
you, or point me to the benchmark that you used to conclude that the
`bit_naive` implementation is slower than the `bit_const_table` one, so that I
can figure out what it's doing differently and whether that points at an issue
in my benchmark or yours.