HadrienG2 commented on code in PR #5760:
URL: https://github.com/apache/arrow-rs/pull/5760#discussion_r1599540493
##########
arrow-array/src/builder/boolean_builder.rs:
##########
@@ -166,6 +166,14 @@ impl BooleanBuilder {
BooleanArray::from(array_data)
}
+ /// Returns the current values buffer as a slice
+ ///
+ /// Boolean values are bit-packed into bytes, so to read the i-th boolean,
+ /// you must compute `(values_slice[i / 8] & (1 << (i % 8))) != 0`.
Review Comment:
Just to be clear: overall, I agree that the simple fact that we're having this
conversation is a very good argument in favor of pointing the docs at the
encapsulated `get_bit` instead of suggesting a manual implementation. So I
changed the doc in the latest commit of this branch, which I believe should be
enough to make it mergeable. If you want, we can move the remainder of this
discussion about `get_bit()` to a dedicated issue.
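For the record, here's the kind of usage the updated doc points readers towards
(a minimal sketch, assuming the `values_slice` accessor from this PR and the
existing `arrow_buffer::bit_util::get_bit` helper):
```rust
use arrow_array::builder::BooleanBuilder;
use arrow_buffer::bit_util;

fn main() {
    let mut builder = BooleanBuilder::new();
    builder.append_value(true);
    builder.append_value(false);

    // Decode the i-th boolean through the encapsulated helper instead
    // of hand-rolling the bit arithmetic at every call site.
    assert!(bit_util::get_bit(builder.values_slice(), 0));
    assert!(!bit_util::get_bit(builder.values_slice(), 1));
}
```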
But those benchmark results surprised me, and I couldn't find the associated
benchmark in the arrow repository. Given how tricky it is to correctly
benchmark such small functions (you need to be very careful about which
optimization barriers you use and where you place them, so as not to
unrealistically pessimize the code under test), I preferred to [write another
benchmark myself](https://github.com/HadrienG2/bitmap-bench.git).
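For context, the three contenders are essentially the following (a sketch that
mirrors the benchmark labels; the exact code lives in the repository linked
above):
```rust
/// Shift-based "naive" variant.
fn bit_naive(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1 << (i % 8))) != 0
}

/// Table lookup through a `const` item, which is inlined at each use site.
fn bit_const_table(data: &[u8], i: usize) -> bool {
    const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    (data[i / 8] & BIT_MASK[i % 8]) != 0
}

/// Table lookup through a `static` item, which has a single fixed
/// address in the binary.
fn bit_static_table(data: &[u8], i: usize) -> bool {
    static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    (data[i / 8] & BIT_MASK_STATIC[i % 8]) != 0
}
```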
---
Here's how the implementations compare in a random indexing scenario (the
benchmark probes N bit indices which are initially resident in CPU registers,
but whose values are hidden from the compiler's optimizer):
```text
hidden_constant/bit_naive
    time:   [1.3647 ns 1.3700 ns 1.3757 ns]
    thrpt:  [2.9076 Gelem/s 2.9197 Gelem/s 2.9311 Gelem/s]
hidden_constant/bit_const_table
    time:   [1.6002 ns 1.6047 ns 1.6103 ns]
    thrpt:  [2.4840 Gelem/s 2.4927 Gelem/s 2.4998 Gelem/s]
hidden_constant/bit_static_table
    time:   [1.4540 ns 1.4614 ns 1.4703 ns]
    thrpt:  [2.7206 Gelem/s 2.7370 Gelem/s 2.7510 Gelem/s]
```
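The "hidden from the optimizer" part is what optimization barriers are for;
conceptually, the harness does something like this (my sketch using
`std::hint::black_box`, not the literal benchmark code, which lives in the
linked repository):
```rust
use std::hint::black_box;

/// Probe a batch of bit indices whose values the optimizer cannot see,
/// so that the bit extraction cannot be constant-folded away.
fn probe_hidden(data: &[u8], indices: &[usize; 4], reps: usize) -> usize {
    let mut set_bits = 0;
    for _ in 0..reps {
        // 4-way unrolling amortizes loop/harness overhead over several
        // probes per iteration.
        for &i in indices {
            // `black_box` hides the index from the optimizer while
            // (hopefully) keeping it resident in a register.
            let i = black_box(i);
            set_bits += usize::from((data[i / 8] & (1 << (i % 8))) != 0);
        }
    }
    // Observe the result so the whole loop isn't dead code.
    black_box(set_bits)
}
```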
<details>
<summary>Assembly analysis</summary>
At the top, we find the `bit_naive` implementation, whose assembly is very
unsurprising (just the godbolt code above, duplicated 4 times by the 4-way
unrolling that I added to reduce benchmark harness overhead).

Then comes `bit_static_table`, which as expected is faster than the
`bit_const_table` implementation, but is hampered by the extra overhead of
loading the mask from memory and then doing a `test`/`setne` pair. The assembly
is again unsurprising given what we've seen previously.

And as expected, `bit_const_table` is slowest, though it surprises me how
much of a performance difference that single table store makes with respect to
the `bit_static_table` version (the store was sadly not completely moved out of
the benchmark loop, but it was at least somewhat amortized by the 4-way
unrolling that I added).

</details>
---
And here's how the implementations compare in a linear iteration scenario
(the compiler knows that we're iterating linearly over the bits of a bitmap and
is free to optimize accordingly):
```text
linear/bit_naive
    time:   [38.233 µs 38.382 µs 38.558 µs]
    thrpt:  [6.7988 Gelem/s 6.8298 Gelem/s 6.8565 Gelem/s]
linear/bit_const_table
    time:   [38.010 µs 38.100 µs 38.195 µs]
    thrpt:  [6.8634 Gelem/s 6.8803 Gelem/s 6.8967 Gelem/s]
linear/bit_static_table
    time:   [52.366 µs 52.398 µs 52.432 µs]
    thrpt:  [4.9997 Gelem/s 5.0030 Gelem/s 5.0060 Gelem/s]
```
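For reference, the linear scenario boils down to a plain counting loop like
this one (again a sketch, not the literal harness code):
```rust
/// Count the set bits of a bitmap by visiting every bit index in order.
/// Here the compiler can see the whole access pattern and is free to
/// vectorize or strength-reduce the bit extraction.
fn count_linear(data: &[u8]) -> usize {
    let mut set_bits = 0;
    for i in 0..data.len() * 8 {
        set_bits += usize::from((data[i / 8] & (1 << (i % 8))) != 0);
    }
    set_bits
}
```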
<details>
<summary>Assembly analysis</summary>
At the top, we find the `bit_naive` and `bit_const_table` implementations.
They both generate exactly the same, fairly optimal assembly in this easier
scenario: this time the compiler's optimizer managed to optimize out the table
store of `bit_const_table` and generated a plain linear bit-probing loop.

Sadly, `bit_static_table` did not get so lucky. To my surprise, it looks
like the compiler did not manage to prove that the `BIT_MASK_STATIC` array
never gets modified, so it could not generate the same code as for the const
`BIT_MASK`; instead, there are some surprise SSE instructions in there that I
do not have time to look into right now.

</details>
---
All numbers are from an AMD Zen 3 CPU (AMD Ryzen 5 5600G), compiled with
rustc v1.78 using mostly-default `cargo bench` settings (I use the LLD linker
instead of BFD, but that should not affect this particular benchmark).
Overall, I would argue that these results show that the naive version is
consistently fastest or tied for fastest, has fewer performance edge cases in
the face of compiler optimizer mishaps, and should thus become the new
implementation of `get_bit`.
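Concretely, the proposal boils down to something like this (a sketch against
the current `arrow_buffer::bit_util::get_bit` signature):
```rust
/// Returns whether the bit at position `i` in `data` is set.
pub fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1 << (i % 8))) != 0
}
```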
But if you disagree with my conclusion, please feel free to re-run my
benchmark with `cargo bench` on any machine whose performance is of interest to
you, or to point me to the benchmark that you used to conclude that the
`bit_naive` implementation is slower than the `bit_const_table` one, so that I
can figure out what it does differently and whether that points at an issue in
my benchmark or yours.