HadrienG2 commented on code in PR #5760:
URL: https://github.com/apache/arrow-rs/pull/5760#discussion_r1599540493


##########
arrow-array/src/builder/boolean_builder.rs:
##########
@@ -166,6 +166,14 @@ impl BooleanBuilder {
         BooleanArray::from(array_data)
     }
 
+    /// Returns the current values buffer as a slice
+    ///
+    /// Boolean values are bit-packed into bytes, so to read the i-th boolean,
+    /// you must compute `(values_slice[i / 8] & (1 << (i % 8))) != 0`.

Review Comment:
   Just to be clear, overall I agree that the simple fact that we're having
this conversation is a very good argument in favor of pointing the docs at the
encapsulated `get_bit` helper instead of suggesting a manual implementation. So
I changed the doc in the latest commit of this branch, which I believe should
be enough to make it mergeable. If you want, we can move the remainder of this
discussion about `get_bit()` to a dedicated issue.
   
   But those benchmark results surprised me, and I couldn't find the associated
benchmark in the arrow repository. Given how tricky it is to correctly
benchmark such small functions (you need to be very careful about which
optimization barriers you use and where you place them so as not to
unrealistically pessimize the code under test), I preferred to [write another
benchmark myself](https://github.com/HadrienG2/bitmap-bench.git).
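   For reference, here is a minimal sketch of the three probing strategies the
benchmark compares, reconstructed from the labels below (the authoritative code
is in the linked bitmap-bench repository, so treat the exact signatures here as
approximations):
   
   ```rust
   /// Shift-based probe: compute the bit mask on the fly.
   fn bit_naive(data: &[u8], i: usize) -> bool {
       (data[i / 8] & (1u8 << (i % 8))) != 0
   }
   
   /// `const` lookup table: the table is inlined at each use site.
   const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
   fn bit_const_table(data: &[u8], i: usize) -> bool {
       (data[i / 8] & BIT_MASK[i % 8]) != 0
   }
   
   /// Same table as a `static`: it lives at one fixed address in memory.
   static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
   fn bit_static_table(data: &[u8], i: usize) -> bool {
       (data[i / 8] & BIT_MASK_STATIC[i % 8]) != 0
   }
   ```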
   
   ---
   
   Here's how the implementations compare in a random indexing scenario (the
benchmark probes N bit indices that are initially resident in CPU registers,
but whose values are hidden from the compiler's optimizer):
   
   ```text
   hidden_constant/bit_naive
                           time:   [1.3571 ns 1.3582 ns 1.3593 ns]
                           thrpt:  [2.9426 Gelem/s 2.9452 Gelem/s 2.9475 Gelem/s]
   hidden_constant/bit_const_table
                           time:   [1.5597 ns 1.5729 ns 1.5898 ns]
                           thrpt:  [2.5161 Gelem/s 2.5431 Gelem/s 2.5646 Gelem/s]
   hidden_constant/bit_static_table
                           time:   [1.4149 ns 1.4161 ns 1.4174 ns]
                           thrpt:  [2.8221 Gelem/s 2.8248 Gelem/s 2.8271 Gelem/s]
   ```
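   The "hidden from the optimizer" part is the usual `std::hint::black_box`
trick; roughly like this (the `probe_hidden` helper is illustrative, not the
actual benchmark code):
   
   ```rust
   use std::hint::black_box;
   
   fn bit_naive(data: &[u8], i: usize) -> bool {
       (data[i / 8] & (1u8 << (i % 8))) != 0
   }
   
   /// Counts how many of the probed bits are set. The indices already sit in
   /// registers, but `black_box` forces the optimizer to treat each one as an
   /// opaque runtime value, so it cannot specialize the probe per index.
   fn probe_hidden(data: &[u8], indices: &[usize]) -> u32 {
       let mut set = 0u32;
       for &i in indices {
           set += bit_naive(data, black_box(i)) as u32;
       }
       set
   }
   ```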
   
   <details>
   <summary>Assembly analysis</summary>
   
   At the top, we find the `bit_naive` implementation, whose assembly is very 
unsurprising (just the godbolt code above, duplicated 4 times by the 4-way 
unrolling and with a bit of shuffling from LLVM's instruction scheduler).
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/a041d85c-2805-46fb-a553-5cbb573b8581)
   
   Then comes the `bit_static_table`, which as expected is faster than the 
`bit_const_table` implementation, but hampered by the extra overhead of loading 
data from memory and then doing a `test`/`setne` pair. The assembly is again 
unsurprising given what we've seen previously:
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/8268c133-cdd7-4754-b225-142821f7a6fc)
   
   And as expected, `bit_const_table` is slowest, though it surprises me how
much of a performance difference that single table store makes relative to the
`bit_static_table` version (the store was sadly not completely hoisted out of
the benchmark loop, but it was at least somewhat amortized by the 4-way
unrolling that I added).
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/41c75dd1-c2d8-4d66-9f88-cac3fc94594b)
   
   </details>
   
   ---
   
   And here's how the implementations compare in a linear iteration scenario
(the compiler knows that we're iterating linearly over the bits of a bitmap and
is free to optimize accordingly):
   
   ```text
   linear/bit_naive        time:   [38.233 µs 38.382 µs 38.558 µs]
                           thrpt:  [6.7988 Gelem/s 6.8298 Gelem/s 6.8565 Gelem/s]
   linear/bit_const_table  time:   [38.010 µs 38.100 µs 38.195 µs]
                           thrpt:  [6.8634 Gelem/s 6.8803 Gelem/s 6.8967 Gelem/s]
   linear/bit_static_table time:   [52.366 µs 52.398 µs 52.432 µs]
                           thrpt:  [4.9997 Gelem/s 5.0030 Gelem/s 5.0060 Gelem/s]
   ```
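   A minimal sketch of what such a linear workload looks like (the
`popcount_linear` helper is illustrative, not the actual benchmark code): the
trip count and stride are fully visible to the optimizer, so it may vectorize
or otherwise specialize the per-bit probe.
   
   ```rust
   fn bit_naive(data: &[u8], i: usize) -> bool {
       (data[i / 8] & (1u8 << (i % 8))) != 0
   }
   
   /// Counts the set bits of a bitmap by probing every bit index in order.
   fn popcount_linear(data: &[u8]) -> u64 {
       let mut ones = 0u64;
       for i in 0..data.len() * 8 {
           ones += bit_naive(data, i) as u64;
       }
       ones
   }
   ```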
   
   <details>
   <summary>Assembly analysis</summary>
   
   At the top, we find the `bit_naive` and `bit_const_table` implementations.
They both generate the exact same, pretty optimal assembly in this easier
scenario, as this time the compiler's optimizer managed to optimize out the
table store of `bit_const_table` and just generate a linear bit-probing loop:
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/1658700a-e484-4896-94fa-4afa9105e6c0)
   
   Sadly, `bit_static_table` did not get so lucky. To my surprise, it looks
like the compiler did not manage to prove that the `BIT_MASK_STATIC` array is
never modified, and thus could not generate the same code as for the const
`BIT_MASK`; instead, there are some surprising SSE instructions in there that I
do not have time to look into right now:
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/a30267fb-36e5-42fa-9c19-530876979604)
   
   </details>
   
   ---
   
   All numbers are from an AMD Zen 3 CPU (AMD Ryzen 5 5600G), compiled with
rustc v1.78 on mostly-default `cargo bench` settings (I use the LLD linker
instead of BFD, but that should not affect this particular benchmark).
   
   Overall, I would argue that this shows that the naive version is at least as
fast as the alternatives in every scenario tested, has fewer performance edge
cases in the face of compiler optimizer mishaps, and thus should become the new
implementation of `get_bit`.
   
   But if you disagree with my conclusion, please feel free to re-run my
benchmark with `cargo bench` on any machine whose performance is of interest to
you, or point me to the benchmark that led you to conclude that the `bit_naive`
implementation is slower than the `bit_const_table` one, so that I can figure
out what it does differently and whether that points at an issue in my
benchmark or yours.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
