HadrienG2 commented on code in PR #5760:
URL: https://github.com/apache/arrow-rs/pull/5760#discussion_r1599540493


##########
arrow-array/src/builder/boolean_builder.rs:
##########
@@ -166,6 +166,14 @@ impl BooleanBuilder {
         BooleanArray::from(array_data)
     }
 
+    /// Returns the current values buffer as a slice
+    ///
+    /// Boolean values are bit-packed into bytes, so to read the i-th boolean,
+    /// you must compute `(values_slice[i / 8] & (1 << (i % 8))) != 0`.

Review Comment:
   Just to be clear: overall, I agree that the simple fact that we're having 
this conversation is a very good argument in favor of adding a note about the 
encapsulated `get_bit` to the docs, instead of suggesting a manual 
implementation. So I changed the doc in the latest commit of this branch, 
which I believe should be enough to make it mergeable. If you want, we can 
move the remainder of this discussion about `get_bit()` to a dedicated issue.
   
   But those benchmark results surprised me and I couldn't find the associated 
benchmark in the arrow repository, so given how tricky it is to correctly 
benchmark such small functions (you need to be very careful which optimization 
barriers you use and how you place them in order not to unrealistically 
pessimize the code you're benchmarking), I preferred to [write another 
benchmark myself](https://github.com/HadrienG2/bitmap-bench.git).
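   
   For reference, here is my understanding of the three variants being 
compared (a minimal self-contained sketch; the names follow my benchmark 
repository, and the real arrow-rs code may differ in details):
   
   ```rust
   /// Naive version: compute the mask with a shift.
   pub fn get_bit_naive(data: &[u8], i: usize) -> bool {
       (data[i / 8] & (1 << (i % 8))) != 0
   }
   
   /// Same lookup, but the mask comes from a `const` table, which the
   /// compiler may re-materialize at each use site.
   const BIT_MASK_CONST: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
   pub fn get_bit_const_table(data: &[u8], i: usize) -> bool {
       (data[i / 8] & BIT_MASK_CONST[i % 8]) != 0
   }
   
   /// Same lookup again, but from a `static` table that lives at a fixed
   /// memory address and must therefore be loaded from memory.
   static BIT_MASK_STATIC: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
   pub fn get_bit_static_table(data: &[u8], i: usize) -> bool {
       (data[i / 8] & BIT_MASK_STATIC[i % 8]) != 0
   }
   
   fn main() {
       // All three variants must agree on every bit of a test pattern.
       let data = [0b0000_0101u8, 0b1000_0000];
       for i in 0..16 {
           let expected = get_bit_naive(&data, i);
           assert_eq!(get_bit_const_table(&data, i), expected);
           assert_eq!(get_bit_static_table(&data, i), expected);
       }
       println!("all three variants agree");
   }
   ```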
   
   ---
   
   Here's how the implementations compare in a random indexing scenario (each 
is asked to probe N bit indices which are initially resident in CPU registers, 
but whose values are hidden from the compiler's optimizer):
   
   ```text
   hidden_constant/bit_naive
                           time:   [1.3647 ns 1.3700 ns 1.3757 ns]
                           thrpt:  [2.9076 Gelem/s 2.9197 Gelem/s 2.9311 
Gelem/s]
   hidden_constant/bit_const_table
                           time:   [1.6002 ns 1.6047 ns 1.6103 ns]
                           thrpt:  [2.4840 Gelem/s 2.4927 Gelem/s 2.4998 
Gelem/s]
   hidden_constant/bit_static_table
                           time:   [1.4540 ns 1.4614 ns 1.4703 ns]
                           thrpt:  [2.7206 Gelem/s 2.7370 Gelem/s 2.7510 
Gelem/s]
   ```
   
   <details>
   <summary>Assembly analysis</summary>
   At the top, we find the `bit_naive` implementation, whose assembly is very 
unsurprising (just the godbolt code above, duplicated 4 times by the 4-way 
unrolling that I added to reduce benchmark harness overhead).
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/4f07cb47-b469-405c-ae8a-b266d488c9e8)
   
   Then comes the `bit_static_table`, which as expected is faster than the 
`bit_const_table` implementation, but hampered by the extra overhead of loading 
data from memory and then doing a `test`/`setne` pair. The assembly is again 
unsurprising given what we've seen previously:
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/58c6a5a3-0902-4f55-979f-9e0ce54444e4)
   
   And as expected, the `bit_const_table` version is slowest, though it 
surprises me how much of a performance difference that single table store 
(which was sadly not completely moved out of the benchmark loop, but was at 
least somewhat amortized by the 4-way unrolling that I added) makes with 
respect to the `bit_static_table` version.
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/3244d82f-da55-4492-beab-909f56ccee28)
   </details>
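   
   (For the curious, the "hidden from the optimizer" part of this scenario is 
done with an optimization barrier; in a standalone sketch it boils down to 
`std::hint::black_box`, roughly like this, which is a simplification of my 
actual harness:)
   
   ```rust
   use std::hint::black_box;
   
   fn main() {
       let data = [0b1010_1010u8; 16];
       // black_box makes the index opaque to the optimizer, so the compiler
       // must emit the general bit-probing sequence instead of folding the
       // whole computation into a compile-time constant.
       let i = black_box(42usize);
       let bit = (data[i / 8] & (1 << (i % 8))) != 0;
       assert!(!bit); // bit 2 of 0b1010_1010 is unset
       println!("bit {i} is {bit}");
   }
   ```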
   
   ---
   
   And here's how the implementations compare in a linear iteration scenario 
(the compiler knows that we're iterating linearly over the bits of a bitmap 
and is free to optimize accordingly):
   
   ```text
   linear/bit_naive        time:   [38.233 µs 38.382 µs 38.558 µs]
                           thrpt:  [6.7988 Gelem/s 6.8298 Gelem/s 6.8565 
Gelem/s]
   linear/bit_const_table  time:   [38.010 µs 38.100 µs 38.195 µs]
                           thrpt:  [6.8634 Gelem/s 6.8803 Gelem/s 6.8967 
Gelem/s]
   linear/bit_static_table time:   [52.366 µs 52.398 µs 52.432 µs]
                           thrpt:  [4.9997 Gelem/s 5.0030 Gelem/s 5.0060 
Gelem/s]
   ```
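   
   (The linear scenario is essentially a full in-order sweep over the bitmap, 
something along these lines; again a simplification of my actual harness:)
   
   ```rust
   fn count_set_bits(data: &[u8]) -> usize {
       // The compiler can see that i sweeps the bitmap in order, which opens
       // the door to hoisting mask computations out of the loop.
       let mut count = 0;
       for i in 0..data.len() * 8 {
           if (data[i / 8] & (1 << (i % 8))) != 0 {
               count += 1;
           }
       }
       count
   }
   
   fn main() {
       assert_eq!(count_set_bits(&[0b1111_0000, 0b0000_0001]), 5);
       println!("{}", count_set_bits(&[0xFF; 4])); // 32
   }
   ```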
   
   <details>
   <summary>Assembly analysis</summary>
   At the top, we find the `bit_naive` and `bit_const_table` implementations. 
They both generate the exact same, pretty optimal assembly in this easier 
scenario, as this time the compiler's optimizer managed to optimize out the 
table store of `bit_const_table` and just generated a linear bit-probing loop:
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/1658700a-e484-4896-94fa-4afa9105e6c0)
   
   Sadly, `bit_static_table` did not get so lucky. To my surprise, the compiler 
did not manage to prove that the `BIT_MASK_STATIC` array does not get modified, 
and hence had to load and unpack it every time we switch to the next byte:
   
   
![image](https://github.com/apache/arrow-rs/assets/1305080/a30267fb-36e5-42fa-9c19-530876979604)
   </details>
   
   ---
   
   All numbers are from an AMD Zen 3 CPU (an AMD Ryzen 5 5600G), compiled with 
rustc v1.78 on mostly-default `cargo bench` settings (I use the LLD linker 
instead of BFD, but that should not affect this particular benchmark).
   
   Overall, I would argue that this proves that the naive version is always 
fastest, has fewer performance edge cases in the face of compiler optimizer 
mishaps, and thus should become the new implementation of `get_bit`.
   
   But if you disagree with my conclusion, please feel free to re-run my 
benchmark on any machine whose performance is of interest to you, or point me 
to the benchmark that you used to conclude that the `bit_naive` implementation 
is slower than the `bit_const_table` one, so that I can figure out what it's 
doing differently and whether that points at an issue in my benchmark or yours.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]