HadrienG2 commented on code in PR #5760:
URL: https://github.com/apache/arrow-rs/pull/5760#discussion_r1599381726


##########
arrow-array/src/builder/boolean_builder.rs:
##########
@@ -166,6 +166,14 @@ impl BooleanBuilder {
         BooleanArray::from(array_data)
     }
 
+    /// Returns the current values buffer as a slice
+    ///
+    /// Boolean values are bit-packed into bytes, so to read the i-th boolean,
+    /// you must compute `values_slice[i / 8] & (1 << (i % 8)) == 0`.

Review Comment:
   Will do so, but out of curiosity, did you benchmark that the current 
implementation is actually faster? A [quick ASM 
comparison](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,selection:(endColumn:44,endLineNumber:15,positionColumn:1,positionLineNumber:14,selectionStartColumn:44,selectionStartLineNumber:15,startColumn:1,startLineNumber:14),source:'pub+fn+bit_naive(bitmap:+%26%5Bu8%5D,+idx:+usize)+-%3E+bool+%7B%0A++++bitmap%5Bidx+/+8%5D+%26+(1+%3C%3C+(idx+%25+8))+!!%3D+0%0A%7D%0A%0Apub+fn+bit_const_table(bitmap:+%26%5Bu8%5D,+idx:+usize)+-%3E+bool+%7B%0A++++bitmap%5Bidx+%3E%3E+3%5D+%26+BIT_MASK%5Bidx+%26+7%5D+!!%3D+0%0A%7D%0A%0Aconst+BIT_MASK:+%5Bu8%3B+8%5D+%3D+%5B1,+2,+4,+8,+16,+32,+64,+128%5D%3B%0A%0Apub+fn+bit_static_table(bitmap:+%26%5Bu8%5D,+idx:+usize)+-%3E+bool+%7B%0A++++bitmap%5Bidx+%3E%3E+3%5D+%26+BIT_MASK_STATIC%5Bidx+%26+7%5D+!!%3D+0%0A%7D%0A%0Astatic+BIT_MASK_STATIC:+%5Bu8%3B+8%5D+%3D+BIT_MASK%3B'),l:'5',n:'1',o:
 
'Rust+source+%231',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:r1780,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,libs:!(),options:'-O',overrides:!(),selection:(endColumn:25,endLineNumber:22,positionColumn:25,positionLineNumber:22,selectionStartColumn:25,selectionStartLineNumber:22,startColumn:25,startLineNumber:22),source:1),l:'5',n:'0',o:'+rustc+1.78.0+(Editor+%231)',t:'0')),k:50,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4)
 makes me a little surprised that it is the case.
   
   - Indexing a bitmap is a common operation that optimizing compilers know how 
deal with pretty well, trivially turning `x / 8` into `x >> 3`, `x % 8` into `x 
& 7` and `int & (1 << N)` where N can be proven at compile time to be in bounds 
into a bit test. So effectively we're comparing a `bt/setb` pair into to a 
table lookup + `test/setne` pair here, and from <https://uops.info/table.html>, 
it does not look like `bt/setb` is particularly badly implemented on current 
CPUs.
   - Furthermore, since the table is specified to be a const, and thus does not 
have a dedicated memory location in global program data, extra instructions had 
to be generated to push the table to the stack on every program call, adding a 
memory store (which worries me because these have an unfortunate tendency to 
create superscalar memory hazards that worsen the latency of loads surrounding 
them) on the hot code path.
       * The store can be avoided (or more accurately turned into a load, which 
will reliably hit L1 if the function is called in the kind of tight loops where 
its performance matters) by turning a const into a static, as shown in the 
linked godbolt snippet.
   
   Against, I'm just static-analyzing the assembly by eye here, did not take 
the time to set up a proper microbenchmark. Just surprised that it makes a 
positive difference.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to