jhorstmann opened a new issue #397:
URL: https://github.com/apache/arrow-rs/issues/397


   In one of our benchmarks, the `concat` kernel was identified as a major
performance bottleneck while sorting, specifically the closures inside
`build_extend_null_bits`. The logic there currently sets bits one at a time
and also contains a branch for every bit:
   
   ```
   if bit_util::get_bit(...) {
       bit_util::set_bit(...);
   }
   ```
   
   I think it should be possible to rewrite this to set multiple bits at the 
same time and remove most of the branch overhead. The general idea would look 
like this:
   
   - append individual bits until the write position in the destination buffer is byte-aligned
   - start a BitChunk iterator on the source buffer and then append u8 or u64 
at a time
   - append the remainder u8 at a time
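
   The three steps above can be sketched as follows. This is a minimal,
standalone illustration and not the arrow implementation: `append_bits` and its
parameters are hypothetical names, the destination is a plain `Vec<u8>`, and
the middle phase copies one `u8` at a time instead of going through a BitChunk
iterator or `u64` words. It assumes LSB-first bit numbering and that any bits
in `dst` past `dst_len` are zero.

```rust
/// Hypothetical helper (not part of the arrow crate): appends `len` bits read
/// from `src`, starting at bit `src_offset`, to `dst`, which already holds
/// `dst_len` bits. Bits are numbered least-significant-bit first.
/// Assumes bits in `dst` past `dst_len` are zero. Returns the new bit length.
fn append_bits(
    dst: &mut Vec<u8>,
    dst_len: usize,
    src: &[u8],
    src_offset: usize,
    len: usize,
) -> usize {
    let get = |i: usize| (src[i / 8] >> (i % 8)) & 1;

    // Grow the destination (zero-filled) so every target byte exists.
    let needed = (dst_len + len + 7) / 8;
    if dst.len() < needed {
        dst.resize(needed, 0);
    }

    let mut written = 0;

    // Phase 1: single bits until the write position is byte-aligned.
    while written < len && (dst_len + written) % 8 != 0 {
        let pos = dst_len + written;
        dst[pos / 8] |= get(src_offset + written) << (pos % 8);
        written += 1;
    }

    // Phase 2: whole bytes. The source may be unaligned, so each output byte
    // is stitched together from at most two neighbouring source bytes.
    while len - written >= 8 {
        let s = src_offset + written;
        let shift = s % 8;
        let lo = src[s / 8] >> shift;
        let hi = if shift == 0 { 0 } else { src[s / 8 + 1] << (8 - shift) };
        dst[(dst_len + written) / 8] = lo | hi;
        written += 8;
    }

    // Phase 3: the remaining (fewer than 8) bits, one at a time.
    while written < len {
        let pos = dst_len + written;
        dst[pos / 8] |= get(src_offset + written) << (pos % 8);
        written += 1;
    }

    dst_len + len
}
```

   In this sketch the per-bit branch disappears entirely, and the only branch
left in the middle phase is taken once per byte rather than once per bit.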
   
   Similar logic would apply to setting all bits to valid, appending chunks of 
u8::MAX or u64::MAX at a time.
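
   For the all-valid case, a sketch under the same assumptions (standalone
code, hypothetical name `append_set_bits`, `Vec<u8>` destination, LSB-first
bits) could mask the partial first and last bytes and fill everything in
between with `u8::MAX`:

```rust
/// Hypothetical helper (not part of the arrow crate): appends `len` set bits
/// to `dst`, which already holds `dst_len` bits. Returns the new bit length.
fn append_set_bits(dst: &mut Vec<u8>, dst_len: usize, len: usize) -> usize {
    let new_len = dst_len + len;
    let needed = (new_len + 7) / 8;
    if dst.len() < needed {
        dst.resize(needed, 0);
    }
    if len == 0 {
        return new_len;
    }

    let first_byte = dst_len / 8;
    let last_byte = (new_len - 1) / 8;

    // Bits at or above the current write position within the first byte.
    let head_mask = u8::MAX << (dst_len % 8);
    // Bits below the new end position within the last byte.
    let tail_mask = if new_len % 8 == 0 {
        u8::MAX
    } else {
        u8::MAX >> (8 - new_len % 8)
    };

    if first_byte == last_byte {
        // The whole run fits inside a single byte.
        dst[first_byte] |= head_mask & tail_mask;
    } else {
        dst[first_byte] |= head_mask;
        // Every byte strictly between first and last is fully valid.
        dst[first_byte + 1..last_byte].fill(u8::MAX);
        dst[last_byte] |= tail_mask;
    }
    new_len
}
```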
   
   The `get_bit` / `set_bit` functions themselves could probably also be sped
up a little. I think that on modern processors, calculating the bit masks
should be faster than using a lookup table. But after the above changes, those
functions would no longer be used in the hot path.
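
   To illustrate the difference between the two approaches, here is a
standalone sketch. The function names and the `BIT_MASK` table are written out
for this example only and are not copied from `bit_util`. Both variants return
the same result; which one is faster would need to be measured.

```rust
// Lookup-table variant: one table entry per bit position within a byte.
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

fn get_bit_lookup(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}

// Computed-mask variant: derive the mask with a shift, no table load needed.
fn get_bit_shift(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & (1u8 << (i & 7))) != 0
}

fn set_bit_shift(data: &mut [u8], i: usize) {
    data[i >> 3] |= 1u8 << (i & 7);
}
```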



