jhorstmann opened a new issue #397:
URL: https://github.com/apache/arrow-rs/issues/397
In one of our benchmarks the `concat` kernel was identified as a big
performance bottleneck while sorting, specifically the closures inside
`build_extend_null_bits`. The logic there currently sets bits one at a time
and also contains a branch for every bit:
```
if bit_util::get_bit(...) {
bit_util::set_bit(...);
}
```
I think it should be possible to rewrite this to set multiple bits at the
same time and remove most of the branch overhead. The general idea would look
like this:
- append individual bits until the write position in the destination buffer
is aligned to a byte boundary
- start a BitChunk iterator on the source buffer and then append a u8 or a
u64 at a time
- append the remainder one u8 at a time
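The three steps above can be sketched roughly as follows. This is only an illustration on plain byte slices, not the arrow-rs implementation: `append_bits`, `read_u8_at` and the local `get_bit` / `set_bit` helpers are hypothetical names, the bitmap is assumed to be LSB-first within each byte, and the destination bits past `dst_len` are assumed to be zero. It copies one u8 at a time in the middle phase; a u64 variant would follow the same pattern.

```rust
/// Read bit `i`, LSB-first within each byte (assumed bitmap layout).
fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i / 8] >> (i % 8)) & 1 != 0
}

/// Set bit `i` to 1.
fn set_bit(data: &mut [u8], i: usize) {
    data[i / 8] |= 1u8 << (i % 8);
}

/// Load 8 bits starting at an arbitrary bit offset in `src`.
/// The caller must guarantee `bit_offset + 8 <= src.len() * 8`.
fn read_u8_at(src: &[u8], bit_offset: usize) -> u8 {
    let byte = bit_offset / 8;
    let shift = bit_offset % 8;
    if shift == 0 {
        src[byte]
    } else {
        // Combine the high bits of this byte with the low bits of the next.
        (src[byte] >> shift) | (src[byte + 1] << (8 - shift))
    }
}

/// Append `len` bits of `src`, starting at bit `src_offset`, to `dst`,
/// which currently holds `dst_len` bits. Returns the new bit length.
/// Hypothetical helper: assumes bits of `dst` past `dst_len` are zero.
fn append_bits(
    dst: &mut Vec<u8>,
    dst_len: usize,
    src: &[u8],
    src_offset: usize,
    len: usize,
) -> usize {
    let new_len = dst_len + len;
    dst.resize((new_len + 7) / 8, 0);
    let mut done = 0;

    // Step 1: single bits until the destination is byte aligned (at most 7).
    while done < len && (dst_len + done) % 8 != 0 {
        if get_bit(src, src_offset + done) {
            set_bit(dst, dst_len + done);
        }
        done += 1;
    }

    // Step 2: whole bytes, with no per-bit branch.
    while len - done >= 8 {
        dst[(dst_len + done) / 8] = read_u8_at(src, src_offset + done);
        done += 8;
    }

    // Step 3: the remaining bits (at most 7).
    while done < len {
        if get_bit(src, src_offset + done) {
            set_bit(dst, dst_len + done);
        }
        done += 1;
    }
    new_len
}
```

The per-bit branch only remains in the first and last step, each of which handles fewer than eight bits per call.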
Similar logic would apply to setting all bits to valid, appending chunks of
u8::MAX or u64::MAX at a time.
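A minimal sketch of the all-valid case, again on a plain `Vec<u8>` and under the same assumptions (LSB-first bits, zeroed bits past `dst_len`). The function name `append_valid_bits` is hypothetical and not part of arrow-rs. The partial first and last bytes are handled with masks, and everything in between is filled with `u8::MAX`.

```rust
/// Append `len` set bits to `dst`, which currently holds `dst_len` bits.
/// Returns the new bit length. Hypothetical helper for illustration only.
fn append_valid_bits(dst: &mut Vec<u8>, dst_len: usize, len: usize) -> usize {
    let new_len = dst_len + len;
    dst.resize((new_len + 7) / 8, 0);
    if len == 0 {
        return new_len;
    }

    let first_byte = dst_len / 8;
    let last_byte = (new_len - 1) / 8;

    // Bits from the current write position to the top of the first byte.
    let head_mask = u8::MAX << (dst_len % 8);
    // Bits from the bottom of the last byte up to the new length.
    let tail_bits = new_len % 8;
    let tail_mask = if tail_bits == 0 {
        u8::MAX
    } else {
        u8::MAX >> (8 - tail_bits)
    };

    if first_byte == last_byte {
        // Everything fits into one byte: intersect both masks.
        dst[first_byte] |= head_mask & tail_mask;
    } else {
        dst[first_byte] |= head_mask;
        // Whole bytes in the middle are set without touching single bits.
        dst[first_byte + 1..last_byte].fill(u8::MAX);
        dst[last_byte] |= tail_mask;
    }
    new_len
}
```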
The `get_bit` / `set_bit` functions themselves could probably also be sped
up a little. I think that on modern processors calculating the bit masks
instead of using a lookup table should be faster. But after the above changes, those
functions would no longer be used in the hot path.
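To make the two variants concrete, here is a small sketch comparing a table lookup with a computed mask. The names `BIT_MASK`, `get_bit_lookup`, `get_bit_shift` and `set_bit_shift` are chosen for this example only, and whether the shift version is actually faster would need to be measured on the target hardware.

```rust
/// Lookup table with one mask per bit position (illustrative).
const BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

/// Table-based variant: one extra memory load for the mask.
fn get_bit_lookup(data: &[u8], i: usize) -> bool {
    data[i >> 3] & BIT_MASK[i & 7] != 0
}

/// Computed variant: the mask comes from a shift, no table access.
fn get_bit_shift(data: &[u8], i: usize) -> bool {
    (data[i >> 3] >> (i & 7)) & 1 != 0
}

/// Computed variant of setting a bit.
fn set_bit_shift(data: &mut [u8], i: usize) {
    data[i >> 3] |= 1u8 << (i & 7);
}
```

Both getters return the same result for every bit index, so the choice between them is purely a performance question.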