mapleFU commented on PR #39403:
URL: https://github.com/apache/arrow/pull/39403#issuecomment-1873371041
```
uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

uint64_t TrailingBits3(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  uint64_t mask = 0xffffffffffffffff;
  return v & ~(mask << num_bits);
}
```
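As a quick sanity check that both variants return the same value as a shift-based version, here is a minimal sketch. `TrailingBitsRef` is a hypothetical stand-in for the current implementation, written from the `shl`/`shr` pattern rather than copied from the Arrow source, and `AllVariantsAgree` is a helper name made up for this example.

```cpp
#include <cstdint>

// Hypothetical reference: shift the unwanted high bits out, then back.
// The guards avoid undefined behaviour from shifting a 64-bit value by 64.
inline uint64_t TrailingBitsRef(uint64_t v, int num_bits) {
  if (num_bits == 0) return 0;
  if (num_bits >= 64) return v;
  const int n = 64 - num_bits;
  return (v << n) >> n;
}

// Clear the low num_bits, then XOR with v to keep only those bits.
inline uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (num_bits >= 64) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

// Build a mask of the low num_bits and AND it with v.
inline uint64_t TrailingBits3(uint64_t v, int num_bits) {
  if (num_bits >= 64) return v;
  return v & ~(~uint64_t{0} << num_bits);
}

// Returns true if all three agree for every width in [0, 64] on a few inputs.
inline bool AllVariantsAgree() {
  const uint64_t samples[] = {0, 1, 0x8000000000000000ULL,
                              0xdeadbeefcafebabeULL, ~uint64_t{0}};
  for (uint64_t v : samples) {
    for (int n = 0; n <= 64; ++n) {
      const uint64_t expected = TrailingBitsRef(v, n);
      if (TrailingBits2(v, n) != expected) return false;
      if (TrailingBits3(v, n) != expected) return false;
    }
  }
  return true;
}
```

Note that neither variant needs a separate `num_bits == 0` branch: the XOR form yields `v ^ v` and the mask form yields `v & 0`. Negative widths are still undefined behaviour in all three.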
With `-O2`/`-O3` and without `-march=skylake`, the generated code differs
(see https://quick-bench.com/q/3IbTOnH4rShshgE7pwcX6dCbJuY):
Current:
```
neg cl
shl rax, cl
shr rax, cl
```
After:
```
not rdx
mov rax, rdx
and rax, rdi
```
Would you mind testing which is faster on a CPU with AVX2 and BMI2 enabled?
@Hattonuri
When I tried them, both generated the same code with clang 17.0.1 and the arguments
`-std=c++20 -O3 -march=skylake`. The differences are listed here:
https://github.com/apache/arrow/pull/39403#pullrequestreview-1799813236
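For anyone who wants to time the two variants locally, a rough sketch follows. It uses a hand-rolled `std::chrono` loop, which is much cruder than quick-bench or a proper benchmark harness. `RunLoop` and `TimeMs` are hypothetical helper names, and the LCG constants are only there to produce varying inputs.

```cpp
#include <chrono>
#include <cstdint>

inline uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (num_bits >= 64) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

inline uint64_t TrailingBits3(uint64_t v, int num_bits) {
  if (num_bits >= 64) return v;
  return v & ~(~uint64_t{0} << num_bits);
}

// Feeds `fn` a fixed pseudo-random stream and returns a checksum, so the
// compiler cannot discard the work and both variants see identical inputs.
template <typename Fn>
uint64_t RunLoop(Fn fn, int iterations) {
  uint64_t state = 0x9e3779b97f4a7c15ULL;
  uint64_t sum = 0;
  for (int i = 0; i < iterations; ++i) {
    // One LCG step; unsigned overflow wraps by definition.
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    // The top 6 bits give a width in [0, 63].
    sum += fn(state, static_cast<int>(state >> 58));
  }
  return sum;
}

// Returns elapsed wall-clock milliseconds and stores the checksum.
template <typename Fn>
double TimeMs(Fn fn, int iterations, uint64_t* checksum) {
  const auto start = std::chrono::steady_clock::now();
  *checksum = RunLoop(fn, iterations);
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `TimeMs(TrailingBits2, n, &sum2)` and `TimeMs(TrailingBits3, n, &sum3)` with the same `n` should give equal checksums. The timings are only meaningful when compared under the same flags, for example with and without `-march=skylake`.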