mapleFU commented on PR #39403:
URL: https://github.com/apache/arrow/pull/39403#issuecomment-1873371041

   ```
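   #include <cstdint>

   // Shift the high bits out and back in, then XOR with v so that only the
   // low num_bits bits remain.
   uint64_t TrailingBits2(uint64_t v, int num_bits) {
     if (__builtin_expect(num_bits >= 64, 0)) return v;
     return ((v >> num_bits) << num_bits) ^ v;
   }
   
   // Build a mask covering the high bits and AND v with its complement.
   uint64_t TrailingBits3(uint64_t v, int num_bits) {
     if (__builtin_expect(num_bits >= 64, 0)) return v;
     uint64_t mask = 0xffffffffffffffff;
     return v & ~(mask << num_bits);
   }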
   ```
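
A minimal sketch for checking that the two variants above return the same result before comparing their speed. `TrailingBitsRef` and `CheckVariantsAgree` are hypothetical helpers introduced only for this check and are not part of Arrow; the sample values are arbitrary, and `__builtin_expect` assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>

inline uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

inline uint64_t TrailingBits3(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  uint64_t mask = 0xffffffffffffffff;
  return v & ~(mask << num_bits);
}

// Hypothetical reference: copy the low num_bits bits one at a time.
inline uint64_t TrailingBitsRef(uint64_t v, int num_bits) {
  uint64_t out = 0;
  for (int i = 0; i < num_bits && i < 64; ++i) {
    out |= v & (uint64_t{1} << i);
  }
  return out;
}

// Hypothetical helper: compare both variants against the reference for
// every num_bits in [0, 64] on a few arbitrary sample values.
inline void CheckVariantsAgree() {
  const uint64_t samples[] = {0, 1, 0x8000000000000000ULL,
                              0xdeadbeefcafef00dULL, ~uint64_t{0}};
  for (uint64_t v : samples) {
    for (int n = 0; n <= 64; ++n) {
      const uint64_t expected = TrailingBitsRef(v, n);
      assert(TrailingBits2(v, n) == expected);
      assert(TrailingBits3(v, n) == expected);
    }
  }
}
```

Calling `CheckVariantsAgree()` from a test or from `main` exercises the `num_bits == 0` and `num_bits == 64` edge cases as well. Negative `num_bits` is not covered, since shifting by a negative count is undefined behavior.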
   
   With `-O2`/`-O3` and without `-march=skylake`, the generated code 
(see https://quick-bench.com/q/3IbTOnH4rShshgE7pwcX6dCbJuY) is:
   
   Current:
   
   ```
           neg     cl
           shl     rax, cl
           shr     rax, cl
   ```
   
   After:
   
   ```
           not     rdx
           mov     rax, rdx
           and     rax, rdi
   ```
   
   Would you mind testing which is faster on a CPU with AVX2 and BMI2 enabled? 
@Hattonuri 
   
   When I tried them, both generated the same code with clang 17.0.1 and the 
arguments `-std=c++20 -O3 -march=skylake`. The differences are listed here: 
https://github.com/apache/arrow/pull/39403#pullrequestreview-1799813236

