mapleFU commented on PR #39403:
URL: https://github.com/apache/arrow/pull/39403#issuecomment-1873368389

   The code is essentially the same as:
   
   ```
   static void TrailingBits2Bench(benchmark::State& state) {
     // Code before the loop is not measured
     for (auto _ : state) {
       for (uint64_t i = 0; i < 100'000'000; i++) {
         for (int j = 0; j < 64; j++) {
           benchmark::DoNotOptimize(TrailingBits2(i, j));
         }
       }
     }
   }
   BENCHMARK(TrailingBits2Bench);
   
   static void TrailingBits3Bench(benchmark::State& state) {
     // Code before the loop is not measured
     for (auto _ : state) {
       for (uint64_t i = 0; i < 100'000'000; i++) {
         for (int j = 0; j < 64; j++) {
           benchmark::DoNotOptimize(TrailingBits3(i, j));
         }
       }
     }
   }
   BENCHMARK(TrailingBits3Bench);
   ```
   
   With `-march=skylake`, the original code generates `bzhi`, while the new code 
uses `shlx`, `not`, and `and`. I don't know which is faster here.
   
   `bzhi` is probably chosen because the original code matches pattern (d) in 
LLVM's X86 instruction selection: 
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp#L3710-L3716
 Still, I don't actually know which sequence is faster :-(
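
   For reference, here is a minimal sketch of the two source-level shapes being 
   compared. The bodies of `TrailingBits2` and `TrailingBits3` are not shown in 
   this comment, so the names `LowBitsShiftPair`, `LowBitsMask` and 
   `CheckVariantsAgree` are hypothetical stand-ins, not the PR's code. The first 
   uses the shift-left-then-right form that pattern (d) describes; the second 
   builds the mask explicitly with a shift, a `not` and an `and`. Which 
   instructions a compiler actually emits for either one depends on the 
   compiler version and flags.

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical variant A: keep the low `num_bits` bits by shifting left and
   // then right by (64 - num_bits). The guards avoid shifting a 64-bit value by
   // 64, which is undefined behavior in C++.
   constexpr uint64_t LowBitsShiftPair(uint64_t v, int num_bits) {
     if (num_bits <= 0) return 0;
     if (num_bits >= 64) return v;
     const int n = 64 - num_bits;
     return (v << n) >> n;
   }

   // Hypothetical variant B: build the mask explicitly (shift, not, and).
   constexpr uint64_t LowBitsMask(uint64_t v, int num_bits) {
     if (num_bits <= 0) return 0;
     if (num_bits >= 64) return v;
     return v & ~(~uint64_t{0} << num_bits);
   }

   // Asserts that both variants return the same value for a few sample inputs
   // and every bit count from 0 to 64.
   inline void CheckVariantsAgree() {
     const uint64_t samples[] = {0, 1, 0xFF, 0x123456789ABCDEF0ULL, ~uint64_t{0}};
     for (uint64_t v : samples) {
       for (int bits = 0; bits <= 64; ++bits) {
         assert(LowBitsShiftPair(v, bits) == LowBitsMask(v, bits));
       }
     }
   }
   ```

   Both functions are meant to be semantically identical, so any difference in 
   the benchmark above would come from the generated instruction sequence 
   rather than from the result.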


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

