https://bugs.llvm.org/show_bug.cgi?id=34843

            Bug ID: 34843
           Summary: Suboptimal code generation for __builtin_ctz(ll)
           Product: clang
           Version: 5.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: LLVM Codegen
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected]

Right now, when no specific target architecture is set, the builtin

__builtin_ctz (and its long and long long variants, __builtin_ctzl and
__builtin_ctzll)

generates a BSF instruction.

This is suboptimal for AMD machines, which can do a TZCNT much faster than they
can do a BSF. TZCNT is encoded as a BSF with a REP (F3) prefix, and CPUs without
TZCNT support ignore that prefix and execute a plain BSF. So REP BSF is in fact
"backwards compatible" as long as the different behavior for a 0 input is fine
(TZCNT returns the operand width, BSF leaves the destination undefined). And it
is fine, because __builtin_ctz has undefined behavior for 0 (which is why it can
use BSF in the first place).
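
To illustrate why the zero case does not constrain the compiler: any caller
that needs a defined result for 0 has to test for it explicitly, so the builtin
itself is only ever required to be correct for nonzero inputs. A minimal sketch
(the helper name ctz_or_width is made up for this example, it is not part of
clang or GCC):

```c
#include <limits.h>

/* Hypothetical helper for illustration only: a count-trailing-zeros
   that is defined for every input, including zero. */
static inline unsigned ctz_or_width(unsigned x)
{
    /* __builtin_ctz(0) is undefined, so zero is handled before the builtin
       is reached. Returning the operand width mirrors what TZCNT produces
       for a zero input. */
    if (x == 0)
        return (unsigned)(sizeof x * CHAR_BIT);

    /* For nonzero x, BSF and TZCNT (REP BSF) give the same result. */
    return (unsigned)__builtin_ctz(x);
}
```

For example, ctz_or_width(8u) evaluates to 3, and ctz_or_width(0u) evaluates to
the width of unsigned in bits.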

On Intel hardware, either way is equally fast, so for a generic target it makes
sense to deal with the AMD case and encode the intrinsic as REP BSF/TZCNT.

GCC 4.8 and later, at least, do this optimization and generate a REP BSF for
their generic target. Clang fails to do so. (It does generate TZCNT with
-march=znver1.)

Example snippet:
https://godbolt.org/g/eXU6xf
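
The linked snippet is not reproduced in this message. As an assumption about
what a minimal reproducer looks like (the function names are hypothetical),
something along these lines, compiled with -O2 and no -march option, should
show the difference between the two compilers:

```c
/* Hypothetical reproducer, not the content of the godbolt link.
   Compare the assembly from e.g. `clang -O2 -S` and `gcc -O2 -S`
   on x86-64 with no -march flag. */

int count_tz(unsigned x)
{
    return __builtin_ctz(x);    /* int variant */
}

int count_tz_l(unsigned long x)
{
    return __builtin_ctzl(x);   /* long variant */
}

int count_tz_ll(unsigned long long x)
{
    return __builtin_ctzll(x);  /* long long variant */
}
```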

Also of note in this snippet: newer GCC adds an XOR ESI, ESI before the REP
BSF. So there may be a false dependency issue on some CPUs.
