https://bugs.llvm.org/show_bug.cgi?id=34843
Bug ID: 34843
Summary: Suboptimal code generation for __builtin_ctz(ll)
Product: clang
Version: 5.0
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: LLVM Codegen
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected]
Right now, when no specific arch target is set, the builtin
__builtin_ctz (and long, long long variants)
will generate a bsf instruction.
This is suboptimal for AMD machines, which can do a TZCNT much faster than they
can do a BSF. Due to the way TZCNT is encoded, it is equal to a REP BSF, so it
is in fact "backwards compatible" as long as the different behavior for a 0 is
fine. And it is, because __builtin_ctz has undefined behavior for 0 (which is
why it can use BSF in the first place).
On Intel hardware, either way is equally fast, so for a generic target it makes
sense to deal with the AMD case and encode the intrinsic as REP BSF/TZNCT.
At least GCC 4.8 and later are able to do this optimization and generate a REP
BSF for their generic target. Clang fails to do so. (It does generate TZCNT
with -march=znver1)
Example snippet:
https://godbolt.org/g/eXU6xf
Of note in this snippet is also that newer GCC adds a XOR ESI, ESI before the
REP BSF. So there may be a false dependency issue in some CPUs.
--
You are receiving this mail because:
You are on the CC list for the bug._______________________________________________
llvm-bugs mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs