Issue | 159670
Summary | [x86-64] Miscompilation of certain SIMD intrinsics
Labels | new issue
Assignees |
Reporter | dcommander

I am in the process of converting the x86-64 SIMD modules in libjpeg-turbo to compiler intrinsics (libjpeg-turbo/libjpeg-turbo#732). However, Clang/LLVM tries to outsmart me when it translates certain intrinsics into assembly code, and the resulting assembly code is often not smart at all. Here is a good example:
https://godbolt.org/z/7YGrvGbnc
Clang translates the two intrinsics into seven assembly instructions, whereas GCC correctly translates them into the two assembly instructions that correspond to the intrinsics. The result is that, when compiled with GCC, the intrinsics version of our color conversion algorithm performs as well as the NASM version, but when compiled with Clang, the intrinsics version regresses by 20-30%.
With AVX2, Clang translates the equivalent two intrinsics into two assembly instructions, but they are slower instructions than the instructions that correspond to the intrinsics:
https://godbolt.org/z/nzx7f16e7
If someone goes to the trouble of writing intrinsics that have a [documented 1:1 correspondence with assembly instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html), it's because they are trying to talk to the hardware more directly. The compiler really shouldn't second-guess them in that case.
Is there a way to disable this behavior?
I have tried all of the `-O` options, to no avail. I did observe that changing `-msse2` to `-mssse3` (or targeting any later SIMD instruction set, such as AVX2) causes Clang to compile `_mm_slli_si128()` and `_mm_unpackhi_epi8()` into `vpshufd` and `vpunpcklbw` rather than `pslldq` and `punpckhbw`. That behavior is inscrutable, though, since `pshufd` and `punpcklbw` (the non-VEX forms of `vpshufd` and `vpunpcklbw`) are both SSE2 instructions.
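
For readers without access to the Compiler Explorer links, here is a minimal sketch of the kind of two-intrinsic sequence under discussion. It is not the code behind those links: the function names, the 8-byte shift count, and the operand order are assumptions chosen for illustration. The scalar reference spells out what the pair computes, so any instruction sequence the compiler picks can be checked for equivalence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical example: shift `a` left by 8 bytes, then interleave the
 * high 8 bytes of the shifted value with the high 8 bytes of `b`.
 * The intrinsics are documented as corresponding to pslldq and punpckhbw. */
__m128i shift_then_unpack_hi(__m128i a, __m128i b)
{
    __m128i shifted = _mm_slli_si128(a, 8);  /* bytes 0..7 of a move to 8..15 */
    return _mm_unpackhi_epi8(shifted, b);    /* interleave the upper halves */
}

/* Scalar reference for the same operation, byte 0 being the lowest byte. */
void shift_then_unpack_hi_ref(const uint8_t a[16], const uint8_t b[16],
                              uint8_t out[16])
{
    uint8_t shifted[16] = { 0 };
    memcpy(shifted + 8, a, 8);               /* byte-wise left shift by 8 */
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = shifted[8 + i];     /* from the shifted operand */
        out[2 * i + 1] = b[8 + i];           /* from the second operand */
    }
}
```

Compiling this with something like `clang -O2 -msse2 -S` and `gcc -O2 -msse2 -S` and comparing the bodies of `shift_then_unpack_hi` is one way to see whether each compiler emits the two nominal instructions or substitutes a different sequence.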