Issue 64306
Summary x64 Codegen worse than MSVC for vectorized loop
Labels
Assignees
Reporter rainerzufalldererste
I haven't had the chance to investigate much yet, but this is the loop: https://godbolt.org/z/e3dMW5fbe (taken from https://github.com/rainerzufalldererste/simd_dct/blob/cstiller/avx_dct/src/simd_dct.cpp#L2062)

7680x7680 file, 1024 runs (AVX2 variant)

Zen4:
| compiler | avx (std-dev) | avg (std-dev) |
| :- | - | - |
| msvc | 0.32 clk/byte (0.31 ~ 0.32) | 13483.36 MiB/s (13345.56 ~ 13624.05) |
| clang++-16 | 0.37 clk/byte (0.36 ~ 0.38) | 11602.44 MiB/s (11401.21 ~ 11810.91) |
| g++-11.3 | 0.36 clk/byte (0.35 ~ 0.37) | 11828.99 MiB/s (11575.90 ~ 12093.38) |


Skylake-Client:
| compiler | avx (std-dev) | avg (std-dev) |
| :- | - | - |
| msvc | 0.63 clk/byte (0.61 ~ 0.65) | 4828.56 MiB/s (4700.72 ~ 4963.55) |
| clang++-16 | 0.64 clk/byte (0.61 ~ 0.67) | 4743.21 MiB/s (4550.35 ~ 4953.13) |
| g++-11.3 | 0.64 clk/byte (0.61 ~ 0.67) | 4737.80 MiB/s (4534.87 ~ 4959.73) |
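
To put the gap in perspective, here is a small sketch that turns the average throughput figures from the two tables into a percentage deficit relative to MSVC. The function name `throughput_deficit_pct` and the dictionary layout are only illustrative, not part of the benchmark harness; the numbers are copied from the tables above.

```python
# Average throughput in MiB/s, copied from the tables in this report.
RESULTS = {
    "Zen4": {"msvc": 13483.36, "clang++-16": 11602.44, "g++-11.3": 11828.99},
    "Skylake-Client": {"msvc": 4828.56, "clang++-16": 4743.21, "g++-11.3": 4737.80},
}


def throughput_deficit_pct(baseline: float, other: float) -> float:
    """Return how much lower `other` is than `baseline`, in percent."""
    return (1.0 - other / baseline) * 100.0


if __name__ == "__main__":
    for cpu, by_compiler in RESULTS.items():
        baseline = by_compiler["msvc"]
        for compiler, mib_per_s in by_compiler.items():
            if compiler == "msvc":
                continue
            deficit = throughput_deficit_pct(baseline, mib_per_s)
            print(f"{cpu}: {compiler} is {deficit:.2f}% below msvc")
```

On Zen4 this works out to roughly 14% for clang++-16 and roughly 12% for g++-11.3. On Skylake-Client both are under 2%, and the reported ranges overlap with MSVC's, so the difference there may well be noise.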

It is very surprising to see MSVC doing this well here. I'll try to investigate further to pin down which part of the generated assembly is particularly detrimental, but I wanted to document this somewhere for now. The AVX-512 variant performs significantly better with clang than with both GCC and MSVC, which is why this result surprised me so much.