https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84986
Bug ID: 84986
Summary: Performance regression: loop no longer vectorized (x86-64)
Product: gcc
Version: 8.0.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gergo.barany at inria dot fr
Target Milestone: ---

Created attachment 43713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43713&action=edit
  input function showing performance regression

For context: I throw randomly generated code at compilers and look at
differences in how they optimize; see
https://github.com/gergo-/missed-optimizations for details if interested.
The test case below is entirely artificial; I do *not* have any real-world
application that depends on this.

The attached test.c file contains a function with a simple loop:

    int N;

    long fn1(void) {
        short i;
        long a;
        i = a = 0;
        while (i < N)
            a -= i++;
        return a;
    }

Until recently, this loop used to be vectorized on x86-64, with the core
loop (if I understand the code correctly) looking something like this, as
generated by GCC trunk from 20180206 (with -O3):

      40:   66 0f 6f ce             movdqa %xmm6,%xmm1
      44:   66 0f 6f e3             movdqa %xmm3,%xmm4
      48:   66 0f 6f d3             movdqa %xmm3,%xmm2
      4c:   83 c0 01                add    $0x1,%eax
      4f:   66 0f 65 cb             pcmpgtw %xmm3,%xmm1
      53:   66 0f fd df             paddw  %xmm7,%xmm3
      57:   66 0f 69 e1             punpckhwd %xmm1,%xmm4
      5b:   66 0f 61 d1             punpcklwd %xmm1,%xmm2
      5f:   66 0f 6f cc             movdqa %xmm4,%xmm1
      63:   66 0f 6f e5             movdqa %xmm5,%xmm4
      67:   66 44 0f 6f c2          movdqa %xmm2,%xmm8
      6c:   66 0f 66 e2             pcmpgtd %xmm2,%xmm4
      70:   66 44 0f 62 c4          punpckldq %xmm4,%xmm8
      75:   66 0f 6a d4             punpckhdq %xmm4,%xmm2
      79:   66 0f 6f e1             movdqa %xmm1,%xmm4
      7d:   66 41 0f fb c0          psubq  %xmm8,%xmm0
      82:   66 0f fb c2             psubq  %xmm2,%xmm0
      86:   66 0f 6f d5             movdqa %xmm5,%xmm2
      8a:   66 0f 66 d1             pcmpgtd %xmm1,%xmm2
      8e:   66 0f 62 e2             punpckldq %xmm2,%xmm4
      92:   66 0f 6a ca             punpckhdq %xmm2,%xmm1
      96:   66 0f fb c4             psubq  %xmm4,%xmm0
      9a:   66 0f fb c1             psubq  %xmm1,%xmm0
      9e:   39 c1                   cmp    %eax,%ecx
      a0:   77 9e                   ja     40 <fn1+0x40>

(I'm sorry this comes from objdump; I didn't keep that GCC version
around to generate a nicer assembly listing.)

With a version from 20180319 (r258665), this is no longer the case:

    .L3:
        movswq  %dx, %rcx
        addl    $1, %edx
        subq    %rcx, %rax
        movswl  %dx, %ecx
        cmpl    %esi, %ecx
        jl      .L3

Linking the two versions against a driver program, which simply calls this
function many times after setting N to SHRT_MAX, shows a slowdown of about
1.8x:

    $ time ./test.20180206 ; time ./test.20180319
    32767 elements in 0.000009 sec on average, result = -536821761000000

    real    0m8.875s
    user    0m8.844s
    sys     0m0.028s
    32767 elements in 0.000016 sec on average, result = -536821761000000

    real    0m15.691s
    user    0m15.688s
    sys     0m0.000s

Target: x86_64-pc-linux-gnu
Configured with: ../../src/gcc/configure --prefix=/home/gergo/optcheck/compilers/install --enable-languages=c --with-newlib --without-headers --disable-bootstrap --disable-nls --disable-shared --disable-multilib --disable-decimal-float --disable-threads --disable-libatomic --disable-libgomp --disable-libmpx --disable-libquadmath --disable-libssp --disable-libvtv --disable-libstdcxx --program-prefix=optcheck-x86- --target=x86_64-pc-linux-gnu
Thread model: single

This is under Linux on a machine whose CPU identifies itself as Intel(R)
Core(TM) i7-4712HQ CPU @ 2.30GHz.

For whatever it's worth, Clang goes the opposite way: it vectorizes very
aggressively and ends up slower:

    $ time ./test.clang
    32767 elements in 0.000019 sec on average, result = -536821761000000

    real    0m18.930s
    user    0m18.928s
    sys     0m0.000s

With the previous version, GCC was about 2.1x faster than Clang; this seems
to have regressed to "only" 1.2x faster.