https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
IIRC we have a duplicate for this.  The issue is that the SLP vectorizer
doesn't handle reductions (not implemented) and thus the vector results need
to be decomposed for the scalar reduction tail.  On x86 we get with -mavx2

        vmovdqu (%rdi), %xmm0
        vpshufb .LC0(%rip), %xmm0, %xmm0
        vpmovzxbw       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm2
        vpsrldq $8, %xmm1, %xmm1
        vpmovzxbw       %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm1
        vmovaps %xmm2, -72(%rsp)
        movl    -68(%rsp), %eax
        vmovaps %xmm1, -56(%rsp)
        vpmovzxwd       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        addl    -52(%rsp), %eax
        vpmovzxwd       %xmm0, %xmm0
        vmovaps %xmm1, -40(%rsp)
        movl    -56(%rsp), %edx
        addl    -36(%rsp), %eax
        vmovaps %xmm0, -24(%rsp)
        addl    -72(%rsp), %edx
        addl    -20(%rsp), %eax
        addl    -40(%rsp), %edx
        addl    -24(%rsp), %edx
        addl    %edx, %eax
        movl    -48(%rsp), %edx
        addl    -64(%rsp), %edx
        addl    -32(%rsp), %edx
        addl    -16(%rsp), %edx
        addl    %edx, %eax
        movl    -44(%rsp), %edx
        addl    -60(%rsp), %edx
        addl    -28(%rsp), %edx
        addl    -12(%rsp), %edx
        addl    %edx, %eax
        ret

The main issue is of course that we fail to elide the stack temporary:
the vmovaps stores to -72(%rsp) ... -24(%rsp) are immediately reloaded by
the scalar movl/addl sequence.  Re-running FRE after loop opts might help
here, but of course SLP vectorization handling the reduction would be best
(though the tail loop is structured badly, not matching up with the head
one).

Whether vectorizing this specific testcase's head loop is profitable or not
is questionable on its own of course (but you can easily make it so and
still get similar ugly code in the tail).
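
For illustration only, here is a minimal sketch of the shape of code being discussed: a head loop that widens bytes into a small stack temporary, followed by a tail loop that reduces the temporary in a different order.  This is a hypothetical example, not the testcase attached to the report; the function names `sum_via_temp` and `sum_direct` and the 4x4 layout are assumptions made for the sketch.

```c
/* Hypothetical example, not the testcase from the report.

   Head loop: widen 16 bytes into a 4x4 stack temporary.  The two
   middle columns are swapped, so a vectorized store of each row
   needs a byte permute first.  */
unsigned int sum_via_temp(const unsigned char *b)
{
    unsigned int tmp[4][4];
    unsigned int sum = 0;

    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][2] = b[1];
        tmp[i][1] = b[2];
        tmp[i][3] = b[3];
    }

    /* Tail loop: scalar reduction that walks the temporary column by
       column, i.e. not in the order the head loop filled it.  */
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];

    return sum;
}

/* The same sum computed without any temporary, for comparison.  */
unsigned int sum_direct(const unsigned char *b)
{
    unsigned int sum = 0;

    for (int i = 0; i < 16; i++)
        sum += b[i];

    return sum;
}
```

Both functions return the same value for any 16-byte input, so the temporary in the first one is pure overhead.  Comparing the assembly of the two at -O3 -mavx2 is one way to see whether the stores to the temporary and the scalar reloads survive; the exact output will depend on the GCC version.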