https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So probably the big slowdown is because the vectorized loop body is so much
larger.  Unvectorized:

.L61:
        vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm0
        incl    %ecx
        vmulss  (%rdx), %xmm0, %xmm0
        vmovss  %xmm0, (%rdx)
        addq    %rax, %rdx
        cmpl    %r12d, %ecx
        jne     .L61

vectorized (see how we hoist the load), with -march=haswell:

        vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm5
        movq    144(%rsp), %rdi
        leaq    (%rbx,%r13), %rdx
        xorl    %r10d, %r10d
        movq    %rdx, %rsi
        leaq    0(%r13,%rdi), %rcx
        movq    %rcx, %rdi
        vbroadcastss    %xmm5, %ymm5
        .p2align 4,,10
        .p2align 3
.L58:
        vmovss  (%rcx,%rax,2), %xmm1
        vmovss  (%rsi,%rax,2), %xmm0
        incl    %r10d
        vinsertps       $0x10, (%rcx,%r8), %xmm1, %xmm3
        vinsertps       $0x10, (%rsi,%r8), %xmm0, %xmm7
        vmovss  (%rcx), %xmm1
        vmovss  (%rsi), %xmm0
        vinsertps       $0x10, (%rcx,%rax), %xmm1, %xmm1
        vinsertps       $0x10, (%rsi,%rax), %xmm0, %xmm0
        addq    %r9, %rcx
        addq    %r9, %rsi
        vmovlhps        %xmm7, %xmm0, %xmm0
        vmovlhps        %xmm3, %xmm1, %xmm1

^^ not sure why we construct in such a strange way - ICC simply does a single
vmovss and then 7 vinsertps

        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        vmulps  %ymm5, %ymm0, %ymm0
        vmovss  %xmm0, (%rdx)
        vextractps      $1, %xmm0, (%rdx,%rax)
        vextractps      $2, %xmm0, (%rdx,%rax,2)
        vextractps      $3, %xmm0, (%rdx,%r8)
        vextractf128    $0x1, %ymm0, %xmm0
        addq    %r9, %rdx
        vmovss  %xmm0, (%rdi)
        vextractps      $1, %xmm0, (%rdi,%rax)
        vextractps      $2, %xmm0, (%rdi,%rax,2)
        vextractps      $3, %xmm0, (%rdi,%r8)

Similar here.  But fixing this would only reduce the loop by a few stmts.

        addq    %r9, %rdi
        cmpl    %r10d, %r14d
        jne     .L58

Anyway, this size (140 bytes, 9 cache lines) probably blows any loop stream
cache limits (IIRC that was around 3 cache lines), compared to 26 bytes
(2 cache lines) for the scalar version.  Any such considerations would be best
placed in the target's finish_cost hook, where the target knows all stmts that
are going to be emitted and can in theory also cost against the scalar variant
(not easily available).
The SSE variant is smaller, so measuring the slowdown with SSE only would be
interesting.  Hmm - the SSE variant is slower (for all of capacita), but
-fno-tree-vectorize is fastest.