https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the big slowdown is probably because the vectorized loop body is so much
larger.  Unvectorized:

.L61:
        vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm0
        incl    %ecx
        vmulss  (%rdx), %xmm0, %xmm0
        vmovss  %xmm0, (%rdx)
        addq    %rax, %rdx
        cmpl    %r12d, %ecx
        jne     .L61
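
For context, the scalar loop corresponds roughly to the following sketch.
The real source is Fortran (the solv_cap module of capacita); x, n, stride
and c are placeholder names inferred from the assembly, not the actual code:

    /* Rough C sketch of what the scalar loop above computes.  */

    float d1;   /* stands in for the module variable __solv_cap_MOD_d1 */

    void
    scale_strided (float *x, long n, long stride, float c)
    {
      for (long i = 0; i < n; i++)
        /* d1 * c is loop invariant, but the scalar code reloads d1 and
           redoes the multiply on every iteration; the vectorized version
           hoists it out of the loop.  */
        x[i * stride] *= d1 * c;
    }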

Vectorized (note how the __solv_cap_MOD_d1 load and multiply are now hoisted
out of the loop), with -march=haswell:

        vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm5
        movq    144(%rsp), %rdi
        leaq    (%rbx,%r13), %rdx
        xorl    %r10d, %r10d
        movq    %rdx, %rsi
        leaq    0(%r13,%rdi), %rcx
        movq    %rcx, %rdi
        vbroadcastss    %xmm5, %ymm5
        .p2align 4,,10
        .p2align 3
.L58:
        vmovss  (%rcx,%rax,2), %xmm1
        vmovss  (%rsi,%rax,2), %xmm0
        incl    %r10d
        vinsertps       $0x10, (%rcx,%r8), %xmm1, %xmm3
        vinsertps       $0x10, (%rsi,%r8), %xmm0, %xmm7
        vmovss  (%rcx), %xmm1
        vmovss  (%rsi), %xmm0
        vinsertps       $0x10, (%rcx,%rax), %xmm1, %xmm1
        vinsertps       $0x10, (%rsi,%rax), %xmm0, %xmm0
        addq    %r9, %rcx
        addq    %r9, %rsi
        vmovlhps        %xmm7, %xmm0, %xmm0
        vmovlhps        %xmm3, %xmm1, %xmm1

^^ not sure why we construct the vector in such a strange way - ICC simply
does a single vmovss and then 7 vinsertps.

        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        vmulps  %ymm5, %ymm0, %ymm0
        vmovss  %xmm0, (%rdx)
        vextractps      $1, %xmm0, (%rdx,%rax)
        vextractps      $2, %xmm0, (%rdx,%rax,2)
        vextractps      $3, %xmm0, (%rdx,%r8)
        vextractf128    $0x1, %ymm0, %xmm0
        addq    %r9, %rdx
        vmovss  %xmm0, (%rdi)
        vextractps      $1, %xmm0, (%rdi,%rax)
        vextractps      $2, %xmm0, (%rdi,%rax,2)
        vextractps      $3, %xmm0, (%rdi,%r8)

Similar here for the extracts.  But fixing this would only shave off a few stmts.

        addq    %r9, %rdi
        cmpl    %r10d, %r14d
        jne     .L58
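
For reference, one iteration of .L58 effectively does the following (same
placeholder names as in the sketch above; this is an inferred reconstruction,
not the benchmark source).  Because the access is strided, the vectorizer has
to build the vector element by element and scatter the result back the same
way, which is where most of the extra statements come from:

    extern float d1;              /* as in the sketch above */

    void
    scale_strided_vec8 (float *x, long n, long stride, float c)
    {
      float f = d1 * c;           /* hoisted and broadcast before the loop */

      for (long i = 0; i + 8 <= n; i += 8)
        {
          float v[8];

          for (int j = 0; j < 8; j++)    /* 8 scalar loads + vinsertps etc. */
            v[j] = x[(i + j) * stride];

          for (int j = 0; j < 8; j++)    /* the single vmulps of useful work */
            v[j] *= f;

          for (int j = 0; j < 8; j++)    /* 8 vextractps/vmovss stores */
            x[(i + j) * stride] = v[j];
        }
    }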

Anyway, this size (140 bytes, 9 cache lines) probably blows any loop
stream cache limits (IIRC that was around 3 cache lines), compared
to 26 bytes (2 cache lines) for the scalar version.

Any such considerations would be best placed in the target's finish_cost
hook, where the target knows all stmts that are going to be emitted and
can in theory also cost them against the scalar variant (which is not
easily available).
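
Purely as an illustration of the idea (this is not the actual GCC hook
interface; all names and the size budget below are made up), such a
finish_cost-style check could look like:

    struct body_cost_estimate
    {
      unsigned ninsns;     /* statements the vectorizer will emit        */
      unsigned nbytes;     /* rough encoded size of those statements     */
    };

    static int
    body_fits_loop_buffer (const struct body_cost_estimate *vec_body)
    {
      const unsigned loop_buffer_bytes = 64;   /* placeholder limit */
      return vec_body->nbytes <= loop_buffer_bytes;
    }

    /* Bump the reported vector body cost when the estimate no longer fits,
       steering the profitability check back toward the scalar loop.  */
    static unsigned
    adjust_body_cost (unsigned body_cost,
                      const struct body_cost_estimate *vec_body)
    {
      if (!body_fits_loop_buffer (vec_body))
        return body_cost * 2;                  /* arbitrary penalty */
      return body_cost;
    }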

The SSE variant is smaller, so measuring the slowdown with SSE only would
be interesting.  Hmm, the SSE variant is also slower (for all of capacita),
but -fno-tree-vectorize is fastest.
