[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

rguenth at gcc dot gnu.org Mon, 15 Jan 2018 02:20:13 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846


--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #25)
> We're getting a spill/reload inside the loop with AVX512:
> 
>     .L2:
>       vmovdqa64       (%esp), %zmm3
>       vpaddd  (%eax), %zmm3, %zmm2
>       addl    $64, %eax
>       vmovdqa64       %zmm2, (%esp)
>       cmpl    %eax, %edx
>       jne     .L2
> 
> Loop finishes with the accumulator in memory *and* in ZMM2.  The copy in
> ZMM2 is ignored, and we get
> 
>     # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or
>     # VEXTRACTI64X4
>       vmovdqa 32(%esp), %ymm5
>       vpaddd  (%esp), %ymm5, %ymm0
> 
>     # braindead: vextracti128 can write a new reg instead of destroying xmm0
>       vmovdqa %xmm0, %xmm1
>       vextracti128    $1, %ymm0, %xmm0
>       vpaddd  %xmm0, %xmm1, %xmm0
> 
>         ... then a sane 128b hsum as expected, so at least that part went right.

I filed PR83850 for this (I noticed it before committing).  It somehow
regressed, either in register allocation (RA) or in the target.
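
For readers following along, here is a rough scalar model in C of the
reduction order discussed in the quoted text: accumulate in a 16-lane
(512-bit) vector inside the loop, then narrow by adding the high half to
the low half until one lane is left.  This is only a sketch, not the
testcase from the PR; the names sum_narrowing, fold_half and LANES are
made up for illustration, and unsigned lanes are assumed so that
wraparound is well defined.

```c
#include <stdint.h>

/* Hypothetical lane count: 16 x 32-bit lanes models one ZMM register. */
enum { LANES = 16 };

/* Add the upper half of v[0..n) into the lower half and return n/2.
   This models one narrowing step (512b -> 256b, 256b -> 128b, ...). */
static int fold_half(uint32_t *v, int n)
{
    int half = n / 2;
    for (int i = 0; i < half; i++)
        v[i] += v[i + half];
    return half;
}

/* Sum `count` elements; count is assumed to be a multiple of LANES. */
uint32_t sum_narrowing(const uint32_t *arr, int count)
{
    uint32_t acc[LANES] = {0};

    /* Main loop: one lane-wise add per 16 input elements.  In the ideal
       code the accumulator lives in a register for the whole loop. */
    for (int i = 0; i < count; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += arr[i + l];

    /* Epilogue: narrow 16 -> 8 -> 4 -> 2 -> 1 lanes. */
    int n = LANES;
    while (n > 1)
        n = fold_half(acc, n);

    return acc[0];
}
```

Each fold_half call stands in for one extract-high-half plus add; the
first two folds correspond to the 512b -> 256b and 256b -> 128b steps
shown in the quoted asm.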
