https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #25)
> We're getting a spill/reload inside the loop with AVX512:
>
> .L2:
>         vmovdqa64 (%esp), %zmm3
>         vpaddd (%eax), %zmm3, %zmm2
>         addl $64, %eax
>         vmovdqa64 %zmm2, (%esp)
>         cmpl %eax, %edx
>         jne .L2
>
> Loop finishes with the accumulator in memory *and* in ZMM2. The copy in
> ZMM2 is ignored, and we get
>
> # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
>         vmovdqa 32(%esp), %ymm5
>         vpaddd (%esp), %ymm5, %ymm0
>
> # braindead: vextracti128 can write a new reg instead of destroying xmm0
>         vmovdqa %xmm0, %xmm1
>         vextracti128 $1, %ymm0, %xmm0
>         vpaddd %xmm0, %xmm1, %xmm0
>
> ... then a sane 128b hsum as expected, so at least that part went
> right.

I filed PR83850 for this (I noticed this before committing). This somehow
regressed in RA or the target.
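
For reference, the narrowing order described in the quoted comment (512 bits to 256, 256 to 128, then a 128-bit horizontal sum) can be modelled in portable C. This is only an illustrative sketch, not code from this PR or from GCC: the helper name `hsum16_narrowing` is hypothetical, and a plain 16-element array stands in for one ZMM accumulator of 32-bit lanes. Unsigned arithmetic is assumed so that additions wrap the way `vpaddd` does.

```c
#include <stdint.h>

/* Hypothetical helper (not from the PR): models a narrowing horizontal sum
   over a 16-lane accumulator, i.e. one ZMM register of 32-bit elements. */
static uint32_t hsum16_narrowing(const uint32_t acc[16])
{
    uint32_t v[16];

    /* Work on a copy so the caller's accumulator is left untouched. */
    for (int i = 0; i < 16; i++)
        v[i] = acc[i];

    /* Each pass adds the high half onto the low half and halves the
       active width: 16 -> 8 -> 4 -> 2 -> 1 lanes.  The first two passes
       correspond to the 512->256 and 256->128 narrowing steps; the last
       two correspond to the 128-bit horizontal sum. */
    for (int width = 8; width >= 1; width /= 2)
        for (int i = 0; i < width; i++)
            v[i] += v[i + width];

    return v[0];
}
```

Calling `hsum16_narrowing` on an array holding the values 1 through 16 returns 136. The sketch only models the order of the additions; it says nothing about register allocation, which is where the spill/reload reported above comes from.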