https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116
--- Comment #14 from Pat Haugen <pthaugen at gcc dot gnu.org> ---
(In reply to amker from comment #13)
> We should create another PR for additional copy instructions after my patch
> and close this one. IMHO they are two different issues.

Yes, I agree. Yuri, can you take care of that?

Additional info, it's really just one copy introduced, but becomes 4 after
unrolling. This is the loop from the first testcase without -funroll-loops.
Looks like we could get rid of the vmovaps by making zmm2 the dest on the
vpermps (assuming I'm understanding the asm correctly).

.L26:
        vpermps (%rcx), %zmm10, %zmm1
        leal    1(%rsi), %esi
        vmovaps %zmm1, %zmm2
        vmaxps  (%r15,%rdx), %zmm3, %zmm1
        vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
        cmpl    %esi, %r8d
        leaq    -64(%rcx), %rcx
        vmaxps  %zmm1, %zmm2, %zmm1
        vmovups %zmm1, (%rdi,%rdx)
        leaq    64(%rdx), %rdx
        ja      .L26
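
For illustration only, a hand-edited sketch of the same loop with the vpermps
writing zmm2 directly and the vmovaps dropped. This is not compiler output. It
assumes the permuted value in zmm1 is dead after the copy, which appears to
hold in the listing because the following vmaxps overwrites zmm1 before
anything else reads it.

.L26:
        vpermps (%rcx), %zmm10, %zmm2           # permute result goes straight to zmm2
        leal    1(%rsi), %esi
        vmaxps  (%r15,%rdx), %zmm3, %zmm1       # zmm1 is freshly defined here
        vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
        cmpl    %esi, %r8d
        leaq    -64(%rcx), %rcx
        vmaxps  %zmm1, %zmm2, %zmm1
        vmovups %zmm1, (%rdi,%rdx)
        leaq    64(%rdx), %rdx
        ja      .L26

That is one fewer instruction per iteration, so four fewer in the body once
the loop is unrolled four times.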