https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78116

--- Comment #14 from Pat Haugen <pthaugen at gcc dot gnu.org> ---
(In reply to amker from comment #13)
> We should create another PR for additional copy instructions after my patch
> and close this one.  IMHO they are two different issues.

Yes, I agree. Yuri, can you take care of that?

Additional info: only one copy is actually introduced, but it becomes 4 after
unrolling. This is the loop from the first testcase without -funroll-loops.
It looks like we could get rid of the vmovaps by making zmm2 the destination
of the vpermps (assuming I'm understanding the asm correctly).

.L26:
        vpermps (%rcx), %zmm10, %zmm1
        leal    1(%rsi), %esi
        vmovaps %zmm1, %zmm2
        vmaxps  (%r15,%rdx), %zmm3, %zmm1
        vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
        cmpl    %esi, %r8d
        leaq    -64(%rcx), %rcx
        vmaxps  %zmm1, %zmm2, %zmm1
        vmovups %zmm1, (%rdi,%rdx)
        leaq    64(%rdx), %rdx
        ja      .L26
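
For illustration, a hand-edited sketch of the same loop with zmm2 as the
vpermps destination. This is not compiler output. It assumes the permuted
value in zmm1 has no other use, which appears to hold here because the
following vmaxps overwrites zmm1 before anything reads it.

.L26:
        vpermps (%rcx), %zmm10, %zmm2   # write zmm2 directly, no vmovaps needed
        leal    1(%rsi), %esi
        vmaxps  (%r15,%rdx), %zmm3, %zmm1
        vfnmadd132ps    (%r12,%rdx), %zmm7, %zmm2
        cmpl    %esi, %r8d
        leaq    -64(%rcx), %rcx
        vmaxps  %zmm1, %zmm2, %zmm1
        vmovups %zmm1, (%rdi,%rdx)
        leaq    64(%rdx), %rdx
        ja      .L26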
