https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123198

            Bug ID: 123198
           Summary: Unnecessary vperms when vectorising with stride of -1
           Product: gcc
           Version: 15.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

subroutine foo(a,b,n)
  real(kind(1d0))::a(*),b(*)
  integer::i,n

  do i=n,1,-1
     a(i)=a(i)+b(i)
  end do
end subroutine foo

The above compiles with gfortran-15 -O3 -march=core-avx2 to a main loop of

.L5:
        vpermpd $27, (%r8,%rax), %ymm0
        vpermpd $27, (%r9,%rax), %ymm1
        vaddpd  %ymm1, %ymm0, %ymm0
        vpermpd $27, %ymm0, %ymm0
        vmovupd %ymm0, (%r8,%rax)
        subq    $32, %rax
        cmpq    %rax, %rsi
        jne     .L5

The three vpermpd instructions seem unnecessary. The first two reverse
the order of the elements in each vector register before the addition,
and the third restores the original order afterwards. Since the addition
is elementwise, should not the first two be plain vmovupd, and the third
be eliminated?
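
To illustrate why the permutes look redundant, here is a minimal sketch. It is not part of the original test case, and the program name, variable names and values are made up for illustration. It models one four-element vector iteration with array sections: reversing both operands, adding, and reversing the sum gives the same result as adding directly.

```fortran
program reverse_add_demo
  implicit none
  integer, parameter :: dp = kind(1d0)
  real(dp) :: a(4), b(4), direct(4), tmp(4), via_reverse(4)

  ! Illustrative values only; any values give the same conclusion.
  a = [1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp]
  b = [10.0_dp, 20.0_dp, 30.0_dp, 40.0_dp]

  ! What plain loads followed by vaddpd would compute.
  direct = a + b

  ! What the generated loop does: reverse both operands
  ! (vpermpd $27 selects elements 3,2,1,0), add, then
  ! reverse the sum again before storing it.
  tmp = a(4:1:-1) + b(4:1:-1)
  via_reverse = tmp(4:1:-1)

  if (any(direct /= via_reverse)) error stop "results differ"
  print *, "identical results"
end program reverse_add_demo
```

Each element is added to the same partner in both versions, so the comparison is exact and does not depend on floating-point rounding.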

The same issue affects

  do concurrent (i=1:n)

vs

  do concurrent (i=n:1:-1)

for the same loop body, in which it is clearer that order does not matter.
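
For completeness, a sketch of the do concurrent variant written out in full, assuming the same declarations as foo. The name foo_dc is hypothetical and not taken from the original test case.

```fortran
subroutine foo_dc(a,b,n)
  real(kind(1d0))::a(*),b(*)
  integer::i,n

  ! Iterations of do concurrent may execute in any order, so the
  ! descending bounds do not impose an element order on the compiler.
  do concurrent (i=n:1:-1)
     a(i)=a(i)+b(i)
  end do
end subroutine foo_dc
```

Compiling this with the same options and comparing the main loop against the i=1:n form should show whether the extra vpermpd instructions appear only for the descending bounds.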
