https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123198
Bug ID: 123198
Summary: Unnecessary vperms when vectorising with stride of -1
Product: gcc
Version: 15.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: mjr19 at cam dot ac.uk
Target Milestone: ---
subroutine foo(a,b,n)
real(kind(1d0))::a(*),b(*)
integer::i,n
do i=n,1,-1
a(i)=a(i)+b(i)
end do
end subroutine foo
The above compiles with gfortran-15 -O3 -march=core-avx2 to a main loop of
.L5:
vpermpd $27, (%r8,%rax), %ymm0
vpermpd $27, (%r9,%rax), %ymm1
vaddpd %ymm1, %ymm0, %ymm0
vpermpd $27, %ymm0, %ymm0
vmovupd %ymm0, (%r8,%rax)
subq $32, %rax
cmpq %rax, %rsi
jne .L5
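The vpermpd instructions (immediate 27 = 0b00011011, i.e. element order
3,2,1,0) reverse the order of the elements in the vector registers before
the addition, and then reverse the result back afterwards. They seem
unnecessary, as vaddpd acts on each element independently. Should the
first two not be plain vmovupd loads, and the third be eliminated?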
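To illustrate why the element order should not matter here, below is a small self-contained check. It is a sketch only and not part of the original test case: the program name check_order, the array size of 8 and the initial values are arbitrary assumptions. It applies the loop body in descending and in ascending order and compares the results, which agree because each a(i) depends only on a(i) and b(i).

```fortran
program check_order
  ! Hypothetical check program: name, size and data are illustrative only.
  implicit none
  integer, parameter :: dp = kind(1d0)
  integer, parameter :: n = 8
  real(dp) :: a_down(n), a_up(n), b(n)
  integer :: i

  ! Fill the inputs with arbitrary, exactly representable values.
  do i = 1, n
    a_down(i) = real(i, dp)
    b(i) = 10.0_dp * real(i, dp)
  end do
  a_up = a_down

  ! Descending loop, as in the reported subroutine.
  do i = n, 1, -1
    a_down(i) = a_down(i) + b(i)
  end do

  ! Ascending loop over the same elements.
  do i = 1, n
    a_up(i) = a_up(i) + b(i)
  end do

  ! Each iteration touches only index i, so the two results must match.
  if (all(a_down == a_up)) then
    print *, 'identical'
  else
    print *, 'different'
  end if
end program check_order
```

Since no iteration reads an element written by another, the vectoriser is free to load, add and store each group of four elements in memory order, whichever direction the loop counter runs.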
The same issue shows up when comparing
do concurrent (i=1:n)
vs
do concurrent (i=n:1:-1)
for the same loop body, where it is clearer that the iteration order
does not matter.
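For completeness, the two do concurrent variants could be written out as below. This is a sketch that assumes the same declarations as the original subroutine; the names foo_up and foo_down are hypothetical and chosen only to tell the two apart. Compiling with the same options as above (gfortran-15 -O3 -march=core-avx2) should allow the two main loops to be compared side by side.

```fortran
! Ascending variant (hypothetical name foo_up).
subroutine foo_up(a,b,n)
  real(kind(1d0))::a(*),b(*)
  integer::i,n
  do concurrent (i=1:n)
    a(i)=a(i)+b(i)
  end do
end subroutine foo_up

! Descending variant (hypothetical name foo_down), stride of -1.
subroutine foo_down(a,b,n)
  real(kind(1d0))::a(*),b(*)
  integer::i,n
  do concurrent (i=n:1:-1)
    a(i)=a(i)+b(i)
  end do
end subroutine foo_down
```

With do concurrent the iterations may be executed in any order, so the direction of the index range carries no ordering requirement that the generated code would need to preserve.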