https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116109
Bug ID: 116109
Summary: Missed optimisation: unnecessary register dependency
on reduction
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: mjr19 at cam dot ac.uk
Target Milestone: ---
With a loop such as
function mod2(x,n)
  real(kind(1d0))::mod2,x(*),t
  integer::i,n
  t=0
  !$omp simd reduction(+:t)
  do i=1,n
    t=t+x(i)*x(i)
  end do
  mod2=t
end function mod2
compiled with -O3 -mavx2 -mfma -fopenmp, code similar to
.L8:
vmovupd (%rax), %ymm2
addq $32, %rax
vfmadd231pd %ymm2, %ymm2, %ymm0
cmpq %rax, %rcx
jne .L8
is produced (the above is actually from gfortran-13). The corresponding C code
behaves identically.
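For reference, a minimal C analogue of the loop might look like the sketch
below; the reporter's actual C test case is not shown in the report, so its
exact form is an assumption.

/* Hypothetical C analogue of the Fortran reduction above; compile with
 * -O3 -mavx2 -mfma -fopenmp.  The name and signature are illustrative. */
double mod2(const double *x, int n)
{
    double t = 0.0;
#pragma omp simd reduction(+:t)
    for (int i = 0; i < n; i++)
        t += x[i] * x[i];
    return t;
}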
Because each FMA depends on the previous one, this loop executes in a single
FMA unit at one result per the unit's latency. gfortran-14 splits the FMA into
a separate multiply and add, so the loop then runs at one iteration per add
latency, which on older processors beats one iteration per FMA latency.
Adding -funroll-loops --param max-unroll-times=2 gives no performance
improvement and produces
.L8:
vmovupd (%rax), %ymm6
vmovupd 32(%rax), %ymm7
addq $64, %rax
vfmadd231pd %ymm6, %ymm6, %ymm5
vfmadd231pd %ymm7, %ymm7, %ymm5
cmpq %rax, %r11
jne .L8
Changing this to
vxorpd %xmm8, %xmm8, %xmm8
.L8:
vmovupd (%rax), %ymm6
vmovupd 32(%rax), %ymm7
addq $64, %rax
vfmadd231pd %ymm6, %ymm6, %ymm5
vfmadd231pd %ymm7, %ymm7, %ymm8
cmpq %rax, %r11
jne .L8
vaddpd %ymm8, %ymm5, %ymm5
makes it run twice as fast, as the two FMAs in the loop are now independent of
each other. For full performance one might want six or more independent
accumulators, as the latency of an FMA unit is typically 3-5 cycles and most
processors have two FMA units.
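At source level, the requested transformation amounts to keeping several
independent partial sums. A rough C sketch with two accumulators follows; the
function name, the unroll factor, and the remainder handling are illustrative
only.

/* Sketch only: two independent accumulators break the FMA dependency chain,
 * mirroring the hand-edited assembly above.  Six or more accumulators would
 * be wanted for full throughput on CPUs with two FMA units. */
double mod2_two_acc(const double *x, int n)
{
    double t0 = 0.0, t1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        t0 += x[i]   * x[i];
        t1 += x[i+1] * x[i+1];
    }
    if (i < n)               /* odd trailing element */
        t0 += x[i] * x[i];
    return t0 + t1;
}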
I think that both OMP simd reduction(+:) and -Ofast should permit this
rearrangement, which would be equivalent to simulating vector registers of
double the existing width.