https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116109

            Bug ID: 116109
           Summary: Missed optimisation: unnecessary register dependency
                    on reduction
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

With a loop such as

function mod2(x,n)
  real(kind(1d0))::mod2,x(*),t
  integer::i,n

  t=0
  !$omp simd reduction(+:t)
  do i=1,n
     t=t+x(i)*x(i)
  end do

  mod2=t
end function mod2

compiled with -O3 -mavx2 -mfma -fopenmp, code similar to

.L8:
        vmovupd (%rax), %ymm2
        addq    $32, %rax
        vfmadd231pd     %ymm2, %ymm2, %ymm0
        cmpq    %rax, %rcx
        jne     .L8

is produced (the listing above is actually from gfortran-13). The corresponding C
code behaves identically.
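For reference, a C translation of the Fortran function (my own sketch, not taken from the original report; the name mod2 simply mirrors the Fortran version) that exhibits the same dependent-FMA loop when compiled with -O3 -mavx2 -mfma -fopenmp would be:

```c
/* C equivalent of the Fortran mod2: sum of squares of x[0..n-1].
   The OpenMP simd reduction clause matches the Fortran directive. */
#include <stddef.h>

double mod2(const double *x, size_t n)
{
    double t = 0.0;
    #pragma omp simd reduction(+:t)
    for (size_t i = 0; i < n; i++)
        t += x[i] * x[i];
    return t;
}
```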

Because each fma depends on the previous one, the loop runs on a single fma unit
at one result per that unit's latency. gfortran-14 instead splits the fma into a
separate mul and add; the dependency chain then advances at one iteration per add
latency, which, on older processors, beats one iteration per fma latency.

Adding -funroll-loops --param max-unroll-times=2 gives no performance
improvement, and

.L8:
        vmovupd (%rax), %ymm6
        vmovupd 32(%rax), %ymm7
        addq    $64, %rax
        vfmadd231pd     %ymm6, %ymm6, %ymm5
        vfmadd231pd     %ymm7, %ymm7, %ymm5
        cmpq    %rax, %r11
        jne     .L8

changing this to

        vxorpd  %xmm8, %xmm8, %xmm8
.L8:
        vmovupd (%rax), %ymm6
        vmovupd 32(%rax), %ymm7
        addq    $64, %rax
        vfmadd231pd     %ymm6, %ymm6, %ymm5
        vfmadd231pd     %ymm7, %ymm7, %ymm8
        cmpq    %rax, %r11
        jne     .L8
        vaddpd %ymm8, %ymm5, %ymm5

runs twice as fast, as the two fmas in the loop are now independent of each
other. For full performance one might want six or more independent accumulators,
as the latency of an FMA unit is typically 3-5 cycles, and most processors have
two such units.
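The rearrangement can be written by hand in C as a multiple-accumulator loop (an illustrative sketch; the accumulator count of 4 and the name mod2_multi are my choices, the report suggests six or more accumulators for full throughput):

```c
/* Sum of squares with four independent accumulators: t0..t3 carry
   separate FMA chains that the CPU can execute in parallel; they are
   merged once after the loop, with a scalar tail for the remainder. */
#include <stddef.h>

double mod2_multi(const double *x, size_t n)
{
    double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += x[i]     * x[i];
        t1 += x[i + 1] * x[i + 1];
        t2 += x[i + 2] * x[i + 2];
        t3 += x[i + 3] * x[i + 3];
    }
    double t = (t0 + t1) + (t2 + t3);
    for (; i < n; i++)               /* remainder iterations */
        t += x[i] * x[i];
    return t;
}
```

Note that this changes the order in which the partial sums are combined, which is why the compiler needs permission (reduction(+:t) or -Ofast) to do it.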

I think that both OMP simd reduction(+:) and -Ofast should permit this
rearrangement; it is equivalent to simulating vector registers of double the
existing width.
