https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120751
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2025-06-23 00:00:00         |2026-01-19
                 CC|                            |jamborm at gcc dot gnu.org
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can reproduce a 7% slowdown on Zen4 with -O2 -march=x86-64-v3 [-flto] vs.
GCC 15.2; Zen2 shows a more pronounced regression:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1195.170.0&plot.1=291.170.0&
    38.12%  870630  calculix_peak.a  calculix_peak.amd64-m64-gcc42-nn  [.] e_c3d_
    33.45%  781030  calculix_base.a  calculix_base.amd64-m64-gcc42-nn  [.] e_c3d_
+e_c3d.f:689:38: optimized: loop vectorized using 16 byte vectors and unroll factor 2
+e_c3d.f:689:38: optimized: loop turned into non-loop; it never loops
we are now vectorizing the inner loop in
      do k1=1,3
        do l1=1,3
          s(iii1,jjj1)=s(iii1,jjj1)
     &       +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
          do n1=1,3
            s(iii1,jjj1)=s(iii1,jjj1)
     &         +anisox(m1,k1,n1,l1)
     &         *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &         *weight
          enddo
        enddo
      enddo
The scalar code probably has enough independent resources to hide latencies
here and to decouple the accumulator.  I also note that the unrolled vector
version uses FMA a lot, which we avoid in the non-vectorized loop case,
likely because of heuristics regarding such cross-iteration dependences.
Building 454.calculix with -mno-fma added resolves the regression (in fact
it speeds up 454.calculix by 2%), which shows the FMA chain in the
containing loop is the ultimate issue.
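To illustrate the dependence structure (a hand-written sketch under my
assumptions, not compiler output or code from e_c3d.f):

      ! Hypothetical sketch.  The reduction s = s + a(n1)*b(n1)*c can be
      ! emitted as t = a(n1)*b(n1); s = fma(t, c, s), which puts the FMA
      ! latency on the loop-carried chain through s, or with -mno-fma as
      ! below, where the multiplies pipeline freely and only a plain FP
      ! add - cheaper than an FMA on Zen - stays on the chain.
      subroutine chain(s, a, b, c, n)
      integer n, n1
      real s, a(n), b(n), c, t
      do n1 = 1, n
         t = a(n1)*b(n1)*c
         s = s + t
      enddo
      end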
Martin - is the FMA chain avoidance code too pessimistic in this case?
Possibly confused by the mixed-in vector code?  The testcase in
gfortran.dg/reassoc_4.f is exactly this loop and reproduces the issue
with -O2 -march=x86-64-v3 or -O2 -mfma.
I'll note the loop nest is aggressively unrolled at -O3 or with PGO,
so this is not an issue there.
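For reference, this is roughly what the aggressive unrolling buys (again a
sketch under my assumptions, not what GCC actually emits): independent
partial sums that split the single recurrence.

      ! Hypothetical sketch: unrolling by two with separate accumulators
      ! splits the loop-carried chain in two, so two FMAs (or adds) can
      ! be in flight at once and the recurrence latency is roughly
      ! halved.  n is assumed even for brevity.
      subroutine chain2(s, a, b, c, n)
      integer n, n1
      real s, a(n), b(n), c, s0, s1
      s0 = 0.0
      s1 = 0.0
      do n1 = 1, n, 2
         s0 = s0 + a(n1)*b(n1)*c
         s1 = s1 + a(n1+1)*b(n1+1)*c
      enddo
      s = s + s0 + s1
      end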