https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120751
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2025-06-23 00:00:00         |2026-01-19
                 CC|                            |jamborm at gcc dot gnu.org
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can reproduce a 7% slowdown on Zen4 with -O2 -march=x86-64-v3 [-flto] vs.
GCC 15.2; Zen2 shows a more pronounced regression:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1195.170.0&plot.1=291.170.0&
    38.12%  870630  calculix_peak.a  calculix_peak.amd64-m64-gcc42-nn  [.] e_c3d_
    33.45%  781030  calculix_base.a  calculix_base.amd64-m64-gcc42-nn  [.] e_c3d_
+e_c3d.f:689:38: optimized: loop vectorized using 16 byte vectors and unroll factor 2
+e_c3d.f:689:38: optimized: loop turned into non-loop; it never loops
we are now vectorizing the inner loop in
      do k1=1,3
        do l1=1,3
          s(iii1,jjj1)=s(iii1,jjj1)
     &       +anisox(i1,k1,j1,l1)*w(k1,l1)*weight
          do n1=1,3
            s(iii1,jjj1)=s(iii1,jjj1)
     &         +anisox(m1,k1,n1,l1)
     &         *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
     &         *weight
          enddo
        enddo
      enddo
The scalar code probably has enough independent resources to hide latencies
here and to decouple the accumulator.  I also note that the unrolled vector
version uses FMA a lot, which we avoid in the non-vectorized loop case,
likely because of heuristics regarding such cross-iteration dependences.
Building 454.calculix with -mno-fma added resolves the regression (in fact
it speeds up 454.calculix by 2%), which shows the FMA chain in the
containing loop is the ultimate issue.
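To illustrate the dependence structure (a hand-written sketch under my
assumptions, not compiler output or code from e_c3d.f):

      ! Hypothetical sketch.  The reduction s = s + a(n1)*b(n1)*c can be
      ! emitted as t = a(n1)*b(n1); s = fma(t, c, s), which puts the FMA
      ! latency on the loop-carried chain through s, or with -mno-fma as
      ! below, where the multiplies pipeline freely and only a plain FP
      ! add - cheaper than an FMA on Zen - stays on the chain.
      subroutine chain(s, a, b, c, n)
      integer n, n1
      real s, a(n), b(n), c, t
      do n1 = 1, n
         t = a(n1)*b(n1)*c
         s = s + t
      enddo
      end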
Martin - is the FMA chain avoidance code too pessimistic in this case?
Possibly confused by the mixed-in vector code?  The testcase in
gfortran.dg/reassoc_4.f is exactly this loop and reproduces the issue
with -O2 -march=x86-64-v3 or -O2 -mfma.
I'll note the loop nest is aggressively unrolled at -O3 or with PGO,
so this is not an issue there.
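For reference, this is roughly what the aggressive unrolling buys (again a
sketch under my assumptions, not what GCC actually emits): independent
partial sums that split the single recurrence.

      ! Hypothetical sketch: unrolling by two with separate accumulators
      ! splits the loop-carried chain in two, so two FMAs (or adds) can
      ! be in flight at once and the recurrence latency is roughly
      ! halved.  n is assumed even for brevity.
      subroutine chain2(s, a, b, c, n)
      integer n, n1
      real s, a(n), b(n), c, s0, s1
      s0 = 0.0
      s1 = 0.0
      do n1 = 1, n, 2
         s0 = s0 + a(n1)*b(n1)*c
         s1 = s1 + a(n1+1)*b(n1+1)*c
      enddo
      s = s + s0 + s1
      end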