https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Compared to GCC 15.2 I can reproduce a ~10% slowdown on Zen4 for both PRs
combined. A perf profile shows the following (base is GCC 15, peak is GCC 16):
14.47%  130071  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] u_shift_fermion
12.37%  111743  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
12.07%  109548  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
11.61%  103770  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] u_shift_fermion
 9.52%   97952  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_na
 7.57%   78713  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_na
 5.40%   51737  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_nn
 5.37%   52655  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_nn
 2.79%   25217  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_an
 2.62%   23754  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_an
 2.33%   26147  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] path_product
 2.30%   26739  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] path_product
where there's a clear regression in u_shift_fermion and mult_su3_na. The
former is from quark_stuff.c and the latter from m_mat_na.c; the -fopt-info
differences for those are

+quark_stuff.c:162:34: optimized: sinking common stores to vec[1]

but locations with -flto might be misleading.
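For reference, a minimal made-up sketch (not from the benchmark) of what the
"sinking common stores" transform means: stores to the same location on all
incoming paths of a control-flow join are replaced by one store after the join.

/* Made-up illustration of "sinking common stores": both branches store
   to vec[1], so the two stores can be sunk to a single store after the
   control-flow join.  */
void sink_example (double *vec, double x, int cond)
{
  if (cond)
    vec[1] = x * 2.0;   /* store on the then path */
  else
    vec[1] = x + 1.0;   /* store on the else path */
  /* After the transform, conceptually:
       _t = cond ? x * 2.0 : x + 1.0;
       vec[1] = _t;                      <- one sunk store  */
}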
Without -flto the regression is bigger (18%); dropping -fprofile-use in
favor of plain -O3 produces a similar result. Thus -O3 -march=x86-64-v3
is what I'm looking at now instead. The profile difference there is even more
pronounced:
11.98%  108891  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
 9.84%  101290  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_na
 8.34%   75326  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] su3_projector
 8.31%   78868  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_nn
 7.40%   66823  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_mat_vec
 7.03%   63486  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_adj_su3_mat_vec
 6.35%   61230  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_nn
 5.88%   53026  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_mat_vec
 5.70%   51277  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_adj_su3_mat_vec
 5.22%   58362  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_na
 4.78%   44280  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
 1.52%   18881  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] su3mat_copy
 1.46%   18574  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] su3mat_copy
Note that scalar_mult_add_su3_matrix is faster with GCC 16 but mult_su3_na
is a lot slower. We can see

-m_mat_na.c:31:14: optimized: loop vectorized using 16 byte vectors
+m_mat_na.c:31:14: optimized: loop vectorized using 32 byte vectors and unroll factor 2

meaning we're now using AVX2 vectors and SLP with two lanes. IIRC we've
tuned for this kernel in the past.
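Only to picture the difference in store shape (a rough sketch using the GNU
vector extension with made-up helper names, not the code GCC actually
generates): with 16 byte vectors one { real, imag } result of c is written
per vector store, with 32 byte vectors two adjacent complex results go out
with a single store.

typedef double v2df __attribute__ ((vector_size (16)));
typedef double v4df __attribute__ ((vector_size (32)));

/* Illustration only: GCC 15 shape, one { real, imag } pair of the result
   written per 16-byte store.  */
static inline void store_one_result (double *dst, double cr, double ci)
{
  v2df r = { cr, ci };
  __builtin_memcpy (dst, &r, sizeof r);
}

/* Illustration only: GCC 16 shape, two adjacent { real, imag } pairs
   written with one 32-byte store, leaving a scalar remainder when the
   number of results is odd.  */
static inline void store_two_results (double *dst, double cr0, double ci0,
                                      double cr1, double ci1)
{
  v4df r = { cr0, ci0, cr1, ci1 };
  __builtin_memcpy (dst, &r, sizeof r);
}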
For a standalone testcase t.c of this kernel (the source is quoted at the end
of this comment) we get

t.c:12:16: optimized: loop with 3 iterations completely unrolled (header execution count 268435456)
t.c:11:14: optimized: loop vectorized using 32 byte vectors and unroll factor 2
t.c:11:14: optimized: loop versioned for vectorization because of possible aliasing
t.c:11:14: optimized: loop with 2 iterations completely unrolled (header execution count 17895697)
t.c:11:14: optimized: loop turned into non-loop; it never loops
t.c:14:11: optimized: loop turned into non-loop; it never loops
t.c:30:24: optimized: sinking common stores to c_43(D)->e[2][2].imag
t.c:29:24: optimized: sinking common stores to c_43(D)->e[2][2].real
We end up not vectorizing the epilogue because
t.c:11:14: note: Decided to SLP 1 instances. Unrolling factor 1
...
t.c:11:14: note: operating on full vectors for epilogue loop.
t.c:11:14: missed: not vectorized: loop only has a single scalar iteration.
t.c:11:14: missed: Loop costings not worthwhile.
t.c:11:14: note: ***** Analysis failed with vector mode V16QI
Huh. So this is related to peeling for gaps, which we avoid by partial
loads but do not anticipate here:
t.c:11:14: note: Detected interleaving load of size 6
t.c:11:14: note: ar_88 = a_7(D)->e[i_3][0].real;
t.c:11:14: note: ai_89 = a_7(D)->e[i_3][0].imag;
t.c:11:14: note: ar_137 = a_7(D)->e[i_3][1].real;
t.c:11:14: note: ai_138 = a_7(D)->e[i_3][1].imag;
t.c:11:14: note: <gap of 2 elements>
This is because of

t.c:11:14: note: Queuing group with duplicate access for fixup

that is, some CSE is missed due to unrolling and the possible aliasing of the
destination 'c' with the sources 'a'/'b' that is exposed after unrolling.
We're not CSEing again after deciding to version the loop for the aliasing.
This also results in redundant loads. But possibly LTO mitigates some
of this issue.
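Just to illustrate where the versioning and the redundant loads come from
(not a suggestion to change the benchmark source): with a hypothetical
restrict-qualified variant the compiler could assume the destination does not
overlap the sources, so there would be nothing to version for and the loads
from 'a'/'b' would not need to be repeated after the stores to 'c'.

/* Hypothetical, for illustration only; su3_matrix as in the kernel quoted
   below.  The restrict qualifiers assert that a, b and c do not overlap,
   which removes the need for the runtime alias check and lets loads from
   a/b be CSEd across the stores to c.  */
void mult_su3_na_noalias (su3_matrix * restrict a,
                          su3_matrix * restrict b,
                          su3_matrix * restrict c);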
The way we handle the 6-lane SLP with the vector(4) double store
is also a bit odd and likely adds to the slowdown compared to the
more straightforward vector(2) double vectorization. But x86 does not
compare costs between vector modes, and vectorizing with AVX2 is deemed
better than not vectorizing. So ... we do have to open up that can of
worms eventually.
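For context, a size argument for why the 6-lane group maps awkwardly onto
vector(4) double (assuming the types from the kernel below): one row of the
result is 6 doubles, 48 bytes, which is not a whole number of 32-byte
vectors, so the vectorizer presumably has to cover two rows (three 32-byte
stores) per vector iteration and leave the remaining row to the scalar
epilogue, which matches the single scalar epilogue iteration above. With
vector(2) double one row is exactly three 16-byte stores.

/* Types as in the kernel below; the _Static_asserts just spell out the
   size arithmetic and claim nothing about the generated code.  */
typedef struct { double real, imag; } complex;
typedef struct { complex e[3][3]; } su3_matrix;

_Static_assert (sizeof (((su3_matrix *) 0)->e[0]) == 48,
                "one row of e is 6 doubles = 48 bytes");
_Static_assert (48 % 16 == 0 && 48 % 32 != 0,
                "48 bytes is three 16-byte vectors but not a whole number of 32-byte vectors");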
For reference, the mult_su3_na kernel:

typedef struct {
  double real;
  double imag;
} complex;
typedef struct { complex e[3][3]; } su3_matrix;

void mult_su3_na( su3_matrix *a, su3_matrix *b, su3_matrix *c ){
  int i,j;
  register double t,ar,ai,br,bi,cr,ci;
  for(i=0;i<3;i++)
    for(j=0;j<3;j++){
      ar=a->e[i][0].real; ai=a->e[i][0].imag;
      br=b->e[j][0].real; bi=b->e[j][0].imag;
      cr=ar*br; t=ai*bi; cr += t;
      ci=ai*br; t=ar*bi; ci -= t;
      ar=a->e[i][1].real; ai=a->e[i][1].imag;
      br=b->e[j][1].real; bi=b->e[j][1].imag;
      t=ar*br; cr += t; t=ai*bi; cr += t;
      t=ar*bi; ci -= t; t=ai*br; ci += t;
      ar=a->e[i][2].real; ai=a->e[i][2].imag;
      br=b->e[j][2].real; bi=b->e[j][2].imag;
      t=ar*br; cr += t; t=ai*bi; cr += t;
      t=ar*bi; ci -= t; t=ai*br; ci += t;
      c->e[i][j].real=cr;
      c->e[i][j].imag=ci;
    }
}