https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Compared to GCC 15.2 I can reproduce a ~10% slowdown on Zen4 for both PRs
combined.  A perf profile shows (base is GCC 15, peak is GCC 16):

  14.47%        130071  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] u_shift_fermion
  12.37%        111743  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
  12.07%        109548  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
  11.61%        103770  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] u_shift_fermion
   9.52%         97952  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_na
   7.57%         78713  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_na
   5.40%         51737  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_nn
   5.37%         52655  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_nn
   2.79%         25217  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_an
   2.62%         23754  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_an
   2.33%         26147  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] path_product
   2.30%         26739  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] path_product

where there's a clear regression in u_shift_fermion and mult_su3_na.  The
former is from quark_stuff.c and the latter from m_mat_na.c; the -fopt-info
differences for those are

+quark_stuff.c:162:34: optimized: sinking common stores to vec[1]

but locations with -flto might be misleading.

Without -flto the regression is bigger (18%), and dropping -fprofile-use in
favor of plain -O3 produces a similar result.  Thus -O3 -march=x86-64-v3
is what I'm looking at now instead.  The profile difference there is even more
pronounced:

  11.98%        108891  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
   9.84%        101290  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_na
   8.34%         75326  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] su3_projector
   8.31%         78868  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_nn
   7.40%         66823  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_su3_mat_vec
   7.03%         63486  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] mult_adj_su3_mat_vec
   6.35%         61230  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_nn
   5.88%         53026  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_mat_vec
   5.70%         51277  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_adj_su3_mat_vec
   5.22%         58362  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] mult_su3_na
   4.78%         44280  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] scalar_mult_add_su3_matrix
   1.52%         18881  milc_peak.amd64  milc_peak.amd64-m64-gcc42-nn  [.] su3mat_copy
   1.46%         18574  milc_base.amd64  milc_base.amd64-m64-gcc42-nn  [.] su3mat_copy

Note that scalar_mult_add_su3_matrix is faster with GCC 16 but mult_su3_na
is a lot slower.  We can see

-m_mat_na.c:31:14: optimized: loop vectorized using 16 byte vectors
+m_mat_na.c:31:14: optimized: loop vectorized using 32 byte vectors and unroll factor 2

meaning we're using AVX2 vectors and SLP with two lanes now.  IIRC we've
tuned for this kernel in the past.
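
To make that concrete, here is a rough sketch using GNU C vector extensions
(the helper name cmac_conj and the exact lane layout are my own illustration,
not the vectorizer's output) of what a two-lane {real, imag} group for the
a * conj(b) accumulation looks like with 16 byte vectors; the 32 byte variant
does the same on four lanes, i.e. two result pairs at a time:

typedef double v2df __attribute__((vector_size(16)));

/* acc += a * conj(b) on one {real, imag} pair:
     acc.real += ar*br + ai*bi
     acc.imag += ai*br - ar*bi  */
static inline v2df cmac_conj (v2df acc, v2df a, v2df b)
{
  v2df br = (v2df) { b[0], b[0] };   /* broadcast b.real */
  v2df bi = (v2df) { b[1], b[1] };   /* broadcast b.imag */
  v2df as = (v2df) { a[1], -a[0] };  /* { ai, -ar } */
  return acc + a * br + as * bi;
}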

t.c:12:16: optimized: loop with 3 iterations completely unrolled (header execution count 268435456)
t.c:11:14: optimized: loop vectorized using 32 byte vectors and unroll factor 2
t.c:11:14: optimized:  loop versioned for vectorization because of possible aliasing
t.c:11:14: optimized: loop with 2 iterations completely unrolled (header execution count 17895697)
t.c:11:14: optimized: loop turned into non-loop; it never loops
t.c:14:11: optimized: loop turned into non-loop; it never loops
t.c:30:24: optimized: sinking common stores to c_43(D)->e[2][2].imag
t.c:29:24: optimized: sinking common stores to c_43(D)->e[2][2].real

We end up not vectorizing the epilogue because

t.c:11:14: note:   Decided to SLP 1 instances. Unrolling factor 1
...
t.c:11:14: note:  operating on full vectors for epilogue loop.
t.c:11:14: missed:  not vectorized: loop only has a single scalar iteration.
t.c:11:14: missed:  Loop costings not worthwhile.
t.c:11:14: note:  ***** Analysis failed with vector mode V16QI

Huh.  So this is related to peeling for gaps, which we avoid by using partial
loads but do not anticipate here:

t.c:11:14: note:   Detected interleaving load of size 6
t.c:11:14: note:        ar_88 = a_7(D)->e[i_3][0].real;
t.c:11:14: note:        ai_89 = a_7(D)->e[i_3][0].imag;
t.c:11:14: note:        ar_137 = a_7(D)->e[i_3][1].real;
t.c:11:14: note:        ai_138 = a_7(D)->e[i_3][1].imag;
t.c:11:14: note:        <gap of 2 elements>

This is because

t.c:11:14: note:   Queuing group with duplicate access for fixup

So there is some missed CSE, due to the unrolling and to the possible aliasing
of the destination 'c' with the sources 'a'/'b' that gets exposed after
unrolling.  We're not CSEing again after deciding to version the loop for the
aliasing; this also results in redundant loads.  But possibly LTO mitigates
some of this issue.
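
As a source-level illustration of where the versioning comes from
(hypothetical, untested, using the su3_matrix type from the testcase at the
end): restrict-qualifying the arguments would let the compiler assume 'c'
does not overlap 'a'/'b', so neither the runtime alias check nor the re-loads
after stores to 'c' would be needed:

/* Hypothetical variant, not verified to change the generated code or
   the measured numbers.  */
void mult_su3_na_restrict (const su3_matrix * restrict a,
                           const su3_matrix * restrict b,
                           su3_matrix * restrict c);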

The way we handle the 6-lane SLP with a vector(4) double store
is also a bit odd and likely adds to the slowdown compared to the
more straightforward vector(2) double vectorization.  But x86 does not
compare costs, and vectorizing with AVX2 is better than not vectorizing.
So ... we do have to open up that can of worms eventually.
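
One possible reading of the oddity (an assumption on my part; the actual
vectorized code is more involved): the 6 doubles of one result row do not
divide evenly into 4-lane vectors, so the store side needs something like a
4 + 2 split or partial/overlapping stores, while 2-lane vectors cover the row
with three even stores:

typedef double v2df __attribute__((vector_size(16)));
typedef double v4df __attribute__((vector_size(32)));

/* vector(2) double: the 6 results of one i-row split into 3 even stores. */
static void store_row_v2 (double *row, v2df c0, v2df c1, v2df c2)
{
  __builtin_memcpy (row + 0, &c0, sizeof c0);
  __builtin_memcpy (row + 2, &c1, sizeof c1);
  __builtin_memcpy (row + 4, &c2, sizeof c2);
}

/* vector(4) double: 6 is not a multiple of 4, so e.g. a 4 + 2 split. */
static void store_row_v4 (double *row, v4df c01, v2df c2)
{
  __builtin_memcpy (row + 0, &c01, sizeof c01);
  __builtin_memcpy (row + 4, &c2, sizeof c2);
}

The full testcase (t.c) for reference: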

typedef struct {
   double real;
   double imag;
} complex;

typedef struct { complex e[3][3]; } su3_matrix;

/* c <- a * b-adjoint: c[i][j] = sum_k a[i][k] * conj(b[j][k]) */
void mult_su3_na( su3_matrix *a, su3_matrix *b, su3_matrix *c ){
int i,j;
register double t,ar,ai,br,bi,cr,ci;
    for(i=0;i<3;i++)
      for(j=0;j<3;j++){

        ar=a->e[i][0].real; ai=a->e[i][0].imag;
        br=b->e[j][0].real; bi=b->e[j][0].imag;
        cr=ar*br; t=ai*bi; cr += t;
        ci=ai*br; t=ar*bi; ci -= t;

        ar=a->e[i][1].real; ai=a->e[i][1].imag;
        br=b->e[j][1].real; bi=b->e[j][1].imag;
        t=ar*br; cr += t; t=ai*bi; cr += t;
        t=ar*bi; ci -= t; t=ai*br; ci += t;

        ar=a->e[i][2].real; ai=a->e[i][2].imag;
        br=b->e[j][2].real; bi=b->e[j][2].imag;
        t=ar*br; cr += t; t=ai*bi; cr += t;
        t=ar*bi; ci -= t; t=ai*br; ci += t;

        c->e[i][j].real=cr;
        c->e[i][j].imag=ci;
    }
}
