[Bug tree-optimization/123190] [16 Regression] 8% slowdown of 433.milc on AMD zen4 since r16-5275-ga645e903e8c394

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 14 Jan 2026 07:33:55 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190


--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have a patch that improves behavior with -O3, where we unroll the inner loop
of mult_su3_na.  With -O2 -flto -fprofile-use we do not unroll it, and for
this function vectorizing the epilog isn't deemed profitable (despite the
fix).  With -O3:

t.c:11:14: note:  operating on full vectors for epilogue loop.
t.c:11:14: note:  Cost model analysis: 
  Vector inside of loop cost: 956
  Vector prologue cost: 288
  Vector epilogue cost: 0
  Scalar iteration cost: 1488
  Scalar outside cost: 0
  Vector outside cost: 288
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
t.c:11:14: note:    Runtime profitability threshold = 1

but -O2:

t.c:12:16: note:  operating on full vectors for epilogue loop.
t.c:12:16: note:  Cost model analysis: 
  Vector inside of loop cost: 436
  Vector prologue cost: 96
  Vector epilogue cost: 0
  Scalar iteration cost: 528
  Scalar outside cost: 0
  Vector outside cost: 96
  prologue iterations: 0
  epilogue iterations: 0 
  Calculated minimum iters for profitability: 2
t.c:12:16: note:    Runtime profitability threshold = 2

note the number of epilog iterations is 1 (but with VF == 1 as well).
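
The two thresholds above are consistent with a simple break-even reading of
the dumped costs.  A minimal sketch in C, assuming VF == 1 (per the remark
above) and the simplified inequality
scalar_iter * n + scalar_outside > vec_inside * n / VF + vec_outside.
The function name min_profitable_iters is hypothetical and this is not the
vectorizer's actual code:

```c
#include <assert.h>

/* Simplified break-even model (an assumption, not GCC's implementation).
   Returns the smallest iteration count n for which
     scalar_iter * n + scalar_outside > vec_inside * n / vf + vec_outside,
   or -1 if the vector loop never becomes cheaper.  Both sides are
   multiplied by vf to stay in integer arithmetic.  */
long min_profitable_iters(long scalar_iter, long scalar_outside,
                          long vec_inside, long vec_outside, long vf)
{
    assert(vf > 0);

    /* The per-iteration vector cost must be lower than the scalar cost,
       otherwise the outside cost can never be amortized.  */
    if (scalar_iter * vf <= vec_inside)
        return -1;

    long n = 1;
    while (scalar_iter * n * vf + scalar_outside * vf
           <= vec_inside * n + vec_outside * vf)
        n++;
    return n;
}
```

With the -O3 costs (1488, 0, 956, 288) this yields 1; with the -O2 costs
(528, 0, 436, 96) it yields 2, because at a single iteration 436 + 96 = 532
still exceeds 528.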

In particular, load costing for splat permutes looks off to me, though it
matches what we code-generate:

t.c:11:14: note:   ------>vectorizing SLP node starting from: ar_230 = a_7(D)->e[i_217][1].real;
t.c:11:14: note:   transform load.
t.c:11:14: note:   create vector_type-pointer variable to type: vector(2) double  vectorizing a record based array ref: *a_7(D)
t.c:11:14: note:   created vectp_a.125_546
t.c:11:14: note:   add new stmt: vect_ar_230.126_549 = MEM <vector(2) double> [(double *)vectp_a.124_547];
t.c:11:14: note:   add new stmt: vectp_a.124_550 = vectp_a.124_547 + 16;
t.c:11:14: note:   add new stmt: vect_ar_230.127_551 = MEM <vector(2) double> [(double *)vectp_a.124_550];
t.c:11:14: note:   add new stmt: vectp_a.124_552 = vectp_a.124_550 + 16;
gimple_simplified to vectp_a.124_552 = vectp_a.124_547 + 32;
t.c:11:14: note:   add new stmt: vect_ar_230.128_553 = MEM <vector(2) double> [(double *)vectp_a.124_552];
t.c:11:14: note:   add new stmt: vect_ar_220.129_554 = VEC_PERM_EXPR <vect_ar_230.127_551, vect_ar_230.127_551, { 0, 0 }>;

that's from the VMAT_CONTIGUOUS path.  I have an improvement for this as well.
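
For readers unfamiliar with the dump, the shape of that sequence can be
modeled in plain C.  This is only an illustrative sketch: vec_perm2,
load_then_splat and scalar_broadcast are hypothetical names, arrays stand in
for vector(2) double, and the scalar-broadcast variant is shown purely for
comparison, not as a description of the pending improvement:

```c
#include <assert.h>

/* Model of a two-lane VEC_PERM_EXPR <a, b, sel>: lane i of the result is
   element sel[i] of the concatenation of a and b.  */
void vec_perm2(const double a[2], const double b[2],
               const int sel[2], double out[2])
{
    for (int i = 0; i < 2; i++) {
        assert(sel[i] >= 0 && sel[i] < 4);
        out[i] = sel[i] < 2 ? a[sel[i]] : b[sel[i] - 2];
    }
}

/* Shape seen in the dump: load a full two-lane vector, then duplicate
   lane 0 with the selector { 0, 0 }.  */
void load_then_splat(const double *p, double out[2])
{
    const double v[2] = { p[0], p[1] };
    const int sel[2] = { 0, 0 };
    vec_perm2(v, v, sel, out);
}

/* Comparison only (hypothetical): load one scalar and broadcast it.  */
void scalar_broadcast(const double *p, double out[2])
{
    out[0] = out[1] = *p;
}
```

Both helpers produce the same two lanes; they differ in how many elements are
loaded and whether a permute is needed, which is the part the costing has to
account for.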
