[Bug tree-optimization/123190] [16 Regression] 8% slowdown of 433.milc on AMD zen4 since r16-5275-ga645e903e8c394

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 13 Jan 2026 07:29:14 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123190


--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 16 (AVX2) cost model:

t.c:11:14: note:  Cost model analysis:
  Vector inside of loop cost: 1292
  Vector prologue cost: 316
  Vector epilogue cost: 1488
  Scalar iteration cost: 1488
  Scalar outside cost: 8
  Vector outside cost: 1804
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2

GCC 16 (SSE2) cost model [-mprefer-vector-width=128]:

t.c:11:14: note:  Cost model analysis: 
  Vector inside of loop cost: 956 
  Vector prologue cost: 316
  Vector epilogue cost: 0 
  Scalar iteration cost: 1488
  Scalar outside cost: 8
  Vector outside cost: 316
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
t.c:11:14: note:    Runtime profitability threshold = 1

Given the constant iteration count, the SSE2 loop costs 956 * 3 vs. 1292 + 1488
for AVX2, which makes the AVX2 loop cheaper on paper, with the very much higher
outside cost turning the tide to the SSE2 vector width, if we were actually
going to do cost comparison on x86.  It's too late to do this for GCC 16,
though.  With -mprefer-vector-width=128 GCC 16 is faster overall and the
difference in this function is much lower.
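
For reference, the in-loop arithmetic above can be spelled out as a small sketch.  It only reproduces the two sums quoted in the comment, using the numbers from the dumps.  The trip count of 3 is an assumption inferred from the "956 * 3" term, and the function names are made up for illustration; this is not how the vectorizer itself computes or compares costs.

```python
# Costs copied from the two vectorizer dumps above.
AVX2_INSIDE = 1292      # "Vector inside of loop cost" for AVX2
AVX2_EPILOGUE = 1488    # "Vector epilogue cost" for AVX2 (1 epilogue iteration)
SSE2_INSIDE = 956       # "Vector inside of loop cost" for SSE2

# Assumption: constant trip count of 3, inferred from the "956 * 3" term.
NITERS = 3


def sse2_loop_cost(niters=NITERS):
    """SSE2 dump shows 0 epilogue iterations, so all iterations are vector ones."""
    return SSE2_INSIDE * niters


def avx2_loop_cost():
    """One AVX2 vector iteration plus the single epilogue iteration."""
    return AVX2_INSIDE + AVX2_EPILOGUE


if __name__ == "__main__":
    print("SSE2 in-loop cost:", sse2_loop_cost())   # 2868
    print("AVX2 in-loop cost:", avx2_loop_cost())   # 2780
    print("AVX2 - SSE2:", avx2_loop_cost() - sse2_loop_cost())  # -88
```

On these numbers alone the AVX2 variant comes out 88 cost units cheaper, which is the "cheaper on paper" part; the outside costs from the dumps are deliberately left out of the sketch.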
