https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123603
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization          |target
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Oh, and in full we do
-exchange2.fppized.f90:1207:71: optimized: loop vectorized using 8 byte vectors and unroll factor 1
+exchange2.fppized.f90:1207:71: optimized: loop vectorized using 16 byte vectors and unroll factor 2
+exchange2.fppized.f90:1207:71: optimized: epilogue loop vectorized using 8 byte vectors and unroll factor 1
...
-exchange2.fppized.f90:1207:71: optimized: loop with 2 iterations completely unrolled (header execution count 22540758)
+exchange2.fppized.f90:1207:71: optimized: loop turned into non-loop; it never loops
+exchange2.fppized.f90:1207:71: optimized: loop turned into non-loop; it never loops
and we seem to have a few kinds of clones for digits_2.
The difference caused by the patch is in costing for the XMM vector loop:
block[_795] 1 times vec_perm costs 4 in body
-block[_795] 5 times unaligned_load (misalign -1) costs 60 in body
+block[_795] 2 times unaligned_load (misalign -1) costs 24 in body
...
exchange2.fppized.f90:1207:71: note: Cost model analysis:
- Vector inside of loop cost: 100
+ Vector inside of loop cost: 64
where the original count was bogus because we have
exchange2.fppized.f90:1207:71: note: Detected interleaving load of size 9
exchange2.fppized.f90:1207:71: note: _796 = block[_795];
exchange2.fppized.f90:1207:71: note: _791 = block[_794];
exchange2.fppized.f90:1207:71: note: <gap of 7 elements>
so we cost loading the gap, even though those elements are of course not
loaded; the relevant SLP node is
exchange2.fppized.f90:1207:71: note: node 0x408c7698 (max_nunits=4, refcnt=2) vector(4) int
exchange2.fppized.f90:1207:71: note: op template: _796 = block[_795];
exchange2.fppized.f90:1207:71: note: stmt 0 _796 = block[_795];
exchange2.fppized.f90:1207:71: note: stmt 1 _791 = block[_794];
exchange2.fppized.f90:1207:71: note: load permutation { 0 1 }
so we end up with
vect__796.81_951 = MEM <vector(4) int> [(int *)vectp_block.79_955];
vect__796.83_1041 = MEM <vector(4) int> [(int *)vectp_block.79_969];
vect__796.86_869 = VEC_PERM_EXPR <vect__796.81_951, vect__796.83_1041, { 0, 1, 5, 6 }>;
vect__797.87_868 = vect__796.86_869 + { 10, 10, 10, 10 };
_860 = VIEW_CONVERT_EXPR<vector(2) unsigned long>(vect__797.87_868);
_859 = BIT_FIELD_REF <_860, 64, 0>;
MEM <unsigned long> [(int *)ivtmp_862] = _859;
_855 = BIT_FIELD_REF <_860, 64, 64>;
MEM <unsigned long> [(int *)ivtmp_856] = _855;
which in hindsight isn't a very clever way of vectorizing. The previous
behavior was better: we thought we needed far more loads, concluded
SSE-width vectorization wasn't profitable, and ended up doing "emulated"
half-SSE instead. The slightly more correct costs just do not work out
that way.
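To illustrate, here is a minimal C sketch of the access pattern involved
(hypothetical; the real testcase is Fortran, and 'foo'/'dst' are made-up
names): only the first two elements of each stride-9 group are live,
which is the <gap of 7 elements> above that was costed as if it were
loaded.

/* Hypothetical C analogue of the access pattern: an interleaving
   group of size 9 where only elements 0 and 1 are used, leaving a
   gap of 7 elements that is never actually loaded.  */
void
foo (int *restrict dst, const int *restrict block, int n)
{
  for (int i = 0; i < n; i++)
    {
      dst[2 * i] = block[9 * i] + 10;          /* stmt 0: _796 = block[_795]; */
      dst[2 * i + 1] = block[9 * i + 1] + 10;  /* stmt 1: _791 = block[_794]; */
    }
}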
If we'd eventually cost SSE and SSE/2 against each other we'd face
exchange2.fppized.f90:1207:71: note: Cost model analysis:
Vector inside of loop cost: 64
Vector prologue cost: 12
Vector epilogue cost: 56
Scalar iteration cost: 56
Scalar outside cost: 0
Vector outside cost: 68
prologue iterations: 0
epilogue iterations: 1
vs
exchange2.fppized.f90:1207:71: note: Cost model analysis:
Vector inside of loop cost: 28
Vector prologue cost: 12
Vector epilogue cost: 0
Scalar iteration cost: 56
Scalar outside cost: 0
Vector outside cost: 12
prologue iterations: 0
epilogue iterations: 0
where naively, looking just at the Vector inside cost, SSE/2 * 2 == 56 < SSE == 64,
so SSE/2 would win. There would be actual loop iteration overhead (twice
as many IV increments and branches). If we could eventually cost the main
and the (vector!) epilogue loop together, SSE/2 would still win, anticipating
the later complete unrolling.
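Spelling out that naive comparison (a sketch only; the constants are
copied from the two cost dumps above, and the extra IV increment and
branch per iteration are deliberately not modeled):

#include <stdio.h>

int
main (void)
{
  /* Inside-of-loop costs from the two cost model dumps above.  */
  int sse_inside = 64;   /* 16-byte vectors, covers the iterations once */
  int half_inside = 28;  /* 8-byte vectors, needs twice the iterations */

  printf ("SSE:   %d\n", sse_inside * 1);   /* 64 */
  printf ("SSE/2: %d\n", half_inside * 2);  /* 56, so SSE/2 wins naively */
  return 0;
}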
But as noted elsewhere, x86 doesn't bother doing any cost comparison.
I'd say there's nothing wrong in the vectorizer itself; it's a target
costing issue.
I do plan to experiment with enabling the cost comparison in the x86
backend, but this isn't something for stage4.