[Bug target/123603] [16 Regression] 13% slowdown of exchange2_r on Zen4 since r16-6767-g948d33f490a6b0

rguenth at gcc dot gnu.org via Gcc-bugs Fri, 16 Jan 2026 01:14:29 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123603


--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Testcase; it needs -O2 -ftree-vectorize (a better cost model than 'very-cheap',
since while that model allows peeling for niter it does not allow peeling for
gaps).

void foo (int *block)
{
  for (int i = 0; i < 3; ++i)
    {
      int a = block[i*9];
      int b = block[i*9+1];
      block[i*9] = a + 10;
      block[i*9+1] = b + 10;
    }
} 

Apart from the fact that we fail to consider the SSE/2 version for the
reason already mentioned, we fail to realize that peeling for gaps isn't
necessary if we'd emit short loads.  But in this case the odd stride of 9,
together with a VF of 2 and doing contiguous loads, means we load
3 elements in the last meaningful V4SI, which is not a power of two.
Realizing this, and realizing that gap > nunits, we might fall back to a
strided operation (VMAT_STRIDED_SLP) in such cases.  That would allow for
cheaper vector composition; on x86, a movq + movhps.
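
For illustration only, here is a hand-written sketch of what such a strided
composition could look like for the vector part of the testcase (VF of 2, so
iterations i = 0 and i = 1), written with SSE2 intrinsics.  The function name
foo_strided and the scalar tail for i = 2 are assumptions made for this
sketch; it is not what the vectorizer emits today, and a compiler may choose
different instructions for the intrinsics.  With a group size of 2 and a
stride of 9 the gap is 7 elements, larger than the 4 lanes of a V4SI, so each
group is loaded as its own 64-bit half instead of through contiguous loads.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE __m64 load/store forms */

/* Hypothetical hand-vectorized variant of foo () from the testcase. */
void foo_strided (int *block)
{
  /* Group of iteration 0: block[0], block[1] -> low 64 bits (movq).  */
  __m128i lo = _mm_loadl_epi64 ((const __m128i *) &block[0]);

  /* Group of iteration 1: block[9], block[10] -> high 64 bits (movhps).  */
  __m128 v = _mm_loadh_pi (_mm_castsi128_ps (lo), (const __m64 *) &block[9]);

  /* Add 10 to all four lanes at once.  */
  __m128i r = _mm_add_epi32 (_mm_castps_si128 (v), _mm_set1_epi32 (10));

  /* Store each 64-bit half back to its own group; the gap is never touched.  */
  _mm_storel_epi64 ((__m128i *) &block[0], r);
  _mm_storeh_pi ((__m64 *) &block[9], _mm_castsi128_ps (r));

  /* Scalar tail for the remaining iteration i = 2.  */
  block[18] += 10;
  block[19] += 10;
}
```

Only the six elements the original loop writes are modified, and no load
reaches past block[19], so no peeling for gaps is needed in this form.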
