[Bug target/115002] [14/15/16 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform

rguenth at gcc dot gnu.org via Gcc-bugs Fri, 13 Feb 2026 04:57:27 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Blocks|                            |88670
   Last reconfirmed|                            |2026-02-13
     Ever confirmed|0                           |1
           Priority|P3                          |P2

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Well.  GCC 13 simply ignores the

_Pragma("GCC unroll 8")

request for me.  When I remove that, performance is better, but GCC 15 (and
trunk) is still slower than GCC 13:

> gcc-13 t.c -O2
> ./a.out 
7280.76 vint8w2048_t ops per sec, duration = 13.73 secs
> gcc-15 t.c -O2 
> ./a.out 
6301.25 vint8w2048_t ops per sec, duration = 15.87 secs

There is still very high register pressure in the loop (obviously).  One issue
is that we fail to eliminate the wide vector PHIs during vector lowering.
We might have a duplicate PR for this.  We also do not lower other copies.
A small example:

typedef int v64si __attribute__((vector_size (64*4)));

v64si a, b[16];

void foo()
{
  for (int i = 0; i < 16; ++i)
    a += b[i];
}


  <bb 2> [local count: 63136016]:
  a_lsm.4_5 = a;
  ivtmp.12_16 = (unsigned long) &b;
  _19 = ivtmp.12_16 + 4096;

  <bb 3> [local count: 1010605808]:
  # a_lsm.4_13 = PHI <_3(3), a_lsm.4_5(2)>
  # ivtmp.12_4 = PHI <ivtmp.12_9(3), ivtmp.12_16(2)>
  _17 = (void *) ivtmp.12_4;
  _12 = &MEM[(vector(64) int *)_17];
  _26 = BIT_FIELD_REF <MEM[(vector(64) int *)_12], 128, 512>;
...
  _60 = BIT_FIELD_REF <a_lsm.4_13, 128, 1920>;
  _61 = _59 + _60;
  _3 = {_7, _8, _22, _25, _28, _31, _34, _37, _40, _43, _46, _49, _52, _55, _58, _61};
  ivtmp.12_9 = ivtmp.12_4 + 256;
  if (ivtmp.12_9 != _19)
    goto <bb 3>; [93.75%]
  else
    goto <bb 4>; [6.25%]

  <bb 4> [local count: 63136016]:
  a = _3;

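We want to hoist the decomposition/composition of the larger vector out
of the loop, so that the loop carries the 128-bit pieces instead of
rebuilding the wide vector _3 and splitting a_lsm.4_13 on every iteration.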
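
As a source-level illustration only (this is not the lowering change itself,
and not what any GCC version emits), the hoisting can be sketched by hand:
split the accumulator into 128-bit pieces once before the loop, keep the
pieces live across iterations, and rebuild the wide vector once afterwards.
The names v4si, PIECES, foo_hoisted and hoisted_matches_reference are
hypothetical and introduced just for this sketch; the 128-bit piece size is
taken from the BIT_FIELD_REFs in the dump above.

```c
#include <string.h>

typedef int v64si __attribute__((vector_size (64*4)));
/* Hypothetical 128-bit piece type, matching the 128-bit BIT_FIELD_REFs.  */
typedef int v4si __attribute__((vector_size (16)));

enum { PIECES = sizeof (v64si) / sizeof (v4si) };

v64si a, b[16];

/* The loop from the testcase, unchanged.  */
void foo (void)
{
  for (int i = 0; i < 16; ++i)
    a += b[i];
}

/* Hand-hoisted variant: the wide accumulator is decomposed and
   recomposed outside the loop instead of once per iteration.  */
void foo_hoisted (void)
{
  v4si acc[PIECES];
  memcpy (acc, &a, sizeof a);           /* decompose once */
  for (int i = 0; i < 16; ++i)
    {
      v4si part[PIECES];
      memcpy (part, &b[i], sizeof part); /* load b[i] as pieces */
      for (int k = 0; k < PIECES; ++k)
        acc[k] += part[k];
    }
  memcpy (&a, acc, sizeof a);           /* compose once */
}

/* Returns 1 if both variants leave the same value in 'a'.  */
int hoisted_matches_reference (void)
{
  v64si expect;
  for (int i = 0; i < 16; ++i)
    for (int j = 0; j < 64; ++j)
      b[i][j] = i * 64 + j;
  memset (&a, 0, sizeof a);
  foo ();
  expect = a;
  memset (&a, 0, sizeof a);
  foo_hoisted ();
  return memcmp (&a, &expect, sizeof a) == 0;
}
```

Whether the sixteen pieces actually stay in registers still depends on the
target and on register pressure; the sketch only shows the shape of the loop
after hoisting.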

I'm not sure why GCC 13 was faster; it basically did the same.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88670
[Bug 88670] [meta-bug] generic vector extension issues

[Bug target/115002] [14/15/16 regression] wide integer vector performance regression, x86, between gcc-14 and gcc-13 using target clones on skylake platform
