https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Blocks|                            |88670
   Last reconfirmed|                            |2026-02-13
     Ever confirmed|0                           |1
           Priority|P3                          |P2
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Well. GCC 13 simply ignores the
_Pragma("GCC unroll 8")
request for me. When I remove it, performance is better, but GCC 15 (and
trunk) are still slower:
> gcc-13 t.c -O2
> ./a.out
7280.76 vint8w2048_t ops per sec, duration = 13.73 secs
> gcc-15 t.c -O2
> ./a.out
6301.25 vint8w2048_t ops per sec, duration = 15.87 secs
There's still very high register pressure in the loop (obviously). One issue
is that we fail to eliminate the wide vector PHIs during vector lowering.
We might have a duplicate PR for this. We also do not lower other copies.
typedef int v64si __attribute__((vector_size (64*4)));
v64si a, b[16];
void foo()
{
  for (int i = 0; i < 16; ++i)
    a += b[i];
}
  <bb 2> [local count: 63136016]:
  a_lsm.4_5 = a;
  ivtmp.12_16 = (unsigned long) &b;
  _19 = ivtmp.12_16 + 4096;

  <bb 3> [local count: 1010605808]:
  # a_lsm.4_13 = PHI <_3(3), a_lsm.4_5(2)>
  # ivtmp.12_4 = PHI <ivtmp.12_9(3), ivtmp.12_16(2)>
  _17 = (void *) ivtmp.12_4;
  _12 = &MEM[(vector(64) int *)_17];
  _26 = BIT_FIELD_REF <MEM[(vector(64) int *)_12], 128, 512>;
...
  _60 = BIT_FIELD_REF <a_lsm.4_13, 128, 1920>;
  _61 = _59 + _60;
  _3 = {_7, _8, _22, _25, _28, _31, _34, _37, _40, _43, _46, _49, _52, _55,
_58, _61};
  ivtmp.12_9 = ivtmp.12_4 + 256;
  if (ivtmp.12_9 != _19)
    goto <bb 3>; [93.75%]
  else
    goto <bb 4>; [6.25%]

  <bb 4> [local count: 63136016]:
  a = _3;
We want to hoist the decomposition/composition of the larger vector out
of the loop.
I'm not sure why GCC 13 was faster; it basically did the same.
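
For illustration only, here is a source-level sketch of what hoisting the
decomposition/composition could look like for the testcase above. It is not
what the lowering pass emits and not a proposed patch. The names v4si and
foo_hoisted are made up for this example, and the 128-bit piece size is
assumed from the BIT_FIELD_REFs in the dump. Whether GCC keeps the sixteen
accumulators in registers for this form has not been checked here.

```c
#include <string.h>

typedef int v64si __attribute__((vector_size (64*4)));
/* Hypothetical 128-bit piece type, matching the 128-bit BIT_FIELD_REFs.  */
typedef int v4si __attribute__((vector_size (16)));

_Static_assert (sizeof (v64si) == 16 * sizeof (v4si),
		"a v64si splits into exactly 16 v4si pieces");

v64si a, b[16];

/* Hand-lowered variant of foo (): 'a' is decomposed once before the
   loop and composed once after it, so only 128-bit pieces are carried
   around the loop instead of one wide vector.  */
void foo_hoisted (void)
{
  v4si acc[16];

  /* Decomposition of the wide vector, hoisted out of the loop.  */
  memcpy (acc, &a, sizeof a);

  for (int i = 0; i < 16; ++i)
    {
      v4si part[16];

      /* b[i] still has to be loaded piecewise on every iteration.  */
      memcpy (part, &b[i], sizeof part);
      for (int j = 0; j < 16; ++j)
	acc[j] += part[j];
    }

  /* Composition of the wide vector, sunk below the loop.  */
  memcpy (&a, acc, sizeof a);
}
```

The result in 'a' should be the same as with foo (); only the place where
the wide vector is split and rebuilt differs.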
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88670
[Bug 88670] [meta-bug] generic vector extension issues