https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125218
Bug ID: 125218
Summary: [15/16 Regression] AArch64 SVE: conditional reduction with int→short narrowing fails to vectorize on trunk, regression from GCC 15.2
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
GCC trunk fails to vectorize a conditional reduction loop when the accumulated
value involves narrowing from `int` to `short`. The vectorizer reports
"unsupported use in stmt" and generates fully scalar code, even with
`-fvect-cost-model=unlimited`.
GCC 15.2.0 successfully vectorizes the inner loop for this same `int` → `short`
case, using SVE MASK_GATHER_LOAD with predicated narrowing adds and `uaddv`
reduction. Trunk rejects it entirely, with all vector modes from VNx4SI down to
V2SI failing.
**Test case:**
```c
short foo(
    const int * __restrict__ a,
    const int * __restrict__ b,
    int n, int m) {
    short sum = 0;
    for (int j = m - 1; j >= 0; j -= 1)
    {
        for (int i = n - 1; i >= 0; i -= 1)
        {
            int idx = j * m + i;
            if ((a[idx] > b[idx] && b[idx] != 0)) {
                sum += (short)a[(idx * 2)];
                sum += (short)b[(idx * 2)];
            }
        }
    }
    return (short)sum;
}
```
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```
**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all
```
**GCC trunk output:**
```
<source>:7:27: missed: couldn't vectorize loop
<source>:7:27: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
<source>:9:31: missed: couldn't vectorize loop
<source>:1:7: missed: not vectorized: unsupported use in stmt.
<source>:1:7: note: vectorized 0 loops in function.
<source>:14:21: note: ***** Analysis failed with vector mode VNx4SI
<source>:14:21: note: ***** The result for vector mode VNx16QI would be the same
<source>:14:21: note: ***** The result for vector mode VNx8QI would be the same
<source>:14:21: note: ***** Re-trying analysis with vector mode VNx4QI
<source>:14:21: note: ***** Analysis failed with vector mode VNx4QI
<source>:14:21: note: ***** Re-trying analysis with vector mode VNx2QI
<source>:14:21: note: ***** Analysis failed with vector mode VNx2QI
<source>:14:21: note: ***** Re-trying analysis with vector mode V16QI
<source>:14:21: note: ***** Analysis failed with vector mode V16QI
...
```
Generated assembly (fully scalar, no SIMD instructions used):
```assembly
foo:
subs w13, w3, #1
bmi .L9
mul w12, w13, w3
sub w9, w12, #1
lsl w9, w9, 1
cmp w2, 0
bgt .L17
.L4:
sub w9, w9, w3, lsl 1
sub w12, w12, w3
cbz w13, .L9
sub w13, w13, #1
cmp w2, 0
ble .L4
.L17:
mov w11, 0
.L7:
add w7, w12, w2
add w5, w9, w2, lsl 1
mov x4, 0
lsl x7, x7, 2
sub x7, x7, #4
add x10, x0, x7
add x7, x1, x7
b .L6
.L5:
sub w5, w5, #2
cmp w5, w9
beq .L18
.L6:
ldr w6, [x7, x4]
ldr w8, [x10, x4]
sub x4, x4, #4
cmp w6, 0
ccmp w6, w8, 0, ne
bge .L5
ubfiz x8, x5, 2, 32
sub w5, w5, #2
ldr w6, [x1, x8]
ldrh w8, [x0, x8]
add w6, w6, w8
add w11, w6, w11
cmp w5, w9
bne .L6
.L18:
sub w9, w9, w3, lsl 1
sub w12, w12, w3
cbz w13, .L3
sub w13, w13, #1
b .L7
.L9:
mov w11, 0
.L3:
mov w0, w11
ret
```
Also reproducible on Godbolt: https://godbolt.org/z/M8h4jdhxo.
**GCC 15.2.0 (for comparison):**
```
<source>:7:27: missed: couldn't vectorize loop
<source>:7:27: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
<source>:9:31: optimized: loop vectorized using variable length vectors
<source>:1:7: note: vectorized 1 loops in function.
<source>:1:7: missed: statement clobbers memory: vect_patt_88.31_210 = .MASK_GATHER_LOAD (_54, vect__8.30_208, 4, { 0, ... }, mask__38.24_193, { 0, ... });
<source>:1:7: missed: statement clobbers memory: vect_patt_88.32_211 = .MASK_GATHER_LOAD (_54, vect__8.30_209, 4, { 0, ... }, mask__38.24_194, { 0, ... });
<source>:1:7: missed: statement clobbers memory: vect_patt_87.34_213 = .MASK_GATHER_LOAD (_52, vect__8.30_208, 4, { 0, ... }, mask__38.24_193, { 0, ... });
<source>:1:7: missed: statement clobbers memory: vect_patt_87.35_214 = .MASK_GATHER_LOAD (_52, vect__8.30_209, 4, { 0, ... }, mask__38.24_194, { 0, ... });
```
Key vectorized portion (inner loop, showing SVE gather + narrowing add + uaddv
reduction):
```assembly
.L4:
ld1w z24.s, p7/z, [x5]
ld1w z2.s, p7/z, [x6]
rev z24.s, z24.s
ld1w z3.s, p7/z, [x5, #-1, mul vl]
ld1w z1.s, p7/z, [x6, #-1, mul vl]
rev z3.s, z3.s
rev z2.s, z2.s
cmpne p5.s, p7/z, z24.s, #0
cmpne p6.s, p7/z, z3.s, #0
cmpgt p5.s, p5/z, z2.s, z24.s
rev z1.s, z1.s
inch x4
cmpgt p6.s, p6/z, z1.s, z3.s
add z28.s, z31.s, z30.s
mov z27.d, z30.d
add z28.s, z28.s, z28.s
uzp1 p4.h, p5.h, p6.h
ld1w z0.s, p5/z, [x0, z28.s, sxtw 2]
ld1w z25.s, p5/z, [x1, z28.s, sxtw 2]
decw z27.s
decb x5, all, mul #2
add z27.s, z27.s, z31.s
decb x6, all, mul #2
add z27.s, z27.s, z27.s
decw z30.s, all, mul #2
ld1w z26.s, p6/z, [x0, z27.s, sxtw 2]
ld1w z23.s, p6/z, [x1, z27.s, sxtw 2]
uzp1 z26.h, z0.h, z26.h
uzp1 z23.h, z25.h, z23.h
add z23.h, z23.h, z26.h
add z29.h, p4/m, z29.h, z23.h
cmp w10, w4
bcs .L4
uaddv d31, p7, z29.h
fmov x5, d31
add w11, w11, w5
```
GCC 15.2.0 output on Godbolt: https://godbolt.org/z/c9Yrqsqdv.
**Additional notes:**
1. This is a regression from GCC 15.2, which successfully vectorized the inner
loop using SVE MASK_GATHER_LOAD with predicated narrowing adds and `uaddv`
halfword reduction.
2. `-fvect-cost-model=unlimited` has no effect on trunk, indicating this is a
capability failure rather than a cost-model decision.
3. On trunk, the "unsupported use in stmt" message is reported against the
function (`<source>:1:7`), while the per-mode "Analysis failed" notes point at
source line 14 (`sum += (short)a[(idx * 2)]`), where the narrowing conversion
meets the conditional reduction pattern.
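For triage, it may help to compare against a variant that keeps the same
condition and gathers but drops the per-iteration narrowing. The sketch below is
not part of the original report: `foo_wide` and `outputs_match` are hypothetical
names, and whether trunk vectorizes `foo_wide` has not been checked. It
accumulates in `unsigned int` and truncates once at the end, which yields the
same value as the `short` accumulator because both compute the sum modulo 2^16
(this relies on GCC's documented modulo behaviour for out-of-range conversions
to signed types).

```c
#include <assert.h>

/* Kernel from the report, unchanged in behaviour. */
short foo(const int *__restrict__ a, const int *__restrict__ b, int n, int m)
{
    short sum = 0;
    for (int j = m - 1; j >= 0; j -= 1) {
        for (int i = n - 1; i >= 0; i -= 1) {
            int idx = j * m + i;
            if (a[idx] > b[idx] && b[idx] != 0) {
                sum += (short)a[idx * 2];
                sum += (short)b[idx * 2];
            }
        }
    }
    return sum;
}

/* Hypothetical comparison variant: wide unsigned accumulator, one final
   truncation. Unsigned arithmetic wraps, so there is no signed overflow. */
short foo_wide(const int *__restrict__ a, const int *__restrict__ b, int n, int m)
{
    unsigned int acc = 0;
    for (int j = m - 1; j >= 0; j -= 1) {
        for (int i = n - 1; i >= 0; i -= 1) {
            int idx = j * m + i;
            if (a[idx] > b[idx] && b[idx] != 0) {
                acc += (unsigned int)a[idx * 2];
                acc += (unsigned int)b[idx * 2];
            }
        }
    }
    return (short)acc; /* implementation-defined; GCC reduces modulo 2^16 */
}

/* Hypothetical helper: returns 1 if both kernels agree on a small input
   whose values are large enough to exercise 16-bit wraparound. */
int outputs_match(void)
{
    enum { N = 4, M = 4, LEN = 31 };
    int a[LEN], b[LEN];

    /* The largest index read is 2 * ((M - 1) * M + (N - 1)). */
    assert(2 * ((M - 1) * M + (N - 1)) < LEN);

    for (int k = 0; k < LEN; k++) {
        a[k] = 40000 * k - 7;
        b[k] = (k % 3 == 0) ? 0 : 30000 * k - 50000;
    }
    return foo(a, b, N, M) == foo_wide(a, b, N, M);
}
```

Compiling `foo_wide` with the options listed above and comparing its
`-fopt-info-vec-all` output against `foo` could help show whether the failure is
tied to the `short` accumulator specifically or to the conditional gather
reduction in general.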