https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125218

            Bug ID: 125218
           Summary: [15/16 Regression] AArch64 SVE: conditional reduction
                    with int→short narrowing fails to vectorize on trunk,
                    regression from GCC 15.2
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**
GCC trunk fails to vectorize a conditional reduction loop when the accumulated
value involves narrowing from `int` to `short`. The vectorizer reports
"unsupported use in stmt" and generates fully scalar code, even with
`-fvect-cost-model=unlimited`.

GCC 15.2.0 successfully vectorizes the inner loop for this same `int` → `short`
case, using SVE MASK_GATHER_LOAD with predicated narrowing adds and `uaddv`
reduction. Trunk rejects it entirely, with all vector modes from VNx4SI down to
V2SI failing.

**Test case:**
```c
short foo(
    const int * __restrict__ a,
    const int * __restrict__ b,
    int n, int m) {
    short sum = 0;
    for (int j = m - 1; j >= 0; j -= 1)
    {
        for (int i = n - 1; i >= 0; i -= 1)
        {
            int idx = j * m + i;
            if ((a[idx] > b[idx] && b[idx] != 0)) {
                sum += (short)a[(idx * 2)];
                sum += (short)b[(idx * 2)];
            }
        }
    }
    return (short)sum;
}
```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```

**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all 
```

**GCC trunk output:**
```
<source>:7:27: missed: couldn't vectorize loop
<source>:7:27: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
<source>:9:31: missed: couldn't vectorize loop
<source>:1:7: missed: not vectorized: unsupported use in stmt.
<source>:1:7: note: vectorized 0 loops in function.
<source>:14:21: note: ***** Analysis failed with vector mode VNx4SI
<source>:14:21: note: ***** The result for vector mode VNx16QI would be the same
<source>:14:21: note: ***** The result for vector mode VNx8QI would be the same
<source>:14:21: note: ***** Re-trying analysis with vector mode VNx4QI
<source>:14:21: note: ***** Analysis failed with vector mode VNx4QI
<source>:14:21: note: ***** Re-trying analysis with vector mode VNx2QI
<source>:14:21: note: ***** Analysis failed with vector mode VNx2QI
<source>:14:21: note: ***** Re-trying analysis with vector mode V16QI
<source>:14:21: note: ***** Analysis failed with vector mode V16QI
...
```

**Generated assembly on trunk (fully scalar, no SIMD instructions used):**
```assembly
foo:
        subs    w13, w3, #1
        bmi     .L9
        mul     w12, w13, w3
        sub     w9, w12, #1
        lsl     w9, w9, 1
        cmp     w2, 0
        bgt     .L17
.L4:
        sub     w9, w9, w3, lsl 1
        sub     w12, w12, w3
        cbz     w13, .L9
        sub     w13, w13, #1
        cmp     w2, 0
        ble     .L4
.L17:
        mov     w11, 0
.L7:
        add     w7, w12, w2
        add     w5, w9, w2, lsl 1
        mov     x4, 0
        lsl     x7, x7, 2
        sub     x7, x7, #4
        add     x10, x0, x7
        add     x7, x1, x7
        b       .L6
.L5:
        sub     w5, w5, #2
        cmp     w5, w9
        beq     .L18
.L6:
        ldr     w6, [x7, x4]
        ldr     w8, [x10, x4]
        sub     x4, x4, #4
        cmp     w6, 0
        ccmp    w6, w8, 0, ne
        bge     .L5
        ubfiz   x8, x5, 2, 32
        sub     w5, w5, #2
        ldr     w6, [x1, x8]
        ldrh    w8, [x0, x8]
        add     w6, w6, w8
        add     w11, w6, w11
        cmp     w5, w9
        bne     .L6
.L18:
        sub     w9, w9, w3, lsl 1
        sub     w12, w12, w3
        cbz     w13, .L3
        sub     w13, w13, #1
        b       .L7
.L9:
        mov     w11, 0
.L3:
        mov     w0, w11
        ret
```

Also reproducible on Godbolt: https://godbolt.org/z/M8h4jdhxo.

**GCC 15.2.0 (for comparison):**
```
<source>:7:27: missed: couldn't vectorize loop
<source>:7:27: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
<source>:9:31: optimized: loop vectorized using variable length vectors
<source>:1:7: note: vectorized 1 loops in function.
<source>:1:7: missed: statement clobbers memory: vect_patt_88.31_210 = .MASK_GATHER_LOAD (_54, vect__8.30_208, 4, { 0, ... }, mask__38.24_193, { 0, ... });
<source>:1:7: missed: statement clobbers memory: vect_patt_88.32_211 = .MASK_GATHER_LOAD (_54, vect__8.30_209, 4, { 0, ... }, mask__38.24_194, { 0, ... });
<source>:1:7: missed: statement clobbers memory: vect_patt_87.34_213 = .MASK_GATHER_LOAD (_52, vect__8.30_208, 4, { 0, ... }, mask__38.24_193, { 0, ... });
<source>:1:7: missed: statement clobbers memory: vect_patt_87.35_214 = .MASK_GATHER_LOAD (_52, vect__8.30_209, 4, { 0, ... }, mask__38.24_194, { 0, ... });
```

**Key vectorized portion on GCC 15.2.0 (inner loop, showing SVE gather +
narrowing add + `uaddv` reduction):**
```assembly
.L4:
        ld1w    z24.s, p7/z, [x5]
        ld1w    z2.s, p7/z, [x6]
        rev     z24.s, z24.s
        ld1w    z3.s, p7/z, [x5, #-1, mul vl]
        ld1w    z1.s, p7/z, [x6, #-1, mul vl]
        rev     z3.s, z3.s
        rev     z2.s, z2.s
        cmpne   p5.s, p7/z, z24.s, #0
        cmpne   p6.s, p7/z, z3.s, #0
        cmpgt   p5.s, p5/z, z2.s, z24.s
        rev     z1.s, z1.s
        inch    x4
        cmpgt   p6.s, p6/z, z1.s, z3.s
        add     z28.s, z31.s, z30.s
        mov     z27.d, z30.d
        add     z28.s, z28.s, z28.s
        uzp1    p4.h, p5.h, p6.h
        ld1w    z0.s, p5/z, [x0, z28.s, sxtw 2]
        ld1w    z25.s, p5/z, [x1, z28.s, sxtw 2]
        decw    z27.s
        decb    x5, all, mul #2
        add     z27.s, z27.s, z31.s
        decb    x6, all, mul #2
        add     z27.s, z27.s, z27.s
        decw    z30.s, all, mul #2
        ld1w    z26.s, p6/z, [x0, z27.s, sxtw 2]
        ld1w    z23.s, p6/z, [x1, z27.s, sxtw 2]
        uzp1    z26.h, z0.h, z26.h
        uzp1    z23.h, z25.h, z23.h
        add     z23.h, z23.h, z26.h
        add     z29.h, p4/m, z29.h, z23.h
        cmp     w10, w4
        bcs     .L4
        uaddv   d31, p7, z29.h
        fmov    x5, d31
        add     w11, w11, w5
```

The GCC 15.2.0 result can be seen on Godbolt: https://godbolt.org/z/c9Yrqsqdv

**Additional notes:**

1. This is a regression from GCC 15.2, which successfully vectorized the inner
loop using SVE MASK_GATHER_LOAD with predicated narrowing adds and `uaddv`
halfword reduction.

2. `-fvect-cost-model=unlimited` has no effect on trunk, indicating this is a
capability failure rather than a cost-model decision.

3. On trunk, "unsupported use in stmt" is reported against the function
(`<source>:1:7`), while the per-mode "Analysis failed" notes point at source
line 14 (`sum += (short)a[(idx * 2)]`), where the narrowing conversion meets
the conditional reduction pattern.
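
A possible way to narrow down which ingredient triggers the failure is to
compile two control variants that each remove one of them. The sketch below is
only a suggestion: the names `foo_wide_acc` and `foo_uncond` are hypothetical,
and whether either variant vectorizes on trunk has not been checked here.
`foo_wide_acc` keeps the condition but accumulates in a 32-bit unsigned value
and narrows once at the end (this gives the same result as `foo`, assuming
GCC's modulo behaviour for out-of-range conversions to `short`). `foo_uncond`
keeps the narrowing accumulation but drops the condition.

```c
/* Hypothetical triage variants of foo(); not part of the original report. */

/* Control 1: conditional reduction kept, narrowing moved out of the loop.
   The accumulator is unsigned so that wraparound is well defined. */
short foo_wide_acc(
    const int * __restrict__ a,
    const int * __restrict__ b,
    int n, int m) {
    unsigned sum = 0;
    for (int j = m - 1; j >= 0; j -= 1)
    {
        for (int i = n - 1; i >= 0; i -= 1)
        {
            int idx = j * m + i;
            if ((a[idx] > b[idx] && b[idx] != 0)) {
                sum += (unsigned)a[(idx * 2)];
                sum += (unsigned)b[(idx * 2)];
            }
        }
    }
    /* Single narrowing step after the loop nest. */
    return (short)sum;
}

/* Control 2: narrowing accumulation kept, condition removed. */
short foo_uncond(
    const int * __restrict__ a,
    const int * __restrict__ b,
    int n, int m) {
    short sum = 0;
    for (int j = m - 1; j >= 0; j -= 1)
    {
        for (int i = n - 1; i >= 0; i -= 1)
        {
            int idx = j * m + i;
            sum += (short)a[(idx * 2)];
            sum += (short)b[(idx * 2)];
        }
    }
    return sum;
}
```

Building both with the options listed above and comparing the
`-fopt-info-vec-all` output against the original `foo` would show whether the
rejection depends on the narrowing, on the condition, or only on the
combination of the two.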
