https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125523

            Bug ID: 125523
           Summary: [AArch64] GCC fails to SLP vectorize 4-int horizontal
                    reduction when mixed with other scalar operations in
                    same loop
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**

The test case contains a loop that accumulates values from two different struct
arrays:
- One `long` value from `a[idx*3].m0`
- Four `int` values from `b[idx*3].m0` through `b[idx*3].m3`

The four `int` members of `b` are consecutive in memory (16 bytes total) and
are a natural fit for SIMD horizontal reduction: load all four with a single
128-bit load, then sum them with a single SIMD instruction.
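
As a rough illustration of the reduction described above, the sketch below spells it out by hand with ACLE NEON intrinsics. The helper name `sum4_widening` is hypothetical and not part of the report; `vld1q_s32` and `vaddlvq_s32` come from `<arm_neon.h>` and are expected to map to a 128-bit load and `saddlv` respectively. The scalar fallback is only there so the snippet also builds on non-AArch64 hosts.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the report): widening sum of four
 * consecutive 32-bit ints starting at p. */
static inline int64_t sum4_widening(const int32_t *p) {
#if defined(__aarch64__)
    int32x4_t v = vld1q_s32(p);  /* one 128-bit load of all four ints */
    return vaddlvq_s32(v);       /* widening horizontal add across lanes */
#else
    /* Portable fallback that computes the same 64-bit result. */
    return (int64_t)p[0] + p[1] + p[2] + p[3];
#endif
}
```

This is the code shape the vectorizer would ideally produce on its own for the four `b` members; it is not proposed as a source-level workaround.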

**Clang** successfully vectorizes this pattern:
- Loads all four `int` members with `ldr q0`
- Uses `saddlv d0, v0.4s` to sum them horizontally in one instruction
- Handles the `a` member as a separate scalar operation

**GCC** fails to vectorize the four `int` reduction:
- Loads `b.m0` with `ldrsw` (scalar)
- Uses `ldpsw` to load `b.m1` and `b.m2` together (a load-pair memory optimization, not SIMD)
- Loads `b.m3` with a separate `ldrsw`
- Performs a separate scalar `add` for each member

The presence of `a.m0` (a scalar operation from a different struct) appears to
prevent GCC's SLP vectorizer from recognizing and packing the four consecutive
`int` members of `b`.

**Test case:**
```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    long m0;
} element_t_0;

typedef struct {
    int m0;
    int m1;
    int m2;
    int m3;
} element_t_1;

double foo(
    const element_t_0 * __restrict__ a,
    const element_t_1 * __restrict__ b,
    int n, int m, int l) {
    long sum = 0;
    for (int i = n - 1; i >= 0; i -= 1) {
        for (int j = m - 1; j >= 0; j -= 4) {
            for (int k = l - 1; k >= 0; k -= 1) {
                int idx = (i * m + j) * l + k;
                sum += (long)a[(idx * 3)].m0;
                sum += (long)b[(idx * 3)].m0;
                sum += (long)b[(idx * 3)].m1;
                sum += (long)b[(idx * 3)].m2;
                sum += (long)b[(idx * 3)].m3;
                if ((b[idx].m0 < 0)) {
                    break;
                }
            }
        }
    }
    return (double)sum;
}
```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) 
```

**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all -fno-trapping-math
-fvect-cost-model=unlimited
```

**GCC assembly (key loop portion):**
```assembly
.L10:
        ldr     x7, [x9]              ; a.m0 (scalar)
        ldrsw   x6, [x2]              ; b.m0 (scalar)
        add     x3, x3, x7            ; add a.m0
        add     x3, x6, x3            ; add b.m0
        ldpsw   x7, x6, [x2, 4]       ; b.m1 and b.m2 (load pair only, not SIMD)
        add     x7, x7, x3            ; add b.m1
        add     x6, x6, x7            ; add b.m2
        ldrsw   x3, [x2, 60]          ; b.m3 (scalar)
        add     x3, x3, x6            ; add b.m3
```

Note: GCC uses `ldpsw` to load two 32-bit values together, but this is a memory
pairing optimization, not SIMD vectorization. Each value is still added
separately.

Also reproducible on Godbolt:
https://godbolt.org/z/6Yh1jqc8a

**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```

**Clang compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu
-march=armv9-a+sve -Rpass=.*vectorize.* -Rpass-missed=.*vectorize.*
-Rpass-analysis=.*vectorize.*
```

**Clang assembly (key vectorized portion):**
```assembly
.LBB0_7:
        ldr     q0, [x19], #-48       ; load b.m0,b.m1,b.m2,b.m3 into SIMD register
        ldr     x22, [x20], #-24      ; load a.m0 (scalar)
        saddlv  d0, v0.4s             ; horizontal sum of all 4 ints
        fmov    x23, d0
        add     x9, x23, x9           ; add b sum to total
        add     x9, x9, x22           ; add a.m0 to total
```
Also reproducible on Godbolt:
https://godbolt.org/z/5x3cnEox5

**Additional notes:**

1. **The missed optimization:** Four consecutive `int` members (`m0` through
`m3`) of `b` occupy 16 contiguous bytes and can be loaded with a single 128-bit
SIMD load, then summed horizontally with `saddlv`. In the listings above, this
replaces GCC's three load instructions and four scalar adds for `b` with one
vector load, one `saddlv`, one `fmov`, and one add.

2. **When GCC succeeds:** GCC successfully vectorizes this pattern when
**only** the four `b` members are present (no `a.m0` in the loop body).

3. **When GCC fails:** Adding a single scalar operation from a different struct
(`a.m0`) prevents GCC from recognizing the SIMD opportunity on `b`'s members.
GCC only uses `ldpsw` (load pair) as a memory optimization, not full SIMD.

4. **Clang handles mixed sources correctly:** Clang independently vectorizes
the four `b` members while keeping `a.m0` as a scalar operation.
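
For comparison, a reduced variant matching note 2 might look like the sketch below. The report does not include the exact variant that was tried, so this is an assumption: it is the original loop nest with the `a[...].m0` term and the `a` parameter removed, and the name `foo_b_only` is hypothetical.

```c
typedef struct {
    int m0;
    int m1;
    int m2;
    int m3;
} element_t_1;

/* Hypothetical reduced variant: same loop nest as foo, minus a[...].m0. */
double foo_b_only(const element_t_1 * __restrict__ b, int n, int m, int l) {
    long sum = 0;
    for (int i = n - 1; i >= 0; i -= 1) {
        for (int j = m - 1; j >= 0; j -= 4) {
            for (int k = l - 1; k >= 0; k -= 1) {
                int idx = (i * m + j) * l + k;
                /* Only the four consecutive int members of b are reduced. */
                sum += (long)b[idx * 3].m0;
                sum += (long)b[idx * 3].m1;
                sum += (long)b[idx * 3].m2;
                sum += (long)b[idx * 3].m3;
                if (b[idx].m0 < 0) {
                    break;
                }
            }
        }
    }
    return (double)sum;
}
```

Compiling this and the original `foo` with the same options and comparing the `-fopt-info-vec-all` output should help isolate which SLP analysis step rejects the `b` group once `a.m0` joins the reduction chain.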
