https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125521

            Bug ID: 125521
           Summary: [ARM64] Missed SLP vectorization for adjacent
                    int-to-float conversion and horizontal addition
                    (unsupported SLP instances)
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**

The test case is a loop with an early exit (`break`), which prevents loop
vectorization. The loop body, however, contains independent operations that are
candidates for SLP (superword-level parallelism) vectorization: it loads two
adjacent `int32_t` members (`m0`, `m1`) from a struct (`element_t_0`), converts
both to `float`, adds them, and then adds a third value, an `int32_t` loaded
from another struct (`element_t_1`) and likewise converted to `float`.

**Clang** successfully applies SLP vectorization to the loop body, generating
SIMD code:
- `ldr d0` to load both integers into a 64-bit register
- `scvtf v0.2s` to convert both to `float` simultaneously
- `faddp` to perform horizontal addition
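
For illustration, the three steps above can be written out in portable C. This
is only a sketch using the GNU vector extensions (`vector_size`,
`__builtin_convertvector`) that both GCC and Clang accept; the names `pair_t`,
`body_vec2`, and `body_scalar` are hypothetical and not part of the test case,
and nothing is claimed here about the code either compiler emits for it.

```c
#include <stdint.h>
#include <string.h>

/* Two-lane vector types (GNU vector extension, GCC and Clang). */
typedef int32_t v2si __attribute__((vector_size(8)));
typedef float   v2sf __attribute__((vector_size(8)));

/* Hypothetical stand-in for element_t_0: two adjacent int32_t members. */
typedef struct { int32_t m0; int32_t m1; } pair_t;

/* Vector form of one loop-body evaluation. */
static float body_vec2(const pair_t *a, const int32_t *b)
{
    v2si iv;
    memcpy(&iv, a, sizeof iv);                    /* one 64-bit load of m0 and m1 */
    v2sf fv = __builtin_convertvector(iv, v2sf);  /* convert both lanes to float */
    return (fv[0] + fv[1]) + (float)*b;           /* horizontal add, then scalar add */
}

/* Scalar form, same operation order as the test case. */
static float body_scalar(const pair_t *a, const int32_t *b)
{
    return ((float)a->m0 + (float)a->m1) + (float)*b;
}
```

Both functions perform the same additions in the same order, so they return
identical results; the vector form only groups the two conversions into a
single operation.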

**GCC** with `-fno-trapping-math -fvect-cost-model=unlimited` reports
`unsupported SLP instances` and does not vectorize. The two adjacent loads are
merged into a single `ldp`, but all three conversions remain scalar `scvtf`
instructions.

**Test case:**
```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

typedef struct {
    int32_t m0;
} element_t_1;

float foo(
    const element_t_0 * __restrict__ a,
    const element_t_1 * __restrict__ b,
    float * __restrict__ out,
    int n
) {
    for (int i = n - 1; i >= 0; i -= 1)
    {
        int idx = i;
        out[idx] = (((float)a[(idx + 22)].m0) +
                    ((float)a[(idx + 22)].m1) +
                    ((float)b[(idx + 22)].m0));
        if ((b[idx].m0 < 1)) {
            break;
        }
    }
    return (float)0;
}
```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```

**Compilation options:**
```
-O3 -march=armv9-a+sve -fno-trapping-math -fvect-cost-model=unlimited
-fopt-info-vec-all
```

**GCC opt-info output:**
```
<source>:19:27: missed: couldn't vectorize loop
<source>:19:27: missed: unsupported SLP instances
<source>:13:7: note: vectorized 0 loops in function.
<source>:27:12: note: ***** Analysis failed with vector mode VNx4SI
<source>:27:12: note: ***** Skipping vector mode VNx16QI, which would repeat
the analysis for VNx4SI
```

**GCC assembly (key loop portion):**
```assembly
.L3:
        ldp     s0, s30, [x0, 168]
        sub     x1, x1, #4
        ldr     s31, [x1, 88]
        sub     x0, x0, #8
        ldr     w3, [x1]
        sub     x2, x2, #4
        scvtf   s0, s0
        scvtf   s30, s30
        scvtf   s31, s31
        fadd    s30, s0, s30
        fadd    s31, s30, s31
        str     s31, [x2, 4]
        cmp     w3, 0
        bgt     .L3
```

Also reproducible on Godbolt:
https://godbolt.org/z/Pcej7c66f

**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```

**Clang compilation options:**
```
-O3 -march=armv9-a+sve -Rpass=.*vectorize.* -Rpass-missed=.*vectorize.*
-Rpass-analysis=.*vectorize.*
```

**Clang opt-info output:**
```
<source>:19:5: remark: loop not vectorized: Cannot vectorize early exit loop
[-Rpass-analysis=loop-vectorize]
<source>:19:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
<source>:22:22: remark: SLP vectorized with cost -1 and with tree size 2
[-Rpass=slp-vectorizer]
```

**Clang assembly (key vectorized portion):**
```assembly
.LBB0_1:
        ldr     d0, [x9, x11, lsl #3]
        ldr     s1, [x10, #88]
        scvtf   v0.2s, v0.2s
        scvtf   s1, s1
        faddp   s0, v0.2s
        fadd    s0, s0, s1
        str     s0, [x8, x11, lsl #2]
```

Also reproducible on Godbolt:
https://godbolt.org/z/v5vPc7bo7

**Additional notes:**

1. **Loop vectorization is correctly rejected:** The early exit (`break`)
prevents loop vectorization in both GCC and Clang.

2. **SLP vectorization is attempted but fails in GCC:** GCC explicitly reports
`unsupported SLP instances`, so SLP analysis ran and rejected the candidates.
Because the run uses `-fvect-cost-model=unlimited`, the rejection is not a
cost-model decision.

3. **Clang succeeds:** Clang's SLP vectorizer successfully handles this pattern
even though loop vectorization is rejected.
