https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125521
Bug ID: 125521
Summary: [ARM64] Missed SLP vectorization for adjacent
int-to-float conversion and horizontal addition
(unsupported SLP instances)
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
The test case involves a loop with an early exit (`break`), which prevents loop
vectorization. However, the loop body contains independent operations that are
candidates for SLP (Superword-Level Parallelism) vectorization: loading two
adjacent `int32_t` members (`m0`, `m1`) from a struct (`element_t_0`),
converting them to `float`, adding them together, and then adding a third
value, an `int32_t` member loaded from another struct (`element_t_1`) and
likewise converted to `float`.
**Clang** successfully applies SLP vectorization to the loop body, generating
SIMD code:
- `ldr d0` to load both integers into a 64-bit register
- `scvtf v0.2s` to convert both to `float` simultaneously
- `faddp` to perform horizontal addition
**GCC** with `-fno-trapping-math -fvect-cost-model=unlimited` reports
`unsupported SLP instances` and fails to vectorize. It emits three scalar
`scvtf` instructions, even though the two adjacent loads are already merged
into a single `ldp`.
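The SLP shape described above (one 64-bit load, a two-lane conversion, a horizontal add) can be written by hand with the GNU C vector extensions. This is only an illustrative sketch: `body_slp`, `v2si` and `v2sf` are hypothetical names that do not appear in the test case, and it assumes a compiler that provides `__builtin_convertvector` (GCC 9 or later, or Clang). It makes no claim about the exact instructions either compiler selects for it.

```c
#include <stdint.h>
#include <string.h>

typedef struct { int32_t m0; int32_t m1; } element_t_0;
typedef struct { int32_t m0; } element_t_1;

/* Hypothetical two-lane vector types (GNU C vector extension). */
typedef int32_t v2si __attribute__((vector_size(8)));
typedef float   v2sf __attribute__((vector_size(8)));

/* Hand-written equivalent of one iteration's arithmetic. */
static inline float body_slp(const element_t_0 *a, const element_t_1 *b)
{
    v2si vi;
    /* Load m0 and m1 together; memcpy avoids alignment and aliasing issues. */
    memcpy(&vi, a, sizeof vi);
    /* Convert both lanes from int32_t to float in one operation. */
    v2sf vf = __builtin_convertvector(vi, v2sf);
    /* Horizontal add of the two lanes, then the third (scalar) term.
       The association order matches the scalar source: (m0 + m1) + b. */
    return (vf[0] + vf[1]) + (float)b->m0;
}
```

Because the additions keep the same left-to-right association as the original expression, this form should not need `-ffast-math` or `-fassociative-math` to be a valid transformation.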
**Test case:**
```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

typedef struct {
    int32_t m0;
} element_t_1;

float foo(
    const element_t_0 * __restrict__ a,
    const element_t_1 * __restrict__ b,
    float * __restrict__ out,
    int n
) {
    for (int i = n - 1; i >= 0; i -= 1)
    {
        int idx = i;
        out[idx] = (((float)a[(idx + 22)].m0) +
                    ((float)a[(idx + 22)].m1) +
                    ((float)b[(idx + 22)].m0));
        if ((b[idx].m0 < 1)) {
            break;
        }
    }
    return (float)0;
}
```
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```
**Compilation options:**
```
-O3 -march=armv9-a+sve -fno-trapping-math -fvect-cost-model=unlimited
-fopt-info-vec-all
```
**GCC opt-info output:**
```
<source>:19:27: missed: couldn't vectorize loop
<source>:19:27: missed: unsupported SLP instances
<source>:13:7: note: vectorized 0 loops in function.
<source>:27:12: note: ***** Analysis failed with vector mode VNx4SI
<source>:27:12: note: ***** Skipping vector mode VNx16QI, which would repeat the analysis for VNx4SI
```
**GCC assembly (key loop portion):**
```assembly
.L3:
ldp s0, s30, [x0, 168]
sub x1, x1, #4
ldr s31, [x1, 88]
sub x0, x0, #8
ldr w3, [x1]
sub x2, x2, #4
scvtf s0, s0
scvtf s30, s30
scvtf s31, s31
fadd s30, s0, s30
fadd s31, s30, s31
str s31, [x2, 4]
cmp w3, 0
bgt .L3
```
Also reproducible on Godbolt:
https://godbolt.org/z/Pcej7c66f
**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```
**Clang compilation options:**
```
-O3 -march=armv9-a+sve -Rpass=.*vectorize.* -Rpass-missed=.*vectorize.*
-Rpass-analysis=.*vectorize.*
```
**Clang opt-info output:**
```
<source>:19:5: remark: loop not vectorized: Cannot vectorize early exit loop [-Rpass-analysis=loop-vectorize]
<source>:19:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
<source>:22:22: remark: SLP vectorized with cost -1 and with tree size 2 [-Rpass=slp-vectorizer]
```
**Clang assembly (key vectorized portion):**
```assembly
.LBB0_1:
ldr d0, [x9, x11, lsl #3]
ldr s1, [x10, #88]
scvtf v0.2s, v0.2s
scvtf s1, s1
faddp s0, v0.2s
fadd s0, s0, s1
str s0, [x8, x11, lsl #2]
```
Also reproducible on Godbolt:
https://godbolt.org/z/v5vPc7bo7
**Additional notes:**
1. **Loop vectorization is correctly rejected:** The early exit (`break`)
prevents loop vectorization in both GCC and Clang.
2. **SLP vectorization is attempted but fails in GCC:** Even with
`-fno-trapping-math -fvect-cost-model=unlimited`, GCC reports
`unsupported SLP instances`. This indicates that SLP analysis ran and rejected
the candidate, and suggests the rejection is not a cost-model decision.
3. **Clang succeeds:** Clang's SLP vectorizer handles this pattern (reported as
`SLP vectorized with cost -1 and with tree size 2`) even though its loop
vectorizer rejects the loop.
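
To separate the straight-line SLP question from the early-exit loop, the loop body could be pulled into a standalone function. The following is a possible reduction only. The name `foo_body` is hypothetical, and whether GCC trunk vectorizes this form has not been checked here.

```c
#include <stdint.h>

typedef struct { int32_t m0; int32_t m1; } element_t_0;
typedef struct { int32_t m0; } element_t_1;

/* Loop body of foo() without the loop, the store, or the early exit. */
float foo_body(const element_t_0 *a, const element_t_1 *b)
{
    /* Same association order as the original expression. */
    return ((float)a->m0 + (float)a->m1) + (float)b->m0;
}
```

Compiling this with the same options as the full test case would show whether the two adjacent conversions are combined when no loop is involved.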