https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125522
Bug ID: 125522
Summary: [ARM64] SLP build fails for loop with ternary operator
and adjacent int-to-double conversion
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
The test case contains a loop whose body combines a conditional load produced
by a ternary operator (`idx >= 1 ? out[idx-1] : 0.0`) with two adjacent
`long`-to-`double` conversions of the members of one struct (`a[idx].m0` and
`a[idx].m1`).
**Clang** successfully applies SLP vectorization to the two conversions:
- Loads both `long` members with `ldr q1`
- Converts both to `double` simultaneously with `scvtf v1.2d`
- Performs horizontal addition with `faddp d1, v1.2d`
**GCC** attempts SLP vectorization but fails with:
- `not suitable for strided load iftmp.0_34 = .MASK_LOAD (...)` - the ternary
operator's conditional load, which appears here as a `.MASK_LOAD`, is rejected
as a strided load that cannot be vectorized
- `SLP build failed` - with the ternary operator in the same expression, SLP
construction appears to abort entirely instead of vectorizing the two
conversions on their own
**Test case:**
```c
#include <stdint.h>
#include <stddef.h>
typedef struct {
    long m0;
    long m1;
} element_t_0;
short foo(
    const element_t_0 * __restrict__ a,
    double * __restrict__ out,
    int n, int m, int l) {
    for (int j = 0; j < m; j += 1) {
        for (int k = 0; k < l; k += 1) {
            for (int i = 0; i < n; i += 1) {
                int idx = (i * m + j) * l + k;
                out[idx] = ((idx) >= 1 ? out[(idx) - 1] : (double)0) +
                           (((double)a[idx].m0) + ((double)a[idx].m1));
            }
        }
    }
    return (short)0;
}
```
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental)
```
**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all -fno-trapping-math -fvect-cost-model=unlimited
```
**GCC opt-info output:**
```
<source>:14:23: missed: couldn't vectorize loop
<source>:14:23: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
<source>:16:27: missed: couldn't vectorize loop
<source>:21:57: missed: not vectorized: not suitable for strided load iftmp.0_34 = .MASK_LOAD (_6, 64B, _102, 0.0);
<source>:18:31: missed: couldn't vectorize loop
<source>:18:31: missed: SLP build failed.
<source>:9:7: note: vectorized 0 loops in function.
```
**GCC assembly (key loop portion - scalar only):**
```assembly
.L5:
movi d31, #0
add w8, w8, 1
cbz w7, .L4
ldr d31, [x6, -8]
.L4:
ldp d29, d30, [x5]
add w7, w7, w9
add x5, x5, x11
scvtf d29, d29
scvtf d30, d30
fadd d29, d30, d29
fadd d29, d29, d31
str d29, [x6]
add x6, x6, x10
```
Note: GCC uses `ldp` to load both `long` members but performs scalar `scvtf`
and `fadd`.
Also reproducible on Godbolt:
https://godbolt.org/z/P7naKjM6b
**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```
**Clang compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu -march=armv9-a+sve -Rpass=.*vectorize.* -Rpass-missed=.*vectorize.* -Rpass-analysis=.*vectorize.*
```
**Clang opt-info output:**
```
<source>:21:26: remark: loop not vectorized: unsafe dependent memory operations in loop.
<source>:18:13: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
<source>:21:74: remark: SLP vectorized with cost -1 and with tree size 2 [-Rpass=slp-vectorizer]
```
**Clang assembly (key vectorized portion):**
```assembly
.LBB0_8:
ldur d0, [x20, #-8]
.LBB0_9:
ldr q1, [x19]
scvtf v1.2d, v1.2d
faddp d1, v1.2d
fadd d0, d0, d1
str d0, [x20]
```
Also reproducible on Godbolt:
https://godbolt.org/z/8cWa8Mvra
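For readers less familiar with the AArch64 sequence, the three vector instructions Clang emits (`ldr q1`, `scvtf v1.2d`, `faddp d1, v1.2d`) can be sketched in C with the GCC/Clang vector extensions. This is only an illustration, not part of the original test case: the function name `pair_sum_slp` and the typedefs `v2l` and `v2d` are hypothetical, and it assumes a compiler that provides `__builtin_convertvector` and a struct layout without padding.

```c
#include <string.h>

typedef struct {
    long m0;
    long m1;
} element_t_0;

/* Two-lane vector types (GCC/Clang vector extension); names are hypothetical. */
typedef long   v2l __attribute__((vector_size(2 * sizeof(long))));
typedef double v2d __attribute__((vector_size(2 * sizeof(double))));

/* Hand-written equivalent of the SLP-vectorized pair conversion. */
double pair_sum_slp(const element_t_0 *e)
{
    v2l lanes;
    /* One wide load of both members (ldr q1 in the Clang output). */
    memcpy(&lanes, e, sizeof lanes);
    /* Convert both lanes at once (scvtf v1.2d, v1.2d). */
    v2d conv = __builtin_convertvector(lanes, v2d);
    /* Horizontal add of the two lanes (faddp d1, v1.2d). */
    return conv[0] + conv[1];
}
```

The result is the same value the scalar expression `(double)a[idx].m0 + (double)a[idx].m1` produces; the conditional load of `out[idx-1]` stays scalar in both compilers' output.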
**Additional notes:**
1. **SLP vectorization is attempted but fails in GCC:** The diagnostic
`not suitable for strided load iftmp.0_34 = .MASK_LOAD` indicates that GCC
treats the ternary operator's conditional load as a strided load pattern that
cannot be vectorized. This appears to cause the entire SLP build to fail
(`SLP build failed`).
2. **Clang succeeds:** Despite the same ternary operator and loop-carried
dependency, Clang's SLP vectorizer successfully vectorizes
`((double)a[idx].m0) + ((double)a[idx].m1)` independently. The ternary operator
does not block SLP vectorization.
3. **Pattern summary:** The missed optimization is:
- Two adjacent 64-bit integer loads from the same struct
- Conversion to `double` (both)
- Addition of the two converted values (horizontal operation)
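One way to check whether the conditional load is what blocks the SLP build would be to compile a reduced kernel that keeps only the pattern summarized above. The sketch below is a hypothetical reduction, not part of the original report: the name `pair_sum_only` is made up here, and whether GCC vectorizes it with the options above has not been verified.

```c
typedef struct {
    long m0;
    long m1;
} element_t_0;

/* Hypothetical reduced kernel: two adjacent long loads, two conversions to
   double, and one addition per element. The ternary operator and the
   out[idx-1] dependency from the original test case are left out. */
void pair_sum_only(const element_t_0 * __restrict__ a,
                   double * __restrict__ out,
                   int n)
{
    for (int i = 0; i < n; i += 1) {
        out[i] = ((double)a[i].m0) + ((double)a[i].m1);
    }
}
```

If this variant is vectorized while the original is not, that would support the reading that the `.MASK_LOAD` rejection is what aborts the SLP build.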