https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125522

            Bug ID: 125522
           Summary: [ARM64]SLP build fails for loop with ternary operator
                    and adjacent int-to-double conversion
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**

The test case contains a loop with a ternary operator that creates a
conditional load (`idx >= 1 ? out[idx-1] : 0.0`) and an adjacent pair of `long`
to `double` conversions from a struct member pair (`a[idx].m0` and
`a[idx].m1`).

**Clang** successfully applies SLP vectorization to the two conversions:
- Loads both `long` members with `ldr q1`
- Converts both to `double` simultaneously with `scvtf v1.2d`
- Performs horizontal addition with `faddp d1, v1.2d`

**GCC** attempts SLP vectorization but fails with:
- `not suitable for strided load iftmp.0_34 = .MASK_LOAD (...)` - the ternary
operator's conditional load is considered a strided load that cannot be
vectorized
- `SLP build failed` - the presence of the ternary operator in the same
expression causes SLP construction to abort entirely

**Test case:**
```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    long m0;
    long m1;
} element_t_0;

short foo(
    const element_t_0 * __restrict__ a,
    double * __restrict__ out,
    int n, int m, int l) {
    for (int j = 0; j < m; j += 1) {
        for (int k = 0; k < l; k += 1) {
            for (int i = 0; i < n; i += 1) {
                int idx = (i * m + j) * l + k;
                out[idx] = ((idx) >= 1 ? out[(idx) - 1] : (double)0) +
                           (((double)a[idx].m0) + ((double)a[idx].m1));
            }
        }
    }
    return (short)0;
}
```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) 
```

**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all -fno-trapping-math
-fvect-cost-model=unlimited
```

**GCC opt-info output:**
```
<source>:14:23: missed: couldn't vectorize loop
<source>:14:23: missed: not vectorized: loop nest containing two or more
consecutive inner loops cannot be vectorized
<source>:16:27: missed: couldn't vectorize loop
<source>:21:57: missed: not vectorized: not suitable for strided load
iftmp.0_34 = .MASK_LOAD (_6, 64B, _102, 0.0);
<source>:18:31: missed: couldn't vectorize loop
<source>:18:31: missed: SLP build failed.
<source>:9:7: note: vectorized 0 loops in function.
```

**GCC assembly (key loop portion - scalar only):**
```assembly
.L5:
        movi    d31, #0
        add     w8, w8, 1
        cbz     w7, .L4
        ldr     d31, [x6, -8]
.L4:
        ldp     d29, d30, [x5]
        add     w7, w7, w9
        add     x5, x5, x11
        scvtf   d29, d29
        scvtf   d30, d30
        fadd    d29, d30, d29
        fadd    d29, d29, d31
        str     d29, [x6]
        add     x6, x6, x10
```

Note: GCC uses `ldp` to load both `long` members but performs scalar `scvtf`
and `fadd`.

Also reproducible on Godbolt:
https://godbolt.org/z/P7naKjM6b

**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```

**Clang compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu
-march=armv9-a+sve -Rpass=.*vectorize.* -Rpass-missed=.*vectorize.*
-Rpass-analysis=.*vectorize.*
```

**Clang opt-info output:**
```
<source>:21:26: remark: loop not vectorized: unsafe dependent memory operations
in loop.
<source>:18:13: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
<source>:21:74: remark: SLP vectorized with cost -1 and with tree size 2
[-Rpass=slp-vectorizer]
```

**Clang assembly (key vectorized portion):**
```assembly
.LBB0_8:
        ldur    d0, [x20, #-8]
.LBB0_9:
        ldr     q1, [x19]
        scvtf   v1.2d, v1.2d
        faddp   d1, v1.2d
        fadd    d0, d0, d1
        str     d0, [x20]
```

Also reproducible on Godbolt:
https://godbolt.org/z/8cWa8Mvra

**Additional notes:**
1. **SLP vectorization is attempted but fails in GCC:** The `missed:` diagnostic
`not suitable for strided load iftmp.0_34 = .MASK_LOAD` indicates that GCC
treats the ternary operator's conditional load as a strided load that it cannot
vectorize. The SLP build for the loop then fails as a whole (`SLP build failed`).

2. **Clang succeeds:** Despite the same ternary operator and loop-carried
dependency, Clang's SLP vectorizer successfully vectorizes
`((double)a[idx].m0) + ((double)a[idx].m1)` independently. The ternary operator
does not block SLP vectorization.

3. **Pattern summary:** The missed optimization consists of:
   - Two adjacent 64-bit integer loads from the same struct
   - Conversion of both loaded values to `double`
   - Addition of the two converted values (a horizontal add across the two lanes)
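
For illustration, the three-step pattern above can be written out by hand with the GNU vector extension. This is only a sketch of the shape Clang's output suggests (`ldr q1` / `scvtf v1.2d` / `faddp`), not GCC's or Clang's internal transformation. The helper names `pair_sum_scalar` and `pair_sum_slp` are hypothetical and not part of the test case, and `__builtin_convertvector` is assumed to be available (GCC 9 or later, or Clang).

```c
#include <string.h>

typedef struct {
    long m0;
    long m1;
} element_t_0;

/* Two-lane vector types via the GNU vector extension (GCC and Clang). */
typedef long   v2di __attribute__((vector_size(2 * sizeof(long))));
typedef double v2df __attribute__((vector_size(2 * sizeof(double))));

/* Scalar form, as written in the test case. */
static double pair_sum_scalar(const element_t_0 *e) {
    return (double)e->m0 + (double)e->m1;
}

/* Hand-written two-lane form (hypothetical helper, for illustration only). */
static double pair_sum_slp(const element_t_0 *e) {
    v2di v;
    memcpy(&v, e, sizeof v);                    /* one load covering m0 and m1 */
    v2df d = __builtin_convertvector(v, v2df);  /* convert both lanes at once */
    return d[0] + d[1];                         /* horizontal add of the lanes */
}
```

Both helpers add the converted values in the same order (`m0` first, then `m1`), so they should produce identical results. The conditional load of `out[idx - 1]` stays scalar in either form and is added afterwards.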
