https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125248
Bug ID: 125248
Summary: GCC generates unreachable SVE vector code for
prefix-sum loop with distance-1 dependence — dead code
from alias check provably foldable at compile time
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
GCC trunk generates a vectorized SVE path for a prefix-sum (scan) loop whose
inter-iteration dependence distance is statically 1. At runtime, the alias
check guarding the vector path **always fails**, because the dependence distance
(1 element) is smaller than the lane count of any SVE vector (vscale × 4 ≥ 4
lanes for 32-bit elements). The vector code is therefore never executed: it is
dead code that increases binary size and wastes I-cache, and the alias check
itself is redundant work on every inner loop entry.
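To illustrate why the guard has to reject the vector path, here is a minimal C sketch that models one vector iteration as "load a whole chunk, then store a whole chunk". It is not GCC's generated code: `LANES`, `scan_scalar`, `scan_chunked` and the `add` array (standing in for the converted `a[idx].m0` values) are hypothetical names, and the chunk width of 4 assumes vscale = 1.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };  /* assumption: 32-bit lanes per vector at vscale = 1 */

/* Scalar reference: each iteration reads the value the previous one stored. */
static void scan_scalar(unsigned *out, const unsigned *add, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (i >= 1 ? out[i - 1] : 0u) + add[i] + 4u;
}

/* Model of a vector iteration: every load of a chunk happens before any
   of its stores, so lanes 1..LANES-1 read stale out[] values.  */
static void scan_chunked(unsigned *out, const unsigned *add, size_t n)
{
    assert(out != NULL && add != NULL);
    for (size_t base = 0; base < n; base += LANES) {
        unsigned prev[LANES];
        size_t len = n - base < LANES ? n - base : LANES;
        for (size_t l = 0; l < len; l++)   /* "vector load" of out[idx - 1] */
            prev[l] = (base + l >= 1) ? out[base + l - 1] : 0u;
        for (size_t l = 0; l < len; l++)   /* "vector store" to out[idx] */
            out[base + l] = prev[l] + add[base + l] + 4u;
    }
}
```

With zero-initialized output and `add[i] == 1`, the scalar version accumulates across iterations while the chunked model does not, which is the hazard the runtime check is guarding against.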
**Test case:**
```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    double m0;
} element_t_0;
unsigned int foo(
    const element_t_0 *__restrict__ a,
    unsigned int *__restrict__ out,
    int n, int m)
{
    for (int i = 0; i < n; i += 1)
    {
        for (int j = 0; j < m; j += 1)
        {
            int idx = i * m + j;
            out[idx] = ((idx) >= 1 ? out[(idx) - 1] : (unsigned int)0)
                       + (((unsigned int)a[idx].m0))
                       + (unsigned int)(4);
        }
    }
    return (unsigned int)0;
}
```
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```
**Compilation options:**
```
-O3 -march=armv9-a+sve -ftree-vectorize -fopt-info-vec-all -fno-trapping-math
```
**GCC trunk output:**
```
<source>:14:27: optimized: loop vectorized using variable length vectors
<source>:14:27: optimized: loop versioned for vectorization because of possible aliasing
<source>:7:14: note: vectorized 1 loops in function.
<source>:12:23: missed: couldn't vectorize loop
<source>:12:23: missed: splitting region at dominance boundary bb36
<source>:14:27: note: ***** Analysis failed with vector mode VOID
<source>:22:12: note: ***** Analysis failed with vector mode VOID
```
**Reachability analysis of the alias check:**
The alias check computes:
```
x8 = x1 - 4              ; computed before the loop
whilewr p14.s, x8, x1    ; check: (x1 - x8) / 4 >= VL?
                         ;  i.e.: 4 / 4 >= vscale * 4?
                         ;  i.e.: 1 >= vscale * 4?
b.nlast .L8              ; if false → scalar loop
Since `vscale ≥ 1` on all SVE implementations, `vscale × 4 ≥ 4`, so
`1 ≥ vscale × 4` is **always false**. The branch to the scalar loop is
**always taken**, and the entire vector path (`.L4`, ~20 SVE instructions
including predicated loads/stores, converts, uzp1, sel) is unreachable.
Confirmed on Godbolt: https://godbolt.org/z/T9eEPYEPj
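The arithmetic above can be checked with a small C model. This is a sketch under stated assumptions, not the architectural pseudocode: `active_lanes` is a simplified reading of `whilewr` for 32-bit lanes (a lane is active when the second operand is not ahead of the first, or when the lane index is below their distance in elements), and all function names are hypothetical. The vscale range 1..16 corresponds to SVE vector lengths of 128 to 2048 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (assumption) of "whilewr pd.s, first, second":
   returns how many leading lanes are active.  */
static unsigned active_lanes(intptr_t first, intptr_t second, unsigned lanes)
{
    assert(lanes > 0);
    intptr_t diff = (second - first) / (intptr_t)sizeof(uint32_t);
    if (diff <= 0 || diff >= (intptr_t)lanes)
        return lanes;            /* no conflict within one vector */
    return (unsigned)diff;       /* only the first diff lanes are safe */
}

/* b.nlast falls through to the vector loop only when the last lane is
   active, which in this model means every lane is active.  */
static bool vector_path_taken(intptr_t first, intptr_t second, unsigned vscale)
{
    unsigned lanes = 4 * vscale; /* 32-bit lanes per SVE vector */
    return active_lanes(first, second, lanes) == lanes;
}

/* Tries every legal vscale with the operands from the report
   (x8 = x1 - 4) and reports whether any reaches the vector loop.  */
static bool any_vscale_takes_vector_path(intptr_t x1)
{
    for (unsigned vscale = 1; vscale <= 16; vscale++)
        if (vector_path_taken(x1 - 4, x1, vscale))
            return true;
    return false;
}
```

In this model `any_vscale_takes_vector_path` is false for any `x1`, since the distance of one element never reaches a lane count of at least four; a distance of four or more elements would pass at vscale = 1.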
**Clang 22.1.4 (for comparison):**
Clang output:
```
<source>:17:22: remark: loop not vectorized: unsafe dependent memory operations
in loop. Use #pragma clang loop distribute(enable) to allow loop distribution
to attempt to isolate the offending operations into a separate loop
Backward loop carried data dependence. Memory location is the same as accessed
at example.c:17:38 [-Rpass-analysis=loop-vectorize]
   17 |             out[idx] = ((idx) >= 1 ? out[(idx) - 1] : (unsigned int)0)
      |                      ^
<source>:14:9: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
   14 |         for (int j = 0; j < m; j += 1)
...
```
Clang correctly identifies the loop-carried dependence (distance 1) and
conservatively refuses to vectorize, emitting only scalar code. No dead vector
path is generated.
Also reproducible on Godbolt: https://godbolt.org/z/3o6oEY3v1
**Impact:**
1. **Dead code emission**: ~20 unreachable SVE instructions per loop nest.
2. **Runtime overhead**: `whilewr` + `b.nlast` executed on every inner loop
entry despite the compile-time-provable outcome.
3. **I-cache pressure**: Dead vector code occupies cache lines.
**Additional notes:**
- The inner loop is a prefix-sum (scan) pattern. True auto-vectorization on SVE
would require intra-vector horizontal propagation (log2(vl) steps) plus
cross-chunk carry-forward.
- This behavior was confirmed by LLVM developers reviewing a related Clang
issue, who noted the GCC vector path is unreachable.
- Without SVE (`-march=armv9-a`), no vectorization is attempted — correct
behavior.
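For reference, the scan structure mentioned in the first note can be sketched in plain C. This is only an illustration of the idea, not a proposal for what GCC should emit: `VL` is a hypothetical fixed lane count (real SVE code would use vscale × 4), `scan_blocked` is a made-up name, and `x[i]` stands for the per-element addend `(unsigned int)a[i].m0 + 4`. Because `idx = i * m + j` walks `out` contiguously, the loop nest behaves as one flat scan over `n * m` elements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { VL = 4 };  /* assumption: fixed lane count standing in for vscale * 4 */

/* Inclusive prefix sum of x[0..n) into out, VL lanes at a time. */
static void scan_blocked(unsigned *out, const unsigned *x, size_t n)
{
    assert(out != NULL && x != NULL);
    unsigned carry = 0;                      /* total of all earlier chunks */
    for (size_t base = 0; base < n; base += VL) {
        size_t len = n - base < VL ? n - base : VL;
        unsigned v[VL] = {0}, t[VL];
        memcpy(v, x + base, len * sizeof *v);
        /* Intra-vector propagation: log2(VL) shift-and-add steps. */
        for (size_t shift = 1; shift < VL; shift *= 2) {
            for (size_t l = 0; l < VL; l++)
                t[l] = v[l] + (l >= shift ? v[l - shift] : 0u);
            memcpy(v, t, sizeof v);
        }
        for (size_t l = 0; l < len; l++)
            out[base + l] = v[l] + carry;    /* cross-chunk carry-forward */
        carry = out[base + len - 1];
    }
}
```

The reassociation is legal here because unsigned arithmetic wraps, so the blocked result matches the scalar loop exactly.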