https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124913
--- Comment #1 from Hunter X <bug_hunters at yeah dot net> ---
I simplified the testcase to the following:
```c
int foo_store(
    int * __restrict__ a,
    int * __restrict__ out,
    int n
) {
    for (int i = n - 1; i >= 0; i--) {
        if (a[i] != 1) {
            out[i] = a[i];
        }
    }
    return 0;
}
```
Compiled with:
```
-march=armv9-a+sve -O3 -ftree-vectorize \
-fopt-info-vec-all -fno-trapping-math \
-fvect-cost-model=unlimited
```
GCC 16.1.0 still fails to vectorize this loop.
However, if I change the loop to a forward iteration:
```c
for (int i = 0; i < n; i++)
```
the loop is successfully vectorized, and GCC generates:
```
foo_store:
        cmp     w2, 0
        ble     .L2
        mov     x3, 0
        whilelo p7.s, wzr, w2
.L3:
        ld1w    z31.s, p7/z, [x0, x3, lsl 2]
        cmpne   p7.s, p7/z, z31.s, #1
        st1w    z31.s, p7, [x1, x3, lsl 2]
        incw    x3
        whilelo p7.s, w3, w2
        b.any   .L3
.L2:
        mov     w0, 0
        ret
```
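As I read the assembly above, the forward loop is fully predicated: `whilelo` builds the loop mask, `cmpne` narrows it to the lanes where `a[i] != 1`, and `st1w` stores only those lanes. The following is a rough scalar model of that reading, not GCC output. The name `foo_store_model` and the fixed lane count `VL = 4` are my own assumptions (real SVE hardware chooses the vector length at run time), and the model glosses over the 32-bit unsigned compare done by `whilelo`.

```c
#include <stdbool.h>

enum { VL = 4 }; /* assumed lanes per vector; SVE fixes this only at run time */

/* Hypothetical scalar model of the predicated forward loop. */
int foo_store_model(const int *a, int *out, int n)
{
    /* cmp w2, 0 / ble .L2 is covered by the loop condition when n <= 0. */
    for (long base = 0; base < n; base += VL) {   /* incw x3, b.any .L3 */
        for (int k = 0; k < VL; k++) {
            bool active = base + k < n;           /* whilelo p7.s */
            int z = active ? a[base + k] : 0;     /* ld1w, p7/z zeroes inactive lanes */
            bool store = active && z != 1;        /* cmpne p7.s, p7/z, z31.s, #1 */
            if (store)
                out[base + k] = z;                /* st1w under the narrowed predicate */
        }
    }
    return 0;                                     /* mov w0, 0 */
}
```

If this reading is right, the reverse loop would presumably need the same mask but applied to lanes in descending index order, which is the part I am asking about below.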
Does this indicate a limitation in the current AArch64 SVE vectorizer when
handling reverse loops that contain predicated (masked) stores? In particular,
is the failure related to generating the predicates or masks for a reverse
iteration, or is there some other legality issue at play?
Could you clarify whether this is an intentional limitation/design choice of
the current AArch64 SVE vectorizer, or whether this should be treated as a
missed optimization that could be improved in the future?
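In case it helps with triage, here is a sketch that puts both variants in one file so their `-fopt-info-vec-all` output can be compared directly. The names `foo_store_fwd` and `check_equivalent` are mine and not part of the original testcase. Because each iteration only touches index `i` and both pointers are `__restrict__`-qualified, I would expect the two loops to produce identical results; the helper only checks that on one small input.

```c
#include <string.h>

/* Reverse iteration, as in the testcase above: not vectorized. */
int foo_store(int *__restrict__ a, int *__restrict__ out, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        if (a[i] != 1) {
            out[i] = a[i];
        }
    }
    return 0;
}

/* Forward iteration (hypothetical name): vectorized. */
int foo_store_fwd(int *__restrict__ a, int *__restrict__ out, int n)
{
    for (int i = 0; i < n; i++) {
        if (a[i] != 1) {
            out[i] = a[i];
        }
    }
    return 0;
}

/* Hypothetical sanity check: returns 1 if both variants write the same output. */
int check_equivalent(void)
{
    int a[8] = {1, 2, 1, 4, 5, 1, 7, 8};
    int out_rev[8] = {0};
    int out_fwd[8] = {0};

    foo_store(a, out_rev, 8);
    foo_store_fwd(a, out_fwd, 8);
    return memcmp(out_rev, out_fwd, sizeof out_rev) == 0;
}
```

Compiling this file with `-S` and the options listed above should show the vectorizer's decision for each function separately.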