https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
--- Comment #14 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Soumya AR from comment #13)
> Hi Tamar,
>
> Thanks for the fix.
>
> This has now brought back performance for the mentioned kernels with -Ofast
> but is now regressing with -O3 ...
>
> Is this something you're still looking at? Just wanted to put it up here
> anyway.
>
> Example, for the inner loop in s314:
>
> #define iterations 100000
> #define LEN_1D 32000
>
> float a[LEN_1D];
>
> int main()
> {
>     for (int i = 0; i < LEN_1D; i++) {
>         a[i] = i;
>     }
>
>     float x;
>     for (int nl = 0; nl < iterations*5; nl++) {
>         x = a[0];
>         for (int i = 0; i < LEN_1D; i++) {
>             if (a[i] > x) {
>                 x = a[i];
>             }
>         }
>     }
>
>     return x;
> }
>
> Now:
>
> .L3:
>         ldr     s25, [x0], 4
>         fcmpe   s25, s26
>         fcsel   s26, s25, s26, gt
>         cmp     x0, x1
>         bne     .L3
>         subs    w2, w2, #1
>         bne     .L4
>         fcvtzs  w0, s26
>         ret
>
>
> Before:
>
> .L3:
>         ld1r    {v25.4s}, [x0], 4
>         fcmgt   v24.4s, v25.4s, v26.4s
>         bsl     v24.16b, v25.16b, v26.16b
>         mov     v26.16b, v24.16b
>         cmp     x1, x0
>         bne     .L3
>         add     w2, w2, 1
>         cmp     w2, w4
>         bne     .L4
>         dup     s24, v24.s[3]
>         fcvtzs  w0, s24
>         ret
>
>
> Looks like we don't vectorize the inner loop at all now.

Yeah, we shouldn't vectorize the inner loop; it should do outer-loop vect like
it did in GCC 15. Interestingly, trunk seems to segfault in vect on this loop,
so I'm taking a look at that first.

https://godbolt.org/z/Wj5GWox5s
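
For anyone following along, here is a rough scalar model of what outer-loop vect means for this kernel. It is only an illustrative sketch, not what the vectorizer actually emits: `VF`, `fill_a`, `max_scalar` and `max_outer_vect` are hypothetical names, and the model assumes the outer trip count is a positive multiple of `VF`. Each lane carries one outer (nl) iteration, so the inner loop still reads a[] one element at a time and broadcasts it to all lanes, which matches the ld1r / fcmgt / bsl sequence in the "Before" code.

```c
#define LEN_1D 32000
#define VF 4   /* assumed vectorization factor: 4 floats per 128-bit register */

static float a[LEN_1D];

/* Same initialisation as the testcase: a[i] = i. */
static void fill_a(void)
{
    for (int i = 0; i < LEN_1D; i++)
        a[i] = (float)i;
}

/* Scalar reference: one outer iteration of the s314 max reduction. */
static float max_scalar(void)
{
    float x = a[0];
    for (int i = 0; i < LEN_1D; i++)
        if (a[i] > x)
            x = a[i];
    return x;
}

/* Hypothetical model of outer-loop vectorization.  Lane l holds the state
   of outer iteration nl + l.  Assumes outer_iters is a positive multiple
   of VF, so no epilogue loop is modelled. */
static float max_outer_vect(int outer_iters)
{
    float x[VF] = { 0 };

    for (int nl = 0; nl + VF <= outer_iters; nl += VF) {
        for (int l = 0; l < VF; l++)
            x[l] = a[0];                /* x = a[0] in every lane */

        for (int i = 0; i < LEN_1D; i++) {
            float v = a[i];             /* one element, broadcast to all lanes */
            for (int l = 0; l < VF; l++)
                if (v > x[l])           /* per-lane compare and select */
                    x[l] = v;
        }
    }

    /* The result of the last outer iteration lives in the last lane,
       like the dup from v24.s[3] in the "Before" code. */
    return x[VF - 1];
}
```

Since every outer iteration computes the same reduction, all lanes end up with the same value, so the model only shows where the lanes come from, not why it is profitable.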