https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290

--- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Soumya AR from comment #13)
> Hi Tamar,
> 
> Thanks for the fix.
> 
> This has now brought back performance for the mentioned kernels with -Ofast
> but is now regressing with -O3 ...
> 
> Is this something you're still looking at? Just wanted to put it up here
> anyway.
> 
> Example, for the inner loop in s314:
> 
> #define iterations 100000
> #define LEN_1D 32000
> 
> float a[LEN_1D];
> 
> int main()
> {
>     for (int i = 0; i < LEN_1D; i++) {
>         a[i] = i;
>     }
>     
>     float x;
>     for (int nl = 0; nl < iterations*5; nl++) {
>         x = a[0];
>         for (int i = 0; i < LEN_1D; i++) {
>             if (a[i] > x) {
>                 x = a[i];
>             }
>         }
>     }
>     
>     return x;
> } 
> 
> Now:
> 
> .L3:
>       ldr     s25, [x0], 4
>       fcmpe   s25, s26
>       fcsel   s26, s25, s26, gt
>       cmp     x0, x1
>       bne     .L3
>       subs    w2, w2, #1
>       bne     .L4
>       fcvtzs  w0, s26
>       ret
> 
> 
> Before:
> 
> .L3:
>       ld1r    {v25.4s}, [x0], 4
>       fcmgt   v24.4s, v25.4s, v26.4s
>       bsl     v24.16b, v25.16b, v26.16b
>       mov     v26.16b, v24.16b
>       cmp     x1, x0
>       bne     .L3
>       add     w2, w2, 1
>       cmp     w2, w4
>       bne     .L4
>       dup     s24, v24.s[3]
>       fcvtzs  w0, s24
>       ret
> 
> 
> Looks like we don't vectorize the inner loop at all now.

Yeah we shouldn't it should do outer-loop vect like it did in GCC 15,
interestingly trunk seems to segfault in vect on this loop. so taking a look at
that first https://godbolt.org/z/Wj5GWox5s

Reply via email to