https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Soumya AR from comment #0)
> 
> #define iterations 100000
> #define LEN_1D 32000
> 
> float a[LEN_1D];
> 
> int main()
> {
>     for (int i = 0; i < LEN_1D; i++) {
>         a[i] = (i * 7) % LEN_1D;
>     }
>     float x, chksum;
>     int index;
>     for (int nl = 0; nl < iterations; nl++) {
>         x = a[0];
>         index = 0;
>         for (int i = 0; i < LEN_1D; ++i) {
>             if (a[i] > x) {
>                 x = a[i];
>                 index = i;
>             }
>         }
>         chksum = x + (float) index;
>     }
>     
>     return index + x > 1;
> } 
> 
> Now:
> 
> .L4:
>         movi    v23.4s, 0
>         mov     v24.16b, v26.16b
>         mov     x0, x3
>         mov     v22.16b, v23.16b
> .L3:
>         ld1r    {v1.4s}, [x0], 4
>         fcmgt   v20.4s, v1.4s, v24.4s
>         bit     v23.16b, v22.16b, v20.16b
>         bsl     v20.16b, v1.16b, v24.16b
>         add     v22.4s, v22.4s, v25.4s
>         mov     v24.16b, v20.16b
>         cmp     x1, x0
>         bne     .L3
>         add     w2, w2, 1
>         cmp     w2, w4
>         bne     .L4
>         dup     s23, v23.s[3]
>         dup     s20, v20.s[3]
>         fmov    s21, 1.0e+0
>         scvtf   s0, s23
>         fadd    s20, s0, s20
>         fcmpe   s20, s21
>         cset    w0, gt
>         ret
> 

At least this one seems to be costing related. The vectorized code
can't possibly be profitable. It's performing scalar calculations
as vector operations by duplicating the single scalar into every lane.
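
To make that concrete, here is a rough C model of what the "Now" loop
appears to do, next to the plain scalar loop. It is only a sketch: the
function names and LANES are made up for illustration, and it assumes
v25 holds a step of 1 in every lane and v26 holds a[0] broadcast to
every lane, neither of which is visible in the quoted assembly.

```c
#include <assert.h>

#define LANES 4 /* hypothetical name: lanes in a .4s vector */

/* Scalar reference: index of the first maximum, as in the test case. */
int argmax_scalar(const float *a, int n, float *max_out)
{
    assert(n > 0);
    float x = a[0];
    int index = 0;
    for (int i = 0; i < n; ++i) {
        if (a[i] > x) {
            x = a[i];
            index = i;
        }
    }
    *max_out = x;
    return index;
}

/* Model of the vectorized loop: every lane sees the same element, so
   each lane repeats the scalar work and no extra elements are covered
   per iteration. */
int argmax_broadcast(const float *a, int n, float *max_out)
{
    assert(n > 0);
    float x[LANES];
    int index[LANES], iv[LANES];
    for (int l = 0; l < LANES; ++l) {
        x[l] = a[0];   /* assumed: v24 starts as a[0] in all lanes */
        index[l] = 0;  /* movi v23.4s, 0 */
        iv[l] = 0;     /* mov v22.16b, v23.16b */
    }
    for (int i = 0; i < n; ++i) {      /* one element per iteration */
        for (int l = 0; l < LANES; ++l) {
            float v = a[i];            /* ld1r: same value in every lane */
            int gt = v > x[l];         /* fcmgt */
            if (gt) index[l] = iv[l];  /* bit */
            if (gt) x[l] = v;          /* bsl */
            iv[l] += 1;                /* add, assuming a step of 1 */
        }
    }
    *max_out = x[LANES - 1];           /* dup from lane 3 */
    return index[LANES - 1];
}
```

Both functions should return the same index and maximum for any
non-empty input, but the second one does LANES times the arithmetic
for the same number of loop iterations.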

Other operations like 

>         movi    v23.4s, 0
>         mov     v22.16b, v23.16b
>         bit     v23.16b, v22.16b, v20.16b

are quite silly, as the sequence is essentially a move of v20.16b.
But yeah, this code should definitely be slower than scalar, as it *is* scalar.

> Before:
> .L6:
>         fmov    s25, s1
>         movi    v26.2d, #0
>         mov     x0, 0
> .L5:
>         ldr     s0, [x1, x0, lsl 2]
>         fcmpe   s25, s0
>         bmi     .L7
> .L3:
>         add     x0, x0, 1
>         cmp     x0, x2
>         bne     .L5
>         subs    w3, w3, #1
>         bne     .L6
>         scvtf   s26, s26
>         fmov    s24, 1.0e+0
>         fadd    s26, s26, s25
>         fcmpe   s26, s24
>         cset    w0, gt
>         ret
> .L7:
>         fmov    s26, w0
>         fmov    s25, s0
>         b       .L3
> 
> 
> Thanks,
> Soumya
