https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121290
Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Soumya AR from comment #0)
>
> #define iterations 100000
> #define LEN_1D 32000
>
> float a[LEN_1D];
>
> int main()
> {
>     for (int i = 0; i < LEN_1D; i++) {
>         a[i] = (i * 7) % LEN_1D;
>     }
>     float x, chksum;
>     int index;
>     for (int nl = 0; nl < iterations; nl++) {
>         x = a[0];
>         index = 0;
>         for (int i = 0; i < LEN_1D; ++i) {
>             if (a[i] > x) {
>                 x = a[i];
>                 index = i;
>             }
>         }
>         chksum = x + (float) index;
>     }
>
>     return index + x > 1;
> }
>
> Now:
>
> .L4:
>         movi    v23.4s, 0
>         mov     v24.16b, v26.16b
>         mov     x0, x3
>         mov     v22.16b, v23.16b
> .L3:
>         ld1r    {v1.4s}, [x0], 4
>         fcmgt   v20.4s, v1.4s, v24.4s
>         bit     v23.16b, v22.16b, v20.16b
>         bsl     v20.16b, v1.16b, v24.16b
>         add     v22.4s, v22.4s, v25.4s
>         mov     v24.16b, v20.16b
>         cmp     x1, x0
>         bne     .L3
>         add     w2, w2, 1
>         cmp     w2, w4
>         bne     .L4
>         dup     s23, v23.s[3]
>         dup     s20, v20.s[3]
>         fmov    s21, 1.0e+0
>         scvtf   s0, s23
>         fadd    s20, s0, s20
>         fcmpe   s20, s21
>         cset    w0, gt
>         ret
>

At least this one seems to be costing-related. The vectorized code can't
possibly be profitable: it's performing scalar calculations as vector
operations by duplicating the single scalar into every lane.

Other sequences like

>         movi    v23.4s, 0
>         mov     v22.16b, v23.16b
>         bit     v23.16b, v22.16b, v20.16b

are quite silly, as this is essentially a move of v20.16b.

But yeah, this code should definitely be slower than scalar, as it *is* scalar.

> Before:
> .L6:
>         fmov    s25, s1
>         movi    v26.2d, #0
>         mov     x0, 0
> .L5:
>         ldr     s0, [x1, x0, lsl 2]
>         fcmpe   s25, s0
>         bmi     .L7
> .L3:
>         add     x0, x0, 1
>         cmp     x0, x2
>         bne     .L5
>         subs    w3, w3, #1
>         bne     .L6
>         scvtf   s26, s26
>         fmov    s24, 1.0e+0
>         fadd    s26, s26, s25
>         fcmpe   s26, s24
>         cset    w0, gt
>         ret
> .L7:
>         fmov    s26, w0
>         fmov    s25, s0
>         b       .L3
>
>
> Thanks,
> Soumya
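
To illustrate the "it *is* scalar" point, here is a rough C model of what the quoted .L3 loop appears to compute, next to the plain scalar loop. This is only a sketch under assumptions: v25 is taken to hold 1 in every lane, v26 to hold a[0] broadcast, and x1 to point one past the end of the array, none of which is visible in the quoted snippet. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* a .4s vector has four 32-bit lanes */

/* Scalar reference: the inner loop of the test case. */
int argmax_scalar(const float *a, size_t n, float *max_out)
{
    assert(n > 0);
    float x = a[0];
    int index = 0;
    for (size_t i = 0; i < n; ++i) {
        if (a[i] > x) {
            x = a[i];
            index = (int)i;
        }
    }
    *max_out = x;
    return index;
}

/* Hypothetical model of the .L3 loop: still one element per iteration,
 * but every operation is repeated across four identical lanes. */
int argmax_broadcast_model(const float *a, size_t n, float *max_out)
{
    assert(n > 0);
    float x[LANES];      /* v24: running maximum (assumed init from v26) */
    int index[LANES];    /* v23: index of the running maximum */
    int cur[LANES];      /* v22: current element index */
    for (int l = 0; l < LANES; ++l) {
        x[l] = a[0];
        index[l] = 0;
        cur[l] = 0;
    }
    for (size_t i = 0; i < n; ++i) {          /* pointer advances by 4 bytes */
        for (int l = 0; l < LANES; ++l) {
            float v = a[i];                   /* ld1r: same value in every lane */
            int gt = v > x[l];                /* fcmgt */
            index[l] = gt ? cur[l] : index[l];/* bit */
            x[l] = gt ? v : x[l];             /* bsl + mov */
            cur[l] += 1;                      /* add (assumes v25 is all ones) */
        }
    }
    *max_out = x[LANES - 1];                  /* dup s20, v20.s[3] */
    return index[LANES - 1];                  /* dup s23, v23.s[3] */
}
```

Both functions walk the array one element at a time and return the same index and maximum. The model simply does four times the work per element, which is consistent with the observation that the vectorized loop cannot beat the scalar one.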