https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124097

            Bug ID: 124097
           Summary: Transform if (a[i] > 0) sum += a[i] into sum += max
                    (0, a[i]) when vectorising
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

Example source:
float foo(float *a) {
  float sum = 0.;
  for (int i = 0; i < 32000; i++)
    if (a[i] > (float)0.)
      sum += a[i];
  return sum;
}


For AArch64 SVE, GCC generates the following loop:
.L3:
        ld1w    z28.s, p7/z, [x1]
        inch    x2
        fcmgt   p6.s, p7/z, z28.s, #0.0
        ld1w    z29.s, p7/z, [x1, #1, mul vl]
        fadd    z31.s, p6/m, z31.s, z28.s
        incb    x1, all, mul #2
        fcmgt   p6.s, p7/z, z29.s, #0.0
        fadd    z30.s, p6/m, z30.s, z29.s
        cmp     w2, w3
        bls     .L3

whereas LLVM generates:
.LBB0_1:
        ldp     q3, q4, [x8, #-16]
        subs    x9, x9, #8
        add     x8, x8, #32
        fmaxnm  v3.4s, v3.4s, v0.4s
        fmaxnm  v4.4s, v4.4s, v0.4s
        fadd    v1.4s, v3.4s, v1.4s
        fadd    v2.4s, v4.4s, v2.4s
        b.ne    .LBB0_1

i.e. LLVM transforms the loop body into sum += fmaxf (0.0f, a[i]).
GCC should do something similar, at least for AArch64 SVE, because the SVE
FCMGT operations typically have worse throughput than the FMAXNM ones: they
write a result predicate rather than a vector Z register.
