https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124097
Bug ID: 124097
Summary: Transform if (a[i] > 0) sum += a[i] into sum += max (0, a[i]) when vectorising
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Example source:
float foo(float *a) {
  float sum = 0.;
  for (int i = 0; i < 32000; i++)
    if (a[i] > (float)0.)
      sum += a[i];
  return sum;
}
For aarch64 SVE GCC generates the loop:
.L3:
        ld1w    z28.s, p7/z, [x1]
        inch    x2
        fcmgt   p6.s, p7/z, z28.s, #0.0
        ld1w    z29.s, p7/z, [x1, #1, mul vl]
        fadd    z31.s, p6/m, z31.s, z28.s
        incb    x1, all, mul #2
        fcmgt   p6.s, p7/z, z29.s, #0.0
        fadd    z30.s, p6/m, z30.s, z29.s
        cmp     w2, w3
        bls     .L3
whereas LLVM does:
.LBB0_1:
        ldp     q3, q4, [x8, #-16]
        subs    x9, x9, #8
        add     x8, x8, #32
        fmaxnm  v3.4s, v3.4s, v0.4s
        fmaxnm  v4.4s, v4.4s, v0.4s
        fadd    v1.4s, v3.4s, v1.4s
        fadd    v2.4s, v4.4s, v2.4s
        b.ne    .LBB0_1
i.e. it transforms the loop body into sum += fmaxf (0.0f, a[i]).
GCC should do something similar, at least for AArch64 SVE, as the SVE FCMGT
operations typically have worse throughput than the FMAXNM ones because they
write a result predicate rather than a vector Z register.
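
For reference, the requested transformation can be sketched at the source level as below. This is only an illustration, not a proposed patch: the names foo_branch, foo_max and max_zero are made up for this sketch, and max_zero is a select-based stand-in meant to mirror fmaxf (0.0f, x) so that the example does not depend on libm. Signed zeros and signalling NaNs are not considered here.

```c
/* Hypothetical helper, intended to mirror fmaxf (0.0f, x): returns x when
   x > 0 and 0.0f otherwise (including when x is NaN).  */
static inline float max_zero(float x) {
  return x > 0.0f ? x : 0.0f;
}

/* Form from the report: conditional accumulation.  */
float foo_branch(const float *a) {
  float sum = 0.0f;
  for (int i = 0; i < 32000; i++)
    if (a[i] > 0.0f)
      sum += a[i];
  return sum;
}

/* Rewritten form: unconditional accumulation of max (0, a[i]).
   Elements that the branch would skip contribute 0.0f instead.  */
float foo_max(const float *a) {
  float sum = 0.0f;
  for (int i = 0; i < 32000; i++)
    sum += max_zero(a[i]);
  return sum;
}
```

Both functions add the elements in the same order, and adding 0.0f leaves a non-negative sum unchanged, so they should return the same value for ordinary inputs. The second form needs only a vector max and an unpredicated add per element, with no compare writing a predicate.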