https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122751
Bug ID: 122751
Summary: missed fusing of sign-extend and MLA
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
The following example
int foo (char *buf, int len) {
  int x;
  for (int i = 0, y = 0; i < len; i++, y = i * 10) {
    x += (int) y * (buf[i] - '0');
  }
  return x;
}
compiled with -march=armv8-a+sve produces
.L3:
ld1b z28.s, p7/z, [x0, x2]
add x2, x2, x3
sub z28.h, z28.h, #48
sxth z28.s, p6/m, z28.s
mla z30.s, p7/m, z28.s, z29.s
incw z29.s, all, mul #10
whilelo p7.s, w2, w1
b.any .L3
uaddv d31, p6, z30.s
fmov w0, s31
ret
But sign-extending the low 16 bits of every 32-bit lane is the same as
sign-extending every even 16-bit element of the vector.
As such, the sxth + mla above can be merged into smlalb in RTL,
i.e. the expected code is
.L3:
ld1b z28.s, p7/z, [x0, x2]
add x2, x2, x3
sub z28.h, z28.h, #48
smlalb z30.s, p7/m, z28.h, z29.h
incw z29.s, all, mul #10
whilelo p7.s, w2, w1
b.any .L3
uaddv d31, p6, z30.s
fmov w0, s31
ret
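The lane equivalence this relies on can be checked with a small scalar sketch. This is only a model, not SVE code: the helper names are hypothetical, and it assumes a little-endian host so that the low half of each 32-bit lane is the even-numbered 16-bit element.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sign-extend the low 16 bits of one 32-bit lane,
   modelling what sxth does for one active .s lane.  */
static int32_t sxth_lane (uint32_t lane)
{
  uint16_t lo = (uint16_t) (lane & 0xffff);
  /* Portable sign extension: flip the sign bit, then subtract it.  */
  return (int32_t) ((lo ^ 0x8000) - 0x8000);
}

/* Hypothetical helper: view lane i as two 16-bit elements and
   sign-extend the even ("bottom") one.  Assumes a little-endian host,
   where element 0 of the pair holds the low 16 bits.  */
static int32_t even_half_extended (const uint32_t *lanes, int i)
{
  int16_t halves[2];
  memcpy (halves, &lanes[i], sizeof halves);
  return (int32_t) halves[0];
}

/* Returns 1 if both views agree for every lane, 0 otherwise.  */
int views_agree (const uint32_t *lanes, int n)
{
  for (int i = 0; i < n; i++)
    if (sxth_lane (lanes[i]) != even_half_extended (lanes, i))
      return 0;
  return 1;
}
```

Calling views_agree on any array of lanes should return 1 on a little-endian host, whatever is in the upper 16 bits of each lane.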
The missed fusion becomes even more impactful when the loop is unrolled:
int foo (char *buf, int len) {
  int x;
  #pragma GCC unroll 8
  for (int i = 0, y = 0; i < len; i++, y = i * 10) {
    x += (int) y * (buf[i] - '0');
  }
  return x;
}
We generate a mess, because the predicates also need to be unpacked:
.L3:
ld1b z27.h, p7/z, [x0, x2]
punpklo p5.h, p7.b
sub z27.h, z27.h, #48
punpkhi p6.h, p7.b
sunpklo z0.s, z27.h
mov z26.d, z29.d
mla z30.s, p5/m, z0.s, z29.s
add x2, x2, x3
sunpkhi z27.s, z27.h
incw z26.s, all, mul #10
whilelo p7.h, w2, w1
add z29.s, z29.s, z28.s
mla z30.s, p6/m, z27.s, z26.s
b.any .L3
uaddv d31, p4, z30.s
fmov w0
This should have just generated smlalb + smlalt.
We can also do that in RTL, though eventually we should teach the vectorizer
about top/bottom arithmetic.
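
As a rough illustration of what top/bottom arithmetic means for a reduction like this one, here is a scalar sketch of a widening multiply-accumulate split into even ("bottom") and odd ("top") elements. It is not the architectural definition of smlalb/smlalt. The function names and the fixed WIDE_LANES count are hypothetical, and both inputs are modelled as 16-bit elements, which is an assumption of the sketch rather than something taken from the testcase.

```c
#include <stdint.h>

#define WIDE_LANES 4   /* hypothetical number of 32-bit accumulator lanes */

/* Bottom form: acc[i] += a[2*i] * b[2*i], widened to 32 bits.  */
static void mla_bottom (int32_t *acc, const int16_t *a, const int16_t *b)
{
  for (int i = 0; i < WIDE_LANES; i++)
    acc[i] += (int32_t) a[2 * i] * (int32_t) b[2 * i];
}

/* Top form: acc[i] += a[2*i+1] * b[2*i+1], widened to 32 bits.  */
static void mla_top (int32_t *acc, const int16_t *a, const int16_t *b)
{
  for (int i = 0; i < WIDE_LANES; i++)
    acc[i] += (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
}

/* Dot product of 2*WIDE_LANES elements using one bottom and one top
   step, followed by a horizontal add of the accumulator lanes.  */
int32_t dot_top_bottom (const int16_t *a, const int16_t *b)
{
  int32_t acc[WIDE_LANES] = { 0 };
  int32_t sum = 0;
  mla_bottom (acc, a, b);
  mla_top (acc, a, b);
  for (int i = 0; i < WIDE_LANES; i++)
    sum += acc[i];
  return sum;
}

/* Plain element-by-element reference for comparison.  */
int32_t dot_reference (const int16_t *a, const int16_t *b)
{
  int32_t sum = 0;
  for (int i = 0; i < 2 * WIDE_LANES; i++)
    sum += (int32_t) a[i] * (int32_t) b[i];
  return sum;
}
```

The point of the split is that neither the data nor the predicate has to be unpacked: the two steps read the same narrow input vectors and between them cover every element, so the final reduction matches the plain loop.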