https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122221
Bug ID: 122221
Summary: riscv: reassoc spoils widening opportunity
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rdapp at gcc dot gnu.org
Target Milestone: ---
Target: riscv
Split off from PR121959.
For the following code
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;
void
lul (int *restrict res, uint8_t *restrict a, uint8_t *restrict b, int n)
{
for (int i = 0; i < n; i++)
res[i] = (a[i] - b[i]) + 1;
}
we generate with -march=rv64gcv (eliding scalar code):
vsetvli a5,a3,e16,mf2,ta,ma
vle8.v v2,0(a1)
vle8.v v3,0(a2)
vzext.vf2 v1,v2
vadd.vi v1,v1,1
vsetvli zero,zero,e8,mf4,ta,ma
vwsubu.wv v2,v1,v3
vsetvli zero,zero,e32,m1,ta,ma
vsext.vf2 v1,v2
vse32.v v1,0(a0)
(7 vector insns without vsetvl)
Ideally we'd have:
vmv.v.i v4,1 # before the loop
vsetvli a5,a3,e16,mf2,ta,ma
vle8.v v2,0(a1)
vle8.v v3,0(a2)
vwsubu.vv v1,v2,v3
vsetvli zero,zero,e32,m1,ta,ma
vwadd.vv v1,v1,v4 # Could also be vwadd.vx v1,v1,t0
vse32.v v1,0(a0)
(5 vector insns without vsetvl)
The culprit is the add that could be a widening add. Widening adds usually
have lower throughput, though, as they still need to do the same amount of
work. So it's not really clear-cut but IMHO worth exploring.
In .vect
vect__3.8_129 = .MASK_LEN_LOAD (vectp_a.6_126, 8B, { -1, ... }, _128(D), _150, 0);
vect_patt_13.9_130 = (vector([4,4]) unsigned short) vect__3.8_129;
vect__6.12_135 = .MASK_LEN_LOAD (vectp_b.10_132, 8B, { -1, ... }, _134(D), _150, 0);
vect_patt_47.13_136 = (vector([4,4]) unsigned short) vect__6.12_135;
vect_patt_49.14_137 = vect_patt_13.9_130 - vect_patt_47.13_136;
vect_patt_50.15_138 = VIEW_CONVERT_EXPR<vector([4,4]) signed short>(vect_patt_49.14_137);
vect_patt_53.17_140 = vect_patt_49.14_137 + { 1, ... };
vect_patt_54.18_141 = VIEW_CONVERT_EXPR<vector([4,4]) signed short>(vect_patt_53.17_140);
vect_patt_55.19_142 = (vector([4,4]) int) vect_patt_54.18_141;
we still add 1 to the result of the subtraction.
During reassoc we break up the subtraction into a + -b and
reorder a + -b + 1 into a + 1 + -b.
This then results in
aa = add_widen_both (a, 1)
diff = sub_widen_single (aa, b)
extend (diff)
which basically wastes the widening operations by using them on
two individual operands instead of two steps.
Not sure how to prevent reassoc from doing that, short of recognizing
a widening subtract right away. IIRC we don't currently support that
because the widen-arithmetic IFNs expect even/odd or lo/hi operand
layouts rather than our direct scheme.
Without reassoc we have:
vsetvli a5,a3,e8,mf4,ta,ma
vle8.v v2,0(a2)
vle8.v v3,0(a1)
vwsubu.vv v1,v3,v2
vsetvli zero,zero,e16,mf2,ta,ma
vadd.vi v1,v1,1
vsetvli zero,zero,e32,m1,ta,ma
vsext.vf2 v2,v1
vse32.v v2,0(a0)
Still, we do not transform vadd + vsext/vzext into vwadd (only vsext/vzext +
vadd). Maybe we should? What complicates things slightly is that we don't
have an immediate variant of the widening insn, so we would need to splat the
constant and use the vector-vector one.