https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277
--- Comment #5 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> but now we do, and single-lane RVVM1SI looks better from your
> cost model. I believe that before my patch we never tried this
> because there's a conversion around the reduction and
Having another quick look, I think the V4QI/VLS loop code is worse.
Its costs are lower, but they are for VF=1 instead of VF=4, so per scalar
iteration it is more expensive and the VLA loop is, of course, preferred.
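The VF point can be made concrete with a minimal C sketch. It assumes that candidates are compared by body cost per scalar iteration (body_cost / VF). The struct, the function name and the cost numbers are hypothetical, and GCC's actual loop-comparison logic is more involved than this.

```c
#include <assert.h>

/* Hypothetical candidate loop: cost of one vector iteration and how many
   scalar iterations (the vectorization factor) that iteration covers. */
struct loop_candidate {
    const char *name;
    unsigned body_cost; /* cost of one vector-loop iteration */
    unsigned vf;        /* scalar iterations per vector iteration */
};

/* Returns nonzero if A is cheaper than B per scalar iteration, i.e.
   a.body_cost / a.vf < b.body_cost / b.vf.  Cross-multiplying avoids
   integer-division rounding. */
static int cheaper_per_scalar_iter(struct loop_candidate a,
                                   struct loop_candidate b)
{
    assert(a.vf > 0 && b.vf > 0);
    return (unsigned long)a.body_cost * b.vf
           < (unsigned long)b.body_cost * a.vf;
}
```

Consider hypothetical costs of 10 for a VLS body at VF=1 and 24 for a VLA body at VF=4. The VLS body is cheaper per vector iteration, but it costs 10 per scalar iteration versus 6 for the VLA loop.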
When forcing the vector mode (--param=riscv-autovec-mode=V4QI) I can see
several broadcast "permutes" like:
vect__59.11_427 = {_428, 0, 0, 0};
vect__59.12_425 = VEC_PERM_EXPR <vect__59.11_427, vect__59.11_427, { 0, 0, 0, 0 }>;
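The following plain C sketch shows what such a permute computes. It models VEC_PERM_EXPR <v0, v1, sel> on 4-lane vectors, where each result lane takes lane sel[i] of the concatenation v0:v1. It assumes QImode (8-bit) lanes as in V4QI and selector values interpreted modulo 2 * nunits. The helper name is hypothetical. With both operands equal to {x, 0, 0, 0} and selector {0, 0, 0, 0}, the result is the broadcast {x, x, x, x}.

```c
#include <stdint.h>

#define NUNITS 4 /* V4QI: four 8-bit lanes */

/* Hypothetical model of VEC_PERM_EXPR <v0, v1, sel>: lane i of OUT is lane
   sel[i] of the 8-lane concatenation v0:v1, with selector values reduced
   modulo 2 * NUNITS. */
static void vec_perm4(const int8_t v0[NUNITS], const int8_t v1[NUNITS],
                      const unsigned sel[NUNITS], int8_t out[NUNITS])
{
    for (int i = 0; i < NUNITS; i++) {
        unsigned idx = sel[i] % (2 * NUNITS);
        out[i] = idx < NUNITS ? v0[idx] : v1[idx - NUNITS];
    }
}
```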
We don't handle those in the perm_const hook yet. So the costing is surely not
ideal, but if anything it errs in the wrong direction ;)
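Handling these permutes would first require recognizing a constant selector as a broadcast. The C sketch below does that. It is a hypothetical helper, not GCC's RISC-V code and not the signature of the TARGET_VECTORIZE_VEC_PERM_CONST hook. When both permute operands are the same vector, as in the dump above, lanes k and k + nunits hold the same value, so the selector is compared modulo nunits.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if the constant selector SEL (NUNITS
   lanes) picks the same source lane for every result lane, i.e. the permute
   is a broadcast of that lane, and stores the lane index in *LANE.
   SAME_OPERANDS says whether both permute inputs are the same vector. */
static bool is_broadcast_selector(const unsigned *sel, unsigned nunits,
                                  bool same_operands, unsigned *lane)
{
    if (nunits == 0)
        return false;
    /* With identical operands, lanes k and k + nunits are the same value. */
    unsigned mod = same_operands ? nunits : 2 * nunits;
    unsigned first = sel[0] % mod;
    for (unsigned i = 1; i < nunits; i++)
        if (sel[i] % mod != first)
            return false;
    *lane = first;
    return true;
}
```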
And, indeed, there is no unrolling, just a single iteration. The unrolling
happens later, explicitly.
There are also a number of other permutes that weren't there before.
Need to have a closer look tomorrow.