https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123870
--- Comment #25 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Ilya Kurdyukov from comment #24)
> I apologize that a bug opened on one topic turned into a bunch of reports,
> but I want to ask another question.
>
> Why does GCC insert a copy instruction for vwaddu when vd = vs2? Does this
> for both RVV 1.0 and xtheadvector.
>
> vmv1r.v v3,v1
> vwaddu.wv v1,v3,v2
>
> Is using the same register prohibited?
>
> vwaddu.wv v1,v1,v2
>
> Is this a missed optimization or a hardware limitation?
>
> Example:
>
> #include <riscv_vector.h>
>
> void test(uint8_t *src, uint16_t *dst) {
> vuint16m1_t vsum = __riscv_vmv_v_x_u16m1(0, 8);
> for (int i = 0; i < 8; i++) {
> vuint8mf2_t h0 = __riscv_vle8_v_u8mf2(src + i * 8, 8);
> vsum = __riscv_vwaddu_wv_u16m1(vsum, h0, 8);
> }
> __riscv_vse16_v_u16m1(dst, vsum, 8);
> }
>
> $ gcc-16 -march=rv64gcv test.c -O2 -S
>
> ...
>
> vsetivli zero,8,e16,m1,ta,ma
> vmv.v.i v1,0
> addi a5,a0,64
> vsetvli zero,zero,e8,mf2,ta,ma
> .L2:
> vle8.v v2,0(a0)
> vmv1r.v v3,v1
> addi a0,a0,8
> vwaddu.wv v1,v3,v2
> bne a0,a5,.L2
> vse16.v v1,0(a1)
> ret
See
https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc#52-vector-operands
A destination vector register group can overlap a source vector register group
only if one of the following holds:
- The destination EEW equals the source EEW.
- The destination EEW is smaller than the source EEW and the overlap is in
the lowest-numbered part of the source register group (e.g., when LMUL=1,
vnsrl.wi v0, v0, 3 is legal, but a destination of v1 is not).
- The destination EEW is greater than the source EEW, the source EMUL is at
least 1, and the overlap is in the highest-numbered part of the destination
register group (e.g., when LMUL=8, vzext.vf4 v0, v6 is legal, but a source of
v0, v2, or v4 is not).
We have a missed optimization in that space (the overlap for LMUL > 1 we don't
yet allow) but your case is not permitted. If you used LMUL=1 for the source,
which might be advantageous if the loop is load bound, it would be (but a
missed optimization).