https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123870

--- Comment #25 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Ilya Kurdyukov from comment #24)
> I apologize that a bug opened on one topic turned into a bunch of reports,
> but I want to ask another question.
> 
> Why does GCC insert a copy instruction for vwaddu when vd = vs2? Does this
> for both RVV 1.0 and xtheadvector.
> 
>       vmv1r.v v3,v1
>       vwaddu.wv       v1,v3,v2
> 
> Is using the same register prohibited?
> 
>       vwaddu.wv       v1,v1,v2
> 
> Is this a missed optimization or a hardware limitation?
> 
> Example:
> 
> #include <riscv_vector.h>
> 
> void test(uint8_t *src, uint16_t *dst) {
>       vuint16m1_t vsum = __riscv_vmv_v_x_u16m1(0, 8);
>       for (int i = 0; i < 8; i++) {
>               vuint8mf2_t h0 = __riscv_vle8_v_u8mf2(src + i * 8, 8);
>               vsum = __riscv_vwaddu_wv_u16m1(vsum, h0, 8);
>       }
>       __riscv_vse16_v_u16m1(dst, vsum, 8);
> }
> 
> $ gcc-16 -march=rv64gcv test.c -O2 -S
> 
> ...
> 
>       vsetivli        zero,8,e16,m1,ta,ma
>       vmv.v.i v1,0
>       addi    a5,a0,64
>       vsetvli zero,zero,e8,mf2,ta,ma
> .L2:
>       vle8.v  v2,0(a0)
>       vmv1r.v v3,v1
>       addi    a0,a0,8
>       vwaddu.wv       v1,v3,v2
>       bne     a0,a5,.L2
>       vse16.v v1,0(a1)
>       ret

See
https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc#52-vector-operands


A destination vector register group can overlap a source vector register group
only if one of the following holds:

-    The destination EEW equals the source EEW.

-    The destination EEW is smaller than the source EEW and the overlap is in
the lowest-numbered part of the source register group (e.g., when LMUL=1,
vnsrl.wi v0, v0, 3 is legal, but a destination of v1 is not).

-    The destination EEW is greater than the source EEW, the source EMUL is at
least 1, and the overlap is in the highest-numbered part of the destination
register group (e.g., when LMUL=8, vzext.vf4 v0, v6 is legal, but a source of
v0, v2, or v4 is not).

We have a missed optimization in that space (the overlap for LMUL > 1 we don't
yet allow) but your case is not permitted.  If you used LMUL=1 for the source,
which might be advantageous if the loop is load bound, it would be (but a
missed optimization).

Reply via email to