Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-12-04 Thread Maciej W. Rozycki
On Wed, 8 Nov 2023, Kito Cheng wrote:

> OK, then LGTM, thanks for the explanation :)

 Please don't top-post on a GCC mailing list (and preferably not in 
off-list replies to such mailing list messages either, unless it's been 
agreed somehow with the participants), as it makes it difficult to reply 
in context.

 Best practice is to reply inline, quoting the relevant original paragraph 
(or enough context) being referred to, and discarding all the other parts 
of the message replied to.  We may even have it written down somewhere 
(though I haven't checked; in the old days it used to be assumed), and I 
do hope any sane modern MUA can handle it.

 Otherwise the discussion thread quickly grows into an illegible mess.

 So this change does indeed fix PR 112092, however we now have an issue 
with several other test cases and the new `-mmovcc' option.  For example, 
vsetvl-13.c fails with the "-mmovcc -mbranch-cost=8" test options, and 
the assembly produced looks like this:

        vsetvli a6,a6,e8,mf4,ta,ma
        snez    a5,a5
        neg     a5,a5
        and     a6,a5,a6
        not     a5,a5
        andi    a5,a5,55
        or      a5,a6,a5
        beq     a4,zero,.L10
        li      a6,0
        vsetvli zero,a5,e32,m1,tu,ma
.L4:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a6,a6,1
        bne     a4,a6,.L4
.L10:
        ret

As far as I can tell the code produced is legitimate, and for the record 
analogous assembly is produced with `-march=rv32gcv_zicond' too:

        vsetvli a6,a6,e8,mf4,ta,ma
        czero.eqz       a6,a6,a5
        li      a7,55
        czero.nez       a5,a7,a5
        or      a5,a5,a6
        beq     a4,zero,.L10
        li      a6,0
        vsetvli zero,a5,e32,m1,tu,ma
.L4:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a6,a6,1
        bne     a4,a6,.L4
.L10:
        ret

-- it's just that you can't see it with regression testing, because the 
test case overrides `-march='.  Presumably we do want to execute VSETVLI 
twice here, on the basis that avoiding the second one by means of 
branches would be more costly than just executing it unconditionally.

 Shall we just silence false failures like this with `-mno-movcc' then, 
or shall we handle the conditional-move case somehow?

 For reference, the plain branched assembly is like:

        li      a7,55
        beq     a5,zero,.L13
        vsetvli zero,a6,e32,m1,tu,ma
.L2:
        beq     a4,zero,.L11
        li      a5,0
.L4:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L4
.L11:
        ret
.L13:
        vsetvli zero,a7,e32,m1,tu,ma
        j       .L2

  Maciej


Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-11-07 Thread Kito Cheng
On Wed, Nov 8, 2023 at 2:37 PM juzhe.zh...@rivai.ai
 wrote:
>
> Another question occurred to me.
>
> Is it necessary for us to have so many variants of vsetvl?
>
> I am thinking about a redesign:
>
> __riscv_vsetvl_e8mf8
> __riscv_vsetvl_e16mf4
> __riscv_vsetvl_e32mf2
> __riscv_vsetvl_e64m1
>
> These are quite redundant: they all produce the same result.
>
> Maybe just design it as:
>
> __riscv_vsetvl_ratio64
>
> I am not proposing it, since the interface has been used for a long
> time.  Just raising my concern.

Yeah, I agree those variants have the same behavior, even the same
semantics, in the current intrinsic model.  One reason is that we didn't
have a smart vsetvli insertion pass at the design stage; it is also more
obvious to the user how to pick the right vsetvli intrinsic.  However, I
intend not to change that interface.  The reason is simple: it has been
used for a long time, as you mentioned, and changing it would be a huge
disturbance.

There may be a similar argument for the vbool* stuff, but vbool* is
partly a historical artifact*, and we also haven't found a better way to
model it.

* We had MLEN defined in the V spec a long time ago; I forget whether
that was 0.7 or 0.8.


Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-11-07 Thread juzhe.zh...@rivai.ai
Another question occurred to me.

Is it necessary for us to have so many variants of vsetvl?

I am thinking about a redesign:

__riscv_vsetvl_e8mf8
__riscv_vsetvl_e16mf4
__riscv_vsetvl_e32mf2
__riscv_vsetvl_e64m1

These are quite redundant: they all produce the same result.

Maybe just design it as:

__riscv_vsetvl_ratio64

I am not proposing it, since the interface has been used for a long
time.  Just raising my concern.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-11-08 14:33
To: juzhe.zh...@rivai.ai
CC: gcc-patches; Kito.cheng; jeffreyalaw; Robin Dapp
Subject: Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]
OK, then LGTM, thanks for the explanation :)
 
On Wed, Nov 8, 2023 at 2:33 PM juzhe.zh...@rivai.ai
 wrote:
>
> More details:
>
> bb 1   bb 2
>    \   /
>    bb 3
>
> VSETVL PASS can only do VSETVL demand fusion, fuse demand from bb 3 to bb 1, 
> and fuse demand from bb 3 to bb2.
> We are not able to remove block bb 1 and bb 2 and create new bb 4 to hold the 
> vsetvl if bb 1 and bb 2 has the same vsetvl:
>
> bb 4 (new block)
>   |
> bb 3
>
> I don't think we should do this on VSETVL PASS.
> 
> juzhe.zh...@rivai.ai
>
>

Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-11-07 Thread Kito Cheng
OK, then LGTM, thanks for the explanation :)

On Wed, Nov 8, 2023 at 2:33 PM juzhe.zh...@rivai.ai
 wrote:
>
> More details:
>
> bb 1   bb 2
>    \   /
>    bb 3
>
> VSETVL PASS can only do VSETVL demand fusion, fuse demand from bb 3 to bb 1, 
> and fuse demand from bb 3 to bb2.
> We are not able to remove block bb 1 and bb 2 and create new bb 4 to hold the 
> vsetvl if bb 1 and bb 2 has the same vsetvl:
>
> bb 4 (new block)
>   |
> bb 3
>
> I don't think we should do this on VSETVL PASS.
> 
> juzhe.zh...@rivai.ai
>
>

Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-11-07 Thread juzhe.zh...@rivai.ai
More details:

bb 1   bb 2
   \   /
   bb 3

The VSETVL PASS can only do VSETVL demand fusion: fuse the demand from 
bb 3 into bb 1, and fuse the demand from bb 3 into bb 2.
We are not able to remove blocks bb 1 and bb 2 and create a new bb 4 to 
hold the vsetvl, even if bb 1 and bb 2 have the same vsetvl:

bb 4 (new block)
  |
bb 3

I don't think we should do this in the VSETVL PASS.


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-11-08 14:16
To: Juzhe-Zhong
CC: gcc-patches; kito.cheng; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]
I thought vsetvli insertion will try to merge them into one for those
cases? Could you explain few more reasons why they are not fused now?
Not an objection since I could imageing that would be easier to
process, just wondering why.
 

Re: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-11-07 Thread juzhe.zh...@rivai.ai
Before the VSETVL PASS, the code is as follows:

bb 1:
vsetvli e16mf2 -> set a6
bb 2:
vsetvli e32m1 -> set a6
bb 3:
...
vle (use a6) e32m1 TU
vle (use a6) e32m1 TU
vse (use a6) e32m1 TU

The VSETVL PASS only does VSETVL information fusion; it doesn't do CFG 
block fusion.

The VSETVL PASS succeeds on the following fusion:

Change bb 1 vsetvli e16mf2 -> e32m1TU
Change bb 2 vsetvli e32m1 -> e32m1TU

But the VSETVL pass can't remove bb 1 and bb 2 and create a new block, 
say bb 4, to hold the vsetvli e32m1TU

So you will see:
bb 1:
vsetvli e32m1TU
bb 2:
vsetvli e32m1TU
bb 3:
...
vle
vle
vse

With this patch, since vsetvl e16mf2 and vsetvl e32m1 are normalized into 
the same vsetvl e8mf4, before the VSETVL PASS we will instead see:

bb 1
vsetvli e8mf4
bb 2:
...
vle
vle
vse

Since the later vle/vle/vse use e32m1TU, the VSETVL pass fuses that 
demand into bb 1, changing vsetvli e8mf4 into:

bb 1
vsetvli e32m1TU
bb 2:
...
vle
vle
vse


juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-11-08 14:16
To: Juzhe-Zhong
CC: gcc-patches; kito.cheng; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]
I thought vsetvli insertion will try to merge them into one for those
cases? Could you explain few more reasons why they are not fused now?
Not an objection since I could imageing that would be easier to
process, just wondering why.
 

Re: [PATCH] RISC-V: Normalize user vsetvl intrinsics[PR112092]

2023-11-07 Thread Kito Cheng
I thought vsetvli insertion would try to merge them into one for those
cases? Could you explain a few more of the reasons why they are not
fused now? Not an objection, since I could imagine that would be easier
to process; just wondering why.

On Wed, Nov 8, 2023 at 2:11 PM Juzhe-Zhong  wrote:
>
> Since our user vsetvl intrinsics are defined as just calculating the VL
> output, which is the number of elements to be processed, such intrinsics
> do not have any side effects.  We should normalize them when they have
> the same ratio.
>
> E.g. the __riscv_vsetvl_e8mf8 result is the same as __riscv_vsetvl_e64m1.
>
> Normalizing them allows us to generate better code.
> Consider the following example:
>
> #include "riscv_vector.h"
>
> void foo(int32_t *in1, int32_t *in2, int32_t *in3, int32_t *out, size_t n, 
> int cond, int avl) {
>
>   size_t vl;
>   if (cond)
> vl = __riscv_vsetvl_e32m1(avl);
>   else
> vl = __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
> vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
> vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
> vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
> __riscv_vse32_v_i32m1(out, c, vl);
>   }
> }
>
> Before this patch:
>
> foo:
>         beq     a5,zero,.L2
>         vsetvli a6,a6,e32,m1,tu,ma
> .L3:
>         li      a5,0
>         beq     a4,zero,.L9
> .L4:
>         vle32.v v1,0(a0)
>         addi    a5,a5,1
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         bne     a4,a5,.L4
> .L9:
>         ret
> .L2:
>         vsetvli zero,a6,e32,m1,tu,ma
>         j       .L3
>
> After this patch:
>
> foo:
>         li      a5,0
>         vsetvli zero,a6,e32,m1,tu,ma
>         beq     a4,zero,.L9
> .L4:
>         vle32.v v1,0(a0)
>         addi    a5,a5,1
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         bne     a4,a5,.L4
> .L9:
>         ret
>
> PR target/112092
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-vector-builtins-bases.cc: Normalize the vsetvls.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/vsetvl/pr109743-1.c: Adapt test.
> * gcc.target/riscv/rvv/vsetvl/pr109743-3.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvl-11.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvl-15.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvlmax-13.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvlmax-15.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvlmax-5.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvlmax-7.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/vsetvlmax-8.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/pr112092-1.c: New test.
> * gcc.target/riscv/rvv/vsetvl/pr112092-2.c: New test.
>
> ---
>  .../riscv/riscv-vector-builtins-bases.cc  | 24 +-
>  .../gcc.target/riscv/rvv/vsetvl/pr109743-1.c  |  2 +-
>  .../gcc.target/riscv/rvv/vsetvl/pr109743-3.c  |  3 +--
>  .../gcc.target/riscv/rvv/vsetvl/pr112092-1.c  | 25 +++
>  .../gcc.target/riscv/rvv/vsetvl/pr112092-2.c  | 25 +++
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-11.c   |  2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-15.c   |  2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-22.c   |  2 +-
>  .../riscv/rvv/vsetvl/vsetvlmax-13.c   |  4 +--
>  .../riscv/rvv/vsetvl/vsetvlmax-15.c   |  6 ++---
>  .../gcc.target/riscv/rvv/vsetvl/vsetvlmax-5.c |  4 +--
>  .../gcc.target/riscv/rvv/vsetvl/vsetvlmax-7.c |  2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvlmax-8.c |  4 +--
>  13 files changed, 83 insertions(+), 22 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112092-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112092-2.c
>
> diff --git a/gcc/config/riscv/riscv-vector-builtins-bases.cc 
> b/gcc/config/riscv/riscv-vector-builtins-bases.cc
> index 0298b7987a1..d70468542ee 100644
> --- a/gcc/config/riscv/riscv-vector-builtins-bases.cc
> +++ b/gcc/config/riscv/riscv-vector-builtins-bases.cc
> @@ -131,19 +131,31 @@ public:
>
>  tree type = builtin_types[e.type.index].vector;
>  machine_mode mode = TYPE_MODE (type);
> -machine_mode inner_mode = GET_MODE_INNER (mode);
> +/* Normalize same RATIO (SEW/LMUL) into same vsetvl instruction.
> +
> +- e8,mf8/e16,mf4/e32,mf2/e64,m1 --> e8mf8
> +- e8,mf4/e16,mf2/e32,m1/e64,m2  --> e8mf4
> +- e8,mf2/e16,m1/e32,m2/e64,m4   --> e8mf2
> +- e8,m1/e16,m2/e32,m4/e64,m8--> e8m1
> +- e8,m2/e16,m4/e32,m8   --> e8m2
> +- e8,m4/e16,m8  --> e8m4
> +- e8,m8 --> e8m8
> +*/
>  /* SEW.  */
> -e.add_input_operand (Pmode,
> -gen_int_mode (GET_MODE_BITSIZE (inner_mode), Pmode));
> +e.add_input_operand (Pmode, gen_int_mode (8, Pmode));
>
>  /* LMUL.  */
> -