Re: [PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction

2023-10-19 Thread Lehua Ding
Committed after the commited of the vsetvl pass refactor patch, thanks 
Robin.


On 2023/10/19 16:43, Robin Dapp wrote:

Hi Juzhe,

as discussed off-list this approach generally makes sense to me so
the patch LGTM once the vsetvl rework is upstream and settled.

Independently, we still need to understand why the more complex
broadcast pattern is not hoisted out of the loop.

Regards
  Robin


--
Best,
Lehua (RiVAI)
lehua.d...@rivai.ai



Re: Re: [PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction

2023-10-19 Thread 钟居哲
May be it is COST issue of RVV instruction ?

  /* TODO: We set RVV instruction cost as 1 by default.
 Cost Model need to be well analyzed and supported in the future. */
  if (riscv_v_ext_mode_p (mode))
{
  *total = COSTS_N_INSNS (1);
  return true;
}

Since all RVV instructions are considered as very cheap (COST = 1).


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-10-19 16:43
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; jeffreyalaw
Subject: Re: [PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction
Hi Juzhe,
 
as discussed off-list this approach generally makes sense to me so
the patch LGTM once the vsetvl rework is upstream and settled.
 
Independently, we still need to understand why the more complex
broadcast pattern is not hoisted out of the loop.
 
Regards
Robin
 


Re: [PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction

2023-10-19 Thread Robin Dapp
Hi Juzhe,

as discussed off-list this approach generally makes sense to me so
the patch LGTM once the vsetvl rework is upstream and settled.

Independently, we still need to understand why the more complex
broadcast pattern is not hoisted out of the loop.

Regards
 Robin


Re: [PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction

2023-10-18 Thread juzhe.zh...@rivai.ai
More details of VSETVL bug:

Loop:
   10ddc:   9ed030d7vmv1r.v v1,v13
   10de0:   b21040d7vncvt.x.x.w v1,v1
   10de4:   5e0785d7vmv.v.v v11,v15
   10de8:   b700a5d7vmacc.vvv11,v1,v16
   10dec:   a6e8a0d7vmadd.vvv1,v17,v14
   10df0:   26b7b5d7vand.vi v11,v11,15
   10df4:   0c75f7d7vsetvli a5,a1,e8,mf2,ta,ma
   10df8:   0c707557vsetvli a0,zero,e8,mf2,ta,ma
   10dfc:   2617b0d7vand.vi v1,v1,15
   10e00:   0c75f057vsetvli zero,a1,e8,mf2,ta,ma
   10e04:   8d9dsub a1,a1,a5
   10e06:   020705a7vse8.v  v11,(a4)
   10e0a:   0c77f057vsetvli zero,a5,e8,mf2,ta,ma
   10e0e:   020685a7vse8.v  v11,(a3)
   10e12:   020600a7vse8.v  v1,(a2)
   10e16:   973eadd a4,a4,a5
   10e18:   0c807557vsetvli a0,zero,e16,m1,ta,ma
   10e1c:   96beadd a3,a3,a5
   10e1e:   963eadd a2,a2,a5
   10e20:   02d606d7vadd.vv v13,v13,v12
   10e24:   fdc5bneza1,10ddc 

The vncvt.x.x.w consume e16m1 VTYPE vsetvl but it shouldn't, it should be e8mf2.
This issue is fixed by recent refactor patch.


juzhe.zh...@rivai.ai
 
From: Juzhe-Zhong
Date: 2023-10-18 18:25
To: gcc-patches
CC: kito.cheng; kito.cheng; jeffreyalaw; rdapp.gcc; Juzhe-Zhong
Subject: [PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction
Confirm dynamic LMUL algorithm works well for choosing LMUL = 4 for the PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111848
 
But it generate horrible register spillings.
 
The root cause is that we didn't hoist the vmv.v.x outside the loop which
increase the SLP loop register pressure.
 
So, change the COSNT_VECTOR move into vec_duplicate splitter that we can gain 
better optimizations:
 
1. better LICM.
2. More opportunities of transforming 'vv' into 'vx' in the future.
 
Before this patch:
 
f3:
ble a4,zero,.L8
csrrt0,vlenb
sllit1,t0,4
csrra6,vlenb
sub sp,sp,t1
csrra5,vlenb
sllia6,a6,3
sllia5,a5,2
add a6,a6,sp
vsetvli a7,zero,e16,m8,ta,ma
sllia4,a4,3
vid.v   v8
addit6,a5,-1
vand.vi v8,v8,-2
neg t5,a5
vs8r.v  v8,0(sp)
vadd.vi v8,v8,1
vs8r.v  v8,0(a6)
j   .L4
.L12:
vsetvli a7,zero,e16,m8,ta,ma
.L4:
csrrt0,vlenb
sllit0,t0,3
vl8re16.v   v16,0(sp)
add t0,t0,sp
vmv.v.x v8,t6
mv  t1,a4
vand.vv v24,v16,v8
mv  a6,a4
vl8re16.v   v16,0(t0)
vand.vv v8,v16,v8
bleua4,a5,.L3
mv  a6,a5
.L3:
vsetvli zero,a6,e8,m4,ta,ma
vle8.v  v20,0(a2)
vle8.v  v16,0(a3)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v20,v24
vadd.vv v4,v16,v4
vsetvli zero,a6,e8,m4,ta,ma
vse8.v  v4,0(a0)
vle8.v  v20,0(a2)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v20,v8
vadd.vv v4,v4,v16
vsetvli zero,a6,e8,m4,ta,ma
vse8.v  v4,0(a1)
add a4,a4,t5
add a0,a0,a5
add a3,a3,a5
add a1,a1,a5
add a2,a2,a5
bgtut1,a5,.L12
csrrt0,vlenb
sllit1,t0,4
add sp,sp,t1
jr  ra
.L8:
ret
 
After this patch:
 
f3:
ble a4,zero,.L6
csrr a6,vlenb
csrr a5,vlenb
slli a6,a6,2
slli a5,a5,2
addi a6,a6,-1
slli a4,a4,3
neg t5,a5
vsetvli t1,zero,e16,m8,ta,ma
vmv.v.x v24,a6
vid.v v8
vand.vi v8,v8,-2
vadd.vi v16,v8,1
vand.vv v8,v8,v24
vand.vv v16,v16,v24
.L4:
mv t1,a4
mv a6,a4
bleu a4,a5,.L3
mv a6,a5
.L3:
vsetvli zero,a6,e8,m4,ta,ma
vle8.v v28,0(a2)
vle8.v v24,0(a3)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v28,v8
vadd.vv v4,v24,v4
vsetvli zero,a6,e8,m4,ta,ma
vse8.v v4,0(a0)
vle8.v v28,0(a2)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v28,v16
vadd.vv v4,v4,v24
vsetvli zero,a6,e8,m4,ta,ma
vse8.v v4,0(a1)
add a4,a4,t5
add a0,a0,a5
add a3,a3,a5
add a1,a1,a5
add a2,a2,a5
bgtu t1,a5,.L4
.L6:
ret
 
Note that this patch triggers multiple FAILs:
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-4.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-4.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-8.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-8.c execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-1.c 
execution test
FAIL

[PATCH V2] RISC-V: Fix failed hoist in LICM of vmv.v.x instruction

2023-10-18 Thread Juzhe-Zhong
Confirm dynamic LMUL algorithm works well for choosing LMUL = 4 for the PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111848

But it generate horrible register spillings.

The root cause is that we didn't hoist the vmv.v.x outside the loop which
increase the SLP loop register pressure.

So, change the COSNT_VECTOR move into vec_duplicate splitter that we can gain 
better optimizations:

1. better LICM.
2. More opportunities of transforming 'vv' into 'vx' in the future.

Before this patch:

f3:
ble a4,zero,.L8
csrrt0,vlenb
sllit1,t0,4
csrra6,vlenb
sub sp,sp,t1
csrra5,vlenb
sllia6,a6,3
sllia5,a5,2
add a6,a6,sp
vsetvli a7,zero,e16,m8,ta,ma
sllia4,a4,3
vid.v   v8
addit6,a5,-1
vand.vi v8,v8,-2
neg t5,a5
vs8r.v  v8,0(sp)
vadd.vi v8,v8,1
vs8r.v  v8,0(a6)
j   .L4
.L12:
vsetvli a7,zero,e16,m8,ta,ma
.L4:
csrrt0,vlenb
sllit0,t0,3
vl8re16.v   v16,0(sp)
add t0,t0,sp
vmv.v.x v8,t6
mv  t1,a4
vand.vv v24,v16,v8
mv  a6,a4
vl8re16.v   v16,0(t0)
vand.vv v8,v16,v8
bleua4,a5,.L3
mv  a6,a5
.L3:
vsetvli zero,a6,e8,m4,ta,ma
vle8.v  v20,0(a2)
vle8.v  v16,0(a3)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v20,v24
vadd.vv v4,v16,v4
vsetvli zero,a6,e8,m4,ta,ma
vse8.v  v4,0(a0)
vle8.v  v20,0(a2)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v20,v8
vadd.vv v4,v4,v16
vsetvli zero,a6,e8,m4,ta,ma
vse8.v  v4,0(a1)
add a4,a4,t5
add a0,a0,a5
add a3,a3,a5
add a1,a1,a5
add a2,a2,a5
bgtut1,a5,.L12
csrrt0,vlenb
sllit1,t0,4
add sp,sp,t1
jr  ra
.L8:
ret

After this patch:

f3:
ble a4,zero,.L6
csrra6,vlenb
csrra5,vlenb
sllia6,a6,2
sllia5,a5,2
addia6,a6,-1
sllia4,a4,3
neg t5,a5
vsetvli t1,zero,e16,m8,ta,ma
vmv.v.x v24,a6
vid.v   v8
vand.vi v8,v8,-2
vadd.vi v16,v8,1
vand.vv v8,v8,v24
vand.vv v16,v16,v24
.L4:
mv  t1,a4
mv  a6,a4
bleua4,a5,.L3
mv  a6,a5
.L3:
vsetvli zero,a6,e8,m4,ta,ma
vle8.v  v28,0(a2)
vle8.v  v24,0(a3)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v28,v8
vadd.vv v4,v24,v4
vsetvli zero,a6,e8,m4,ta,ma
vse8.v  v4,0(a0)
vle8.v  v28,0(a2)
vsetvli a7,zero,e8,m4,ta,ma
vrgatherei16.vv v4,v28,v16
vadd.vv v4,v4,v24
vsetvli zero,a6,e8,m4,ta,ma
vse8.v  v4,0(a1)
add a4,a4,t5
add a0,a0,a5
add a3,a3,a5
add a1,a1,a5
add a2,a2,a5
bgtut1,a5,.L4
.L6:
ret

Note that this patch triggers multiple FAILs:
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-4.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-4.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-8.c execution test
FAIL: gcc.target/riscv/rvv/autovec/cond/cond_arith_run-8.c execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-1.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-1.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-2.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-2.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-1.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-1.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-2.c 
execution test
FAIL: gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-2.c 
execution test

They failed are all because of bugs on VSETVL PASS:

10dd4:   0c707057vsetvli zero,zero,e8,mf2,ta,ma
   10dd8:   5e06b8d7vmv.v.i v17,13
   10ddc:   9ed030d7vmv1r.v v1,v13
   10de0:   b21040d7vncvt.x.x.w v1,v1   > 
raise illegal instruction since we don't have SEW = 8 -> SEW = 4 narrowing.
   10de4:   5e0785d7vmv.v.v v11,v15

Confirm the recent VSETVL refactor patch: 
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633231.html fixed all of 
them.

So this patch should be committed after the VSETVL refactor patch.

PR target/111848