Thanks Richard.

Let me clarify again to make sure I understand your comments correctly:

Do you suggest not to model address cost here like other partial vectorization 
style (while_ult, avx512...etc).
Then set COST = 1 since we only have SELECT_VL since beginning.
At various cases we saw, COST=1 is better than COST=2.

Thanks.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-12-14 18:46
To: juzhe.zhong
CC: gcc-patches; richard.sandiford; jeffreyalaw
Subject: Re: [PATCH] Middle-end: Adjust decrement IV style partial 
vectorization COST model


Am 14.12.2023 um 09:28 schrieb juzhe.zh...@rivai.ai:

 
Hi, Richard.

I have a question about the decrement IV costing since I find the reduction 
case is generating inferior codegen.

reduc_plus_int:
mv a3,a0
ble a1,zero,.L7
addiw a5,a1,-1
li a4,2
bleu a5,a4,.L8
vsetivli zero,4,e32,m1,ta,ma
srliw a4,a1,2
vmv.v.i v1,0
slli a4,a4,4
add a4,a4,a0
mv a5,a0
.L4:
vle32.v v2,0(a5)
addi a5,a5,16
vadd.vv v1,v1,v2
bne a5,a4,.L4
li a5,0
vmv.s.x v2,a5
andi a5,a1,-4
vredsum.vs v1,v1,v2
vmv.x.s a0,v1
beq a1,a5,.L12
.L3:
subw a1,a1,a5
slli a5,a5,32
srli a5,a5,32
slli a1,a1,32
vsetvli a4,zero,e32,m1,ta,ma
slli a5,a5,2
srli a1,a1,32
vmv.v.i v1,0
add a3,a3,a5
vsetvli a1,a1,e8,mf4,ta,ma
vle32.v v3,0(a3)
li a5,0
vsetivli zero,1,e32,m1,ta,ma
vmv.s.x v2,a5
vsetvli zero,a1,e32,m1,tu,ma
vmv.v.v v1,v3
vsetvli a4,zero,e32,m1,ta,ma
vredsum.vs v1,v1,v2
vmv.x.s a5,v1
addw a0,a0,a5
ret
.L12:
ret
.L7:
li a0,0
ret
.L8:
li a5,0
li a0,0
j .L3

This patch adjust length_update_cost from 3 (original cost) into 2 can fix 
conversion case (the test append in this patch).
But can't fix reduction case.

Then I adjust it into COST = 1:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 19e38b8637b..50c99b1fe79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -4877,7 +4877,7 @@ vect_estimate_min_profitable_iters (loop_vec_info 
loop_vinfo,
                 processed in current iteration, and a SHIFT operation to
                 compute the next memory address instead of adding vectorization
                 factor.  */
-             length_update_cost = 2;
+             length_update_cost = 1;
            else
              /* For increment IV stype, Each may need two MINs and one MINUS to
                 update lengths in body for next iteration.  */

Then the codegen is reasonable now:

reduc_plus_int:
ble a1,zero,.L4
vsetvli a5,zero,e32,m1,ta,ma
vmv.v.i v1,0
.L3:
vsetvli a5,a1,e32,m1,tu,ma
vle32.v v2,0(a0)
slli a4,a5,2
sub a1,a1,a5
add a0,a0,a4
vadd.vv v1,v2,v1
bne a1,zero,.L3
li a5,0
vsetivli zero,1,e32,m1,ta,ma
vmv.s.x v2,a5
vsetvli a5,zero,e32,m1,ta,ma
vredsum.vs v1,v1,v2
vmv.x.s a0,v1
ret
.L4:
li a0,0
ret

The reason I set COST = 2 instead of 1 in this patch since

one COST is for SELECT_VL.

The other is for memory address calculation since we don't update memory 
address with adding VF directly,
instead:

We shift the result of SELECV_VL, and then add it into the memory IV as follows:

SSA_1 = SELECT_VL --> SSA_1 is element-wise
SSA_2 = SSA_1 << 1 (If element is INT16, make it to be bytes-wise)
next iteration memory address = current iteration memory address + SSA_2.

If elemente is INT8, then the shift operation is not needed:
SSA_2 = SSA_1 << 1 since it is already byte-wise.

So, my question is the COST should be 1 or 2.
It seems that COST = 1 is better for
using SELECT_VL.

We are not modeling address cost, so trying to account for this here is only 
going to be a heuristic that’s as many times wrong than it is correct.  If the 
difference between SELECT_VL and not is so small  then you’ll have a hard time 
modeling this here.

 
Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-12-13 18:17
To: Juzhe-Zhong
CC: gcc-patches; richard.sandiford; jeffreyalaw
Subject: Re: [PATCH] Middle-end: Adjust decrement IV style partial 
vectorization COST model
On Wed, 13 Dec 2023, Juzhe-Zhong wrote:
 
> Hi, before this patch, a simple conversion case for RVV codegen:
> 
> foo:
>         ble     a2,zero,.L8
>         addiw   a5,a2,-1
>         li      a4,6
>         bleu    a5,a4,.L6
>         srliw   a3,a2,3
>         slli    a3,a3,3
>         add     a3,a3,a0
>         mv      a5,a0
>         mv      a4,a1
>         vsetivli        zero,8,e16,m1,ta,ma
> .L4:
>         vle8.v  v2,0(a5)
>         addi    a5,a5,8
>         vzext.vf2       v1,v2
>         vse16.v v1,0(a4)
>         addi    a4,a4,16
>         bne     a3,a5,.L4
>         andi    a5,a2,-8
>         beq     a2,a5,.L10
> .L3:
>         slli    a4,a5,32
>         srli    a4,a4,32
>         subw    a2,a2,a5
>         slli    a2,a2,32
>         slli    a5,a4,1
>         srli    a2,a2,32
>         add     a0,a0,a4
>         add     a1,a1,a5
>         vsetvli zero,a2,e16,m1,ta,ma
>         vle8.v  v2,0(a0)
>         vzext.vf2       v1,v2
>         vse16.v v1,0(a1)
> .L8:
>         ret
> .L10:
>         ret
> .L6:
>         li      a5,0
>         j       .L3
> 
> This vectorization go through first loop:
> 
>         vsetivli        zero,8,e16,m1,ta,ma
> .L4:
>         vle8.v  v2,0(a5)
>         addi    a5,a5,8
>         vzext.vf2       v1,v2
>         vse16.v v1,0(a4)
>         addi    a4,a4,16
>         bne     a3,a5,.L4
> 
> Each iteration processes 8 elements.
> 
> For a scalable vectorization with VLEN > 128 bits CPU, it's ok when VLEN = 
> 128.
> But, as long as VLEN > 128 bits, it will waste the CPU resources. That is, 
> e.g. VLEN = 256bits.
> only half of the vector units are working and another half is idle.
> 
> After investigation, I realize that I forgot to adjust COST for SELECT_VL.
> So, adjust COST for SELECT_VL styple length vectorization. We adjust COST 
> from 3 to 2. since
> after this patch:
> 
> foo:
> ble a2,zero,.L5
> .L3:
> vsetvli a5,a2,e16,m1,ta,ma     -----> SELECT_VL cost.
> vle8.v v2,0(a0)
> slli a4,a5,1                -----> additional shift of outcome SELECT_VL for 
> memory address calculation.
> vzext.vf2 v1,v2
> sub a2,a2,a5
> vse16.v v1,0(a1)
> add a0,a0,a5
> add a1,a1,a4
> bne a2,zero,.L3
> .L5:
> ret
> 
> This patch is a simple fix that I previous forgot.
> 
> Ok for trunk ?
 
OK.
 
Richard.
 
> If not, I am going to adjust cost in backend cost model.
> 
> PR target/111317
> 
> gcc/ChangeLog:
> 
> * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Adjust for COST for 
> decrement IV.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.dg/vect/costmodel/riscv/rvv/pr111317.c: New test.
> 
> ---
>  .../gcc.dg/vect/costmodel/riscv/rvv/pr111317.c  | 12 ++++++++++++
>  gcc/tree-vect-loop.cc                           | 17 ++++++++++++++---
>  2 files changed, 26 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c 
> b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c
> new file mode 100644
> index 00000000000..d4bea242a9a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
> --param=riscv-autovec-lmul=m1" } */
> +
> +void
> +foo (char *__restrict a, short *__restrict b, int n)
> +{
> +  for (int i = 0; i < n; i++)
> +    b[i] = (short) a[i];
> +}
> +
> +/* { dg-final { scan-assembler-times 
> {vsetvli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*e16,\s*m1,\s*t[au],\s*m[au]} 1 } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 6261cd1be1d..19e38b8637b 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4870,10 +4870,21 @@ vect_estimate_min_profitable_iters (loop_vec_info 
> loop_vinfo,
>      if (partial_load_store_bias != 0)
>        body_stmts += 1;
>  
> -     /* Each may need two MINs and one MINUS to update lengths in body
> -        for next iteration.  */
> +     unsigned int length_update_cost = 0;
> +     if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +       /* For decrement IV style, we use a single SELECT_VL since
> + beginning to calculate the number of elements need to be
> + processed in current iteration, and a SHIFT operation to
> + compute the next memory address instead of adding vectorization
> + factor.  */
> +       length_update_cost = 2;
> +     else
> +       /* For increment IV stype, Each may need two MINs and one MINUS to
> + update lengths in body for next iteration.  */
> +       length_update_cost = 3;
> +
>      if (need_iterate_p)
> -       body_stmts += 3 * num_vectors;
> +       body_stmts += length_update_cost * num_vectors;
>    }
>  
>        (void) add_stmt_cost (target_cost_data, prologue_stmts,
> 
 
-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 

Reply via email to