Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
Sure. Sent it:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622836.html

juzhe.zh...@rivai.ai

From: Kito Cheng
Date: 2023-06-26 17:10
To: Robin Dapp
CC: Juzhe-Zhong; gcc-patches; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

Could you re-title this patch to something like "Support const
vector expansion with xxx pattern"?

On Mon, Jun 26, 2023 at 3:52 PM Robin Dapp via Gcc-patches wrote:
>
> Hi Juzhe,
>
> > Currently, we are able to generate step vector with base == 0:
> > { 0, 0, 2, 2, 4, 4, ... }
> >
> > ASM:
> >
> > vid
> > vand
> >
> > However, we generate wrong code for a step vector with base != 0:
> > { 1, 1, 3, 3, 5, 5, ... }
> >
> > Before this patch, such a case fails at run time.
> >
> > After this patch, we are able to pass the testcase and generate the step
> > vector with asm:
> >
> > vid
> > vand
> > vadd
>
> Can't we use the first case as long as pow2_p (base) == true
> and not just for base == 0?
>
> Regards
> Robin
>
Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
No, I don't think we can use the first case (vid + vand -npatterns) for
base != 0, since the first element value of vid is always 0.

Thanks.

juzhe.zh...@rivai.ai

From: Robin Dapp
Date: 2023-06-26 15:51
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

Hi Juzhe,

> Currently, we are able to generate step vector with base == 0:
> { 0, 0, 2, 2, 4, 4, ... }
>
> ASM:
>
> vid
> vand
>
> However, we generate wrong code for a step vector with base != 0:
> { 1, 1, 3, 3, 5, 5, ... }
>
> Before this patch, such a case fails at run time.
>
> After this patch, we are able to pass the testcase and generate the step
> vector with asm:
>
> vid
> vand
> vadd

Can't we use the first case as long as pow2_p (base) == true
and not just for base == 0?

Regards
Robin
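The reply above can be checked with a scalar model of the three-instruction sequence. The sketch below is only an illustration: the function name model_stepped_vector is hypothetical, it assumes npatterns is a power of two, and it models element values rather than the RTL the patch emits. Because vid yields 0 for element 0, the result of the AND is 0 there for any mask, so a non-zero base needs the extra add.

```c
#include <stdint.h>

/* Scalar model of vid + vand (+ vadd).  Hypothetical helper, not GCC code.
   Assumes npatterns is a power of two, so -npatterns clears the low bits.  */
static void
model_stepped_vector (int64_t *out, int n, int npatterns, int64_t base)
{
  for (int i = 0; i < n; ++i)
    {
      int64_t vid = i;                          /* vid: element index      */
      int64_t tmp = vid & -(int64_t) npatterns; /* vand with -npatterns    */
      out[i] = tmp + base;                      /* vadd; a no-op if base == 0 */
    }
}
```

With n = 8 and npatterns = 2, base 0 gives { 0, 0, 2, 2, 4, 4, 6, 6 } and base 1 gives { 1, 1, 3, 3, 5, 5, 7, 7 }, matching the two vectors quoted in the thread.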
Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
Could you re-title this patch to something like "Support const
vector expansion with xxx pattern"?

On Mon, Jun 26, 2023 at 3:52 PM Robin Dapp via Gcc-patches wrote:
>
> Hi Juzhe,
>
> > Currently, we are able to generate step vector with base == 0:
> > { 0, 0, 2, 2, 4, 4, ... }
> >
> > ASM:
> >
> > vid
> > vand
> >
> > However, we generate wrong code for a step vector with base != 0:
> > { 1, 1, 3, 3, 5, 5, ... }
> >
> > Before this patch, such a case fails at run time.
> >
> > After this patch, we are able to pass the testcase and generate the step
> > vector with asm:
> >
> > vid
> > vand
> > vadd
>
> Can't we use the first case as long as pow2_p (base) == true
> and not just for base == 0?
>
> Regards
> Robin
>
Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
Hi Juzhe,

> Currently, we are able to generate step vector with base == 0:
> { 0, 0, 2, 2, 4, 4, ... }
>
> ASM:
>
> vid
> vand
>
> However, we generate wrong code for a step vector with base != 0:
> { 1, 1, 3, 3, 5, 5, ... }
>
> Before this patch, such a case fails at run time.
>
> After this patch, we are able to pass the testcase and generate the step
> vector with asm:
>
> vid
> vand
> vadd

Can't we use the first case as long as pow2_p (base) == true
and not just for base == 0?

Regards
Robin
[PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
Currently, we are able to generate step vector with base == 0:
{ 0, 0, 2, 2, 4, 4, ... }

ASM:

vid
vand

However, we generate wrong code for a step vector with base != 0:
{ 1, 1, 3, 3, 5, 5, ... }

Before this patch, such a case fails at run time.

After this patch, we are able to pass the testcase and generate the step
vector with asm:

vid
vand
vadd

gcc/ChangeLog:

        * config/riscv/riscv-v.cc (expand_const_vector): Fix stepped vector
        with base != 0.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/autovec/partial/slp-17.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp-18.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp-19.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp_run-17.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp_run-18.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp_run-19.c: New test.

---
 gcc/config/riscv/riscv-v.cc                   | 14 +++-
 .../riscv/rvv/autovec/partial/slp-17.c        | 34
 .../riscv/rvv/autovec/partial/slp-18.c        | 26 ++
 .../riscv/rvv/autovec/partial/slp-19.c        | 26 ++
 .../riscv/rvv/autovec/partial/slp_run-17.c    | 84 +++
 .../riscv/rvv/autovec/partial/slp_run-18.c    | 69 +++
 .../riscv/rvv/autovec/partial/slp_run-19.c    | 69 +++
 7 files changed, 320 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-18.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-19.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-17.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-18.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-19.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 5518394be1e..cd3422bf711 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -1258,7 +1258,6 @@ expand_const_vector (rtx target, rtx src)
             }
           emit_move_insn (target, tmp);
         }
-      return;
     }
   else if (CONST_VECTOR_STEPPED_P (src))
     {
@@ -1287,9 +1286,20 @@ expand_const_vector (rtx target, rtx src)
           */
           rtx imm
             = gen_int_mode (-builder.npatterns (), builder.inner_mode ());
-          rtx and_ops[] = {target, vid, imm};
+          rtx tmp = gen_reg_rtx (builder.mode ());
+          rtx and_ops[] = {tmp, vid, imm};
           icode = code_for_pred_scalar (AND, builder.mode ());
           emit_vlmax_insn (icode, RVV_BINOP, and_ops);
+          HOST_WIDE_INT init_val = INTVAL (builder.elt (0));
+          if (init_val == 0)
+            emit_move_insn (target, tmp);
+          else
+            {
+              rtx dup = gen_const_vector_dup (builder.mode (), init_val);
+              rtx add_ops[] = {target, tmp, dup};
+              icode = code_for_pred (PLUS, builder.mode ());
+              emit_vlmax_insn (icode, RVV_BINOP, add_ops);
+            }
         }
       else
         {
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c
new file mode 100644
index 000..2f2c3d11c2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fdump-tree-optimized-details" } */
+
+#include
+
+void
+f (uint8_t *restrict a, uint8_t *restrict b,
+   uint8_t *restrict c, uint8_t *restrict d,
+   int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      a[i * 8] = c[i * 8] + d[i * 8];
+      a[i * 8 + 1] = c[i * 8] + d[i * 8 + 1];
+      a[i * 8 + 2] = c[i * 8 + 2] + d[i * 8 + 2];
+      a[i * 8 + 3] = c[i * 8 + 2] + d[i * 8 + 3];
+      a[i * 8 + 4] = c[i * 8 + 4] + d[i * 8 + 4];
+      a[i * 8 + 5] = c[i * 8 + 4] + d[i * 8 + 5];
+      a[i * 8 + 6] = c[i * 8 + 6] + d[i * 8 + 6];
+      a[i * 8 + 7] = c[i * 8 + 6] + d[i * 8 + 7];
+      b[i * 8] = c[i * 8 + 1] + d[i * 8];
+      b[i * 8 + 1] = c[i * 8 + 1] + d[i * 8 + 1];
+      b[i * 8 + 2] = c[i * 8 + 3] + d[i * 8 + 2];
+      b[i * 8 + 3] = c[i * 8 + 3] + d[i * 8 + 3];
+      b[i * 8 + 4] = c[i * 8 + 5] + d[i * 8 + 4];
+      b[i * 8 + 5] = c[i * 8 + 5] + d[i * 8 + 5];
+      b[i * 8 + 6] = c[i * 8 + 7] + d[i * 8 + 6];
+      b[i * 8 + 7] = c[i * 8 + 7] + d[i * 8 + 7];
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "\.VEC_PERM" 2 "optimized" } } */
+/* { dg-final { scan-assembler {\tvid\.v} } } */
+/* { dg-final { scan-assembler-not {\tvmul} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-18.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-18.c
new file mode 100644
index
Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
On 6/12/23 08:54, Kito Cheng wrote:
> I haven't done a close review yet (and I suspect I can't find time
> before I start my vacation :P), but I am thinking we may add
> selftests for expand_const_vector in the *future*. Again, not a blocker
> for this patch :)

I'll take this one. Go enjoy your vacation!

jeff
Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
I haven't done a close review yet (and I suspect I can't find time
before I start my vacation :P), but I am thinking we may add
selftests for expand_const_vector in the *future*. Again, not a blocker
for this patch :)

On Mon, Jun 12, 2023 at 10:51 PM 钟居哲 wrote:
>
> No. The pattern you pointed out is already supported.
> The operation is very simple.
> A single vmv.v.i with a larger SEW is enough. There is no need for vzext.
>
>
> juzhe.zh...@rivai.ai
>
>
> From: Robin Dapp
> Date: 2023-06-12 22:43
> To: juzhe.zhong; gcc-patches
> CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
> Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with
> decompress operation
> Hi Juzhe,
>
> seems a nice improvement, looks good to me. While reading I was wondering
> if vzext could help synthesize some (zero-based) patterns as well
> (e.g. 0 3 0 3...).
> However the sequences I could come up with were not shorter than what we
> are already emitting, so probably not.
>
> Regards
> Robin
>
Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
No. The pattern you pointed out is already supported.
The operation is very simple.
A single vmv.v.i with a larger SEW is enough. There is no need for vzext.

juzhe.zh...@rivai.ai

From: Robin Dapp
Date: 2023-06-12 22:43
To: juzhe.zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation

Hi Juzhe,

seems a nice improvement, looks good to me. While reading I was wondering
if vzext could help synthesize some (zero-based) patterns as well
(e.g. 0 3 0 3...).
However the sequences I could come up with were not shorter than what we
are already emitting, so probably not.

Regards
Robin
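The "larger SEW" remark can be illustrated with a scalar model: a two-element repeating pattern of 8-bit elements occupies the same bits as a broadcast of one 16-bit element, assuming (as RVV does) that the lower-indexed element sits in the lower bits. The helper names pack_pair and broadcast_as_bytes are hypothetical and only sketch the idea; they are not taken from the patch. Which broadcast instruction is usable depends on whether the packed value fits the instruction's immediate field, which this sketch does not model.

```c
#include <stdint.h>

/* Pack two 8-bit pattern elements into one 16-bit element,
   element 0 in the low byte.  Hypothetical helper for illustration.  */
static uint16_t
pack_pair (uint8_t e0, uint8_t e1)
{
  return (uint16_t) (e0 | ((uint16_t) e1 << 8));
}

/* Model a broadcast of WIDE at SEW=16, viewed as SEW=8 elements:
   byte i of the vector is byte (i % 2) of WIDE.  */
static void
broadcast_as_bytes (uint8_t *out, int nbytes, uint16_t wide)
{
  for (int i = 0; i < nbytes; ++i)
    out[i] = (uint8_t) (wide >> (8 * (i % 2)));
}
```

For the pattern 0 3 0 3..., pack_pair (0, 3) is 0x0300, and broadcasting that value reproduces 0, 3, 0, 3, ... at byte granularity.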
Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
Hi Juzhe,

seems a nice improvement, looks good to me. While reading I was wondering
if vzext could help synthesize some (zero-based) patterns as well
(e.g. 0 3 0 3...).
However the sequences I could come up with were not shorter than what we
are already emitting, so probably not.

Regards
Robin
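For readers following this thread, the decompress operation under review (viota.m followed by a masked vrgather.vv, as described in the RVV spec section the patch cites) can be modeled in scalar C. This is a sketch only: model_viota and model_decompress are hypothetical names, elements are plain chars, and masked-off destination elements are left undisturbed to match the mask-undisturbed behaviour shown in the spec's example.

```c
#include <stddef.h>

/* Model of viota.m: iota[i] = number of set mask bits at indices < i.  */
static void
model_viota (unsigned *iota, const unsigned char *mask, size_t n)
{
  unsigned count = 0;
  for (size_t i = 0; i < n; ++i)
    {
      iota[i] = count;
      if (mask[i])
        count++;
    }
}

/* Model of decompress: where mask[i] is set, take the next packed
   element; elsewhere keep the old destination value.  IOTA is scratch
   space of N entries supplied by the caller.  */
static void
model_decompress (char *dest, const char *packed,
                  const unsigned char *mask, unsigned *iota, size_t n)
{
  model_viota (iota, mask, n);
  for (size_t i = 0; i < n; ++i)
    if (mask[i])
      dest[i] = packed[iota[i]];   /* masked vrgather.vv */
}
```

Using the spec's example with element 0 written first, mask { 1, 0, 1, 1, 1, 0, 0, 1 }, packed data "abcde" and destination "wvutsrqp" give "avbcdrqe", which is the spec's "e q r d c b v a" read from element 7 down to element 0.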
[PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
From: Juzhe-Zhong

According to RVV ISA:
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc

We can enhance VLA SLP auto-vectorization with (16.5.1. Synthesizing
vdecompress) Decompress operation.

Case 1 (nunits = POLY_INT_CST [16, 16]):
_48 = VEC_PERM_EXPR <_37, _35, { 0, POLY_INT_CST [16, 16], 1, POLY_INT_CST [17, 16], 2, POLY_INT_CST [18, 16], ... }>;

We can optimize such VLA SLP permutation pattern into:
_48 = vdecompress (_37, _35, mask = { 0, 1, 0, 1, ... });

Case 2 (nunits = POLY_INT_CST [16, 16]):
_23 = VEC_PERM_EXPR <_46, _44, { POLY_INT_CST [1, 1], POLY_INT_CST [3, 3], POLY_INT_CST [2, 1], POLY_INT_CST [4, 3], POLY_INT_CST [3, 1], POLY_INT_CST [5, 3], ... }>;

We can optimize such VLA SLP permutation pattern into:
_48 = vdecompress (slidedown(_46, 1/2 nunits), slidedown(_44, 1/2 nunits), mask = { 0, 1, 0, 1, ... });

For example:

void __attribute__ ((noinline, noclone))
vec_slp (uint64_t *restrict a, uint64_t b, uint64_t c, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[i * 2] += b;
      a[i * 2 + 1] += c;
    }
}

ASM:
...
        vid.v   v0
        vand.vi v0,v0,1
        vmseq.vi        v0,v0,1 ===> mask = { 0, 1, 0, 1, ... }
vdecompress:
        viota.m v3,v0
        vrgather.vv     v2,v1,v3,v0.t
...

gcc/ChangeLog:

        * config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): New function.
        (expand_const_vector): Enhance repeating sequence mask.
        (shuffle_decompress_patterns): New function.
        (expand_vec_perm_const_1): Add decompress optimization.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/autovec/partial/slp-8.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp-9.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp_run-8.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/slp_run-9.c: New test.

---
 gcc/config/riscv/riscv-v.cc                   | 146 +-
 .../riscv/rvv/autovec/partial/slp-8.c         |  30
 .../riscv/rvv/autovec/partial/slp-9.c         |  31
 .../riscv/rvv/autovec/partial/slp_run-8.c     |  30
 .../riscv/rvv/autovec/partial/slp_run-9.c     |  30
 5 files changed, 260 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index e1b85a5af91..3cea6b25261 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -836,6 +836,46 @@ emit_vlmax_masked_gather_mu_insn (rtx target, rtx op, rtx sel, rtx mask)
   emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MU, ops);
 }

+/* According to RVV ISA spec (16.5.1. Synthesizing vdecompress):
+   https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
+
+  There is no inverse vdecompress provided, as this operation can be readily
+  synthesized using iota and a masked vrgather:
+
+    Desired functionality of 'vdecompress'
+       7 6 5 4 3 2 1 0     # vid
+
+             e d c b a     # packed vector of 5 elements
+       1 0 0 1 1 1 0 1     # mask vector of 8 elements
+       p q r s t u v w     # destination register before vdecompress
+
+       e q r d c b v a     # result of vdecompress
+     # v0 holds mask
+     # v1 holds packed data
+     # v11 holds input expanded vector and result
+     viota.m v10, v0                 # Calc iota from mask in v0
+     vrgather.vv v11, v1, v10, v0.t  # Expand into destination
+     p q r s t u v w  # v11 destination register
+           e d c b a  # v1 source vector
+     1 0 0 1 1 1 0 1  # v0 mask vector
+
+     4 4 4 3 2 1 1 0  # v10 result of viota.m
+     e q r d c b v a  # v11 destination after vrgather using viota.m under mask
+*/
+static void
+emit_vlmax_decompress_insn (rtx target, rtx op, rtx mask)
+{
+  machine_mode data_mode = GET_MODE (target);
+  machine_mode sel_mode = related_int_vector_mode (data_mode).require ();
+  if (GET_MODE_INNER (data_mode) == QImode)
+    sel_mode = get_vector_mode (HImode, GET_MODE_NUNITS (data_mode)).require ();
+
+  rtx sel = gen_reg_rtx (sel_mode);
+  rtx iota_ops[] = {sel, mask};
+  emit_vlmax_insn (code_for_pred_iota (sel_mode), RVV_UNOP, iota_ops);
+  emit_vlmax_masked_gather_mu_insn (target, op, sel, mask);
+}
+
 /* Emit merge instruction.  */

 static machine_mode
@@ -934,14 +974,41 @@ expand_const_vector (rtx target, rtx src)
 {
   machine_mode mode = GET_MODE (target);
   scalar_mode elt_mode = GET_MODE_INNER (mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (mode);
+  unsigned int nelts_per_pattern = CONST_VECTOR_NELTS_PER_PATTERN (src);
+  unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
   if (GET_MODE_CLASS (mode) == MODE_VECTOR_BOOL)
     {
       rtx elt;
-      gcc_assert (
-        const_vec_duplicate_p