Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

2023-06-26 Thread juzhe.zh...@rivai.ai
Sure. Sent it:
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622836.html 




juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-06-26 17:10
To: Robin Dapp
CC: Juzhe-Zhong; gcc-patches; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
Could you re-title this patch into something like "Support const
vector expansion with xxx pattern",
 
On Mon, Jun 26, 2023 at 3:52 PM Robin Dapp via Gcc-patches
 wrote:
>
> Hi Juzhe,
>
> > Currently, we are able to generate step vector with base == 0:
> >  { 0, 0, 2, 2, 4, 4, ... }
> >
> > ASM:
> >
> > vid
> > vand
> >
> > However, we do wrong for step vector with base != 0:
> > { 1, 1, 3, 3, 5, 5, ... }
> >
> > Before this patch, such case will run fail.
> >
> > After this patch, we are able to pass the testcase and generate the step 
> > vector with asm:
> >
> > vid
> > vand
> > vadd
>
> Can't we use the first case as long as pow2_p (base) == true
> and not just for base == 0?
>
> Regards
>  Robin
>
 


Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

2023-06-26 Thread juzhe.zh...@rivai.ai
No, I don't think we can use the first case (vid + vand with -npatterns) for
base != 0, since the first element of vid is always 0, so the first element
after vand is 0 as well.
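
To illustrate the point, here is a minimal scalar sketch in C (not GCC code). The helper names matches_vid_vand and reachable_by_vid_vand are hypothetical, and the searched immediate range of -64..63 is an arbitrary assumption chosen for the demo. It checks whether a wanted vector can be produced by "vid; vand imm" alone:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if want[i] == (i & imm) for every lane,
   i.e. if "vid; vand imm" alone would produce the wanted vector.  */
static int
matches_vid_vand (const int64_t *want, size_t n, int64_t imm)
{
  for (size_t i = 0; i < n; ++i)
    if (want[i] != ((int64_t) i & imm))
      return 0;
  return 1;
}

/* Hypothetical helper: tries a small, arbitrary range of AND immediates.
   Lane 0 of vid is 0, and 0 & imm == 0 for any imm, so any vector whose
   first element is nonzero is rejected for every immediate.  */
static int
reachable_by_vid_vand (const int64_t *want, size_t n)
{
  assert (n > 0);
  for (int64_t imm = -64; imm < 64; ++imm)
    if (matches_vid_vand (want, n, imm))
      return 1;
  return 0;
}
```

Under these assumptions, { 0, 0, 2, 2, 4, 4 } is reachable (with imm == -2), while { 1, 1, 3, 3, 5, 5 } and the power-of-two-based { 2, 2, 4, 4, 6, 6 } are not, because lane 0 is forced to 0.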

Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-06-26 15:51
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization
Hi Juzhe,
 
> Currently, we are able to generate step vector with base == 0:
>  { 0, 0, 2, 2, 4, 4, ... }
> 
> ASM:
> 
> vid
> vand
> 
> However, we do wrong for step vector with base != 0:
> { 1, 1, 3, 3, 5, 5, ... }
> 
> Before this patch, such case will run fail.
> 
> After this patch, we are able to pass the testcase and generate the step 
> vector with asm:
> 
> vid
> vand
> vadd
 
Can't we use the first case as long as pow2_p (base) == true
and not just for base == 0?
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

2023-06-26 Thread Kito Cheng via Gcc-patches
Could you re-title this patch to something like "Support const
vector expansion with xxx pattern"?

On Mon, Jun 26, 2023 at 3:52 PM Robin Dapp via Gcc-patches
 wrote:
>
> Hi Juzhe,
>
> > Currently, we are able to generate step vector with base == 0:
> >  { 0, 0, 2, 2, 4, 4, ... }
> >
> > ASM:
> >
> > vid
> > vand
> >
> > However, we do wrong for step vector with base != 0:
> > { 1, 1, 3, 3, 5, 5, ... }
> >
> > Before this patch, such case will run fail.
> >
> > After this patch, we are able to pass the testcase and generate the step 
> > vector with asm:
> >
> > vid
> > vand
> > vadd
>
> Can't we use the first case as long as pow2_p (base) == true
> and not just for base == 0?
>
> Regards
>  Robin
>


Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

2023-06-26 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

> Currently, we are able to generate step vector with base == 0:
>  { 0, 0, 2, 2, 4, 4, ... }
> 
> ASM:
> 
> vid
> vand
> 
> However, we do wrong for step vector with base != 0:
> { 1, 1, 3, 3, 5, 5, ... }
> 
> Before this patch, such case will run fail.
> 
> After this patch, we are able to pass the testcase and generate the step 
> vector with asm:
> 
> vid
> vand
> vadd

Can't we use the first case as long as pow2_p (base) == true
and not just for base == 0?

Regards
 Robin



[PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization

2023-06-26 Thread Juzhe-Zhong
Currently, we are able to generate a step vector with base == 0:
 { 0, 0, 2, 2, 4, 4, ... }

ASM:

vid
vand

However, we generate wrong code for a step vector with base != 0:
{ 1, 1, 3, 3, 5, 5, ... }

Before this patch, such a case fails at run time.

After this patch, the testcase passes and we generate the step vector
with asm:

vid
vand
vadd
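
As a rough scalar model of the vid/vand/vadd sequence (a sketch only, not the code GCC emits), the following C function computes each lane the same way. The name stepped_vector and its parameters are hypothetical, and npatterns is assumed to be a power of two so that -npatterns acts as a mask clearing the low bits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the three-instruction sequence.
   Assumes npatterns is a power of two.  */
static void
stepped_vector (int64_t *out, size_t n, int64_t npatterns, int64_t base)
{
  assert (npatterns > 0 && (npatterns & (npatterns - 1)) == 0);
  for (size_t i = 0; i < n; ++i)
    {
      int64_t vid = (int64_t) i;      /* vid: lane index 0, 1, 2, ...  */
      int64_t tmp = vid & -npatterns; /* vand: { 0, 0, 2, 2, 4, 4, ... }  */
      out[i] = tmp + base;            /* vadd: a no-op when base == 0  */
    }
}
```

With npatterns == 2, base == 0 gives { 0, 0, 2, 2, 4, 4 } and base == 1 gives { 1, 1, 3, 3, 5, 5 }, which matches the two vectors described above.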

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_const_vector): Fix stepped vector 
with base != 0.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/partial/slp-17.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp-18.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp-19.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-17.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-18.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-19.c: New test.

---
 gcc/config/riscv/riscv-v.cc   | 14 +++-
 .../riscv/rvv/autovec/partial/slp-17.c| 34 
 .../riscv/rvv/autovec/partial/slp-18.c| 26 ++
 .../riscv/rvv/autovec/partial/slp-19.c| 26 ++
 .../riscv/rvv/autovec/partial/slp_run-17.c| 84 +++
 .../riscv/rvv/autovec/partial/slp_run-18.c| 69 +++
 .../riscv/rvv/autovec/partial/slp_run-19.c| 69 +++
 7 files changed, 320 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-18.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-19.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-17.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-18.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-19.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 5518394be1e..cd3422bf711 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -1258,7 +1258,6 @@ expand_const_vector (rtx target, rtx src)
}
  emit_move_insn (target, tmp);
}
-  return;
 }
   else if (CONST_VECTOR_STEPPED_P (src))
 {
@@ -1287,9 +1286,20 @@ expand_const_vector (rtx target, rtx src)
  */
  rtx imm
= gen_int_mode (-builder.npatterns (), builder.inner_mode ());
- rtx and_ops[] = {target, vid, imm};
+ rtx tmp = gen_reg_rtx (builder.mode ());
+ rtx and_ops[] = {tmp, vid, imm};
  icode = code_for_pred_scalar (AND, builder.mode ());
  emit_vlmax_insn (icode, RVV_BINOP, and_ops);
+ HOST_WIDE_INT init_val = INTVAL (builder.elt (0));
+ if (init_val == 0)
+   emit_move_insn (target, tmp);
+ else
+   {
+ rtx dup = gen_const_vector_dup (builder.mode (), init_val);
+ rtx add_ops[] = {target, tmp, dup};
+ icode = code_for_pred (PLUS, builder.mode ());
+ emit_vlmax_insn (icode, RVV_BINOP, add_ops);
+   }
}
  else
{
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c
new file mode 100644
index 000..2f2c3d11c2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-17.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fdump-tree-optimized-details" } */
+
+#include 
+
+void
+f (uint8_t *restrict a, uint8_t *restrict b,
+   uint8_t *restrict c, uint8_t *restrict d,
+   int n)
+{
+  for (int i = 0; i < n; ++i)
+{
+  a[i * 8] = c[i * 8] + d[i * 8];
+  a[i * 8 + 1] = c[i * 8] + d[i * 8 + 1];
+  a[i * 8 + 2] = c[i * 8 + 2] + d[i * 8 + 2];
+  a[i * 8 + 3] = c[i * 8 + 2] + d[i * 8 + 3];
+  a[i * 8 + 4] = c[i * 8 + 4] + d[i * 8 + 4];
+  a[i * 8 + 5] = c[i * 8 + 4] + d[i * 8 + 5];
+  a[i * 8 + 6] = c[i * 8 + 6] + d[i * 8 + 6];
+  a[i * 8 + 7] = c[i * 8 + 6] + d[i * 8 + 7];
+  b[i * 8] = c[i * 8 + 1] + d[i * 8];
+  b[i * 8 + 1] = c[i * 8 + 1] + d[i * 8 + 1];
+  b[i * 8 + 2] = c[i * 8 + 3] + d[i * 8 + 2];
+  b[i * 8 + 3] = c[i * 8 + 3] + d[i * 8 + 3];
+  b[i * 8 + 4] = c[i * 8 + 5] + d[i * 8 + 4];
+  b[i * 8 + 5] = c[i * 8 + 5] + d[i * 8 + 5];
+  b[i * 8 + 6] = c[i * 8 + 7] + d[i * 8 + 6];
+  b[i * 8 + 7] = c[i * 8 + 7] + d[i * 8 + 7];
+}
+}
+
+/* { dg-final { scan-tree-dump-times "\.VEC_PERM" 2 "optimized" } } */
+/* { dg-final { scan-assembler {\tvid\.v} } } */
+/* { dg-final { scan-assembler-not {\tvmul} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-18.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-18.c
new file mode 100644
index 

Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation

2023-06-12 Thread Jeff Law via Gcc-patches




On 6/12/23 08:54, Kito Cheng wrote:

I didn't take a close review yet, (and I suspect I can't find time
before I start my vacation :P), but I am thinking we may adding
selftests for expand_const_vector in *future*, again, not blocker for
this patch :)

I'll take this one.  Go enjoy your vacation!

jeff


Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation

2023-06-12 Thread Kito Cheng via Gcc-patches
I haven't done a close review yet (and I suspect I can't find time
before I start my vacation :P), but I am thinking we may add
selftests for expand_const_vector in the *future*; again, not a blocker
for this patch :)

On Mon, Jun 12, 2023 at 10:51 PM 钟居哲  wrote:
>
> No. Such pattern you pointed I already supported.
> The operation is very simple.
> Just use a single vmv.v.i but larger SEW is enough. No need vzext.
>
> 
> juzhe.zh...@rivai.ai
>
>
> From: Robin Dapp
> Date: 2023-06-12 22:43
> To: juzhe.zhong; gcc-patches
> CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
> Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with 
> decompress operation
> Hi Juzhe,
>
> seems a nice improvement, looks good to me.  While reading I was wondering
> if vzext could help synthesize some (zero-based) patterns as well
> (e.g. 0 3 0 3...).
> However the sequences I could come up with were not shorter than what we
> are already emitting, so probably not.
>
> Regards
> Robin
>


Re: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation

2023-06-12 Thread 钟居哲
No. The pattern you pointed out is already supported.
The operation is very simple:
a single vmv.v.i with a larger SEW is enough. There is no need for vzext.
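
The larger-SEW idea can be sketched in scalar C as follows. This is only an illustration under a little-endian element layout; merge_pair and splat16_as_bytes are hypothetical names, and the choice between the immediate and scalar forms of the move is not modeled here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: merge two adjacent 8-bit elements into one 16-bit
   element.  Little-endian layout: e0 lands in the low byte.  */
static uint16_t
merge_pair (uint8_t e0, uint8_t e1)
{
  return (uint16_t) (e0 | (e1 << 8));
}

/* Hypothetical helper: broadcast a 16-bit value and view the result as
   8-bit lanes (little-endian), as a splat at SEW=16 followed by a
   reinterpretation at SEW=8 would.  */
static void
splat16_as_bytes (uint8_t *out, size_t n_bytes, uint16_t value)
{
  assert (n_bytes % 2 == 0);
  for (size_t i = 0; i < n_bytes; i += 2)
    {
      out[i] = (uint8_t) (value & 0xff);
      out[i + 1] = (uint8_t) (value >> 8);
    }
}
```

For the 0 3 0 3... pattern, merge_pair (0, 3) yields 0x0300, and broadcasting that value gives the byte lanes 0, 3, 0, 3, ...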



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-06-12 22:43
To: juzhe.zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; palmer; palmer; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with 
decompress operation
Hi Juzhe,
 
seems a nice improvement, looks good to me.  While reading I was wondering
if vzext could help synthesize some (zero-based) patterns as well
(e.g. 0 3 0 3...).
However the sequences I could come up with were not shorter than what we
are already emitting, so probably not.
 
Regards
Robin
 


Re: [PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation

2023-06-12 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

seems a nice improvement, looks good to me.  While reading I was wondering
if vzext could help synthesize some (zero-based) patterns as well
(e.g. 0 3 0 3...).
However the sequences I could come up with were not shorter than what we
are already emitting, so probably not.

Regards
 Robin


[PATCH] RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation

2023-06-12 Thread juzhe . zhong
From: Juzhe-Zhong 

According to RVV ISA:
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc

We can enhance VLA SLP auto-vectorization with the decompress operation
(16.5.1. Synthesizing vdecompress).

Case 1 (nunits = POLY_INT_CST [16, 16]):
_48 = VEC_PERM_EXPR <_37, _35, { 0, POLY_INT_CST [16, 16], 1, POLY_INT_CST [17, 16], 2, POLY_INT_CST [18, 16], ... }>;
We can optimize such a VLA SLP permutation pattern into:
_48 = vdecompress (_37, _35, mask = { 0, 1, 0, 1, ... });

Case 2 (nunits = POLY_INT_CST [16, 16]):
_23 = VEC_PERM_EXPR <_46, _44, { POLY_INT_CST [1, 1], POLY_INT_CST [3, 3], POLY_INT_CST [2, 1], POLY_INT_CST [4, 3], POLY_INT_CST [3, 1], POLY_INT_CST [5, 3], ... }>;
We can optimize such a VLA SLP permutation pattern into:
_23 = vdecompress (slidedown(_46, 1/2 nunits), slidedown(_44, 1/2 nunits), mask = { 0, 1, 0, 1, ... });

For example:
void __attribute__ ((noinline, noclone))
vec_slp (uint64_t *restrict a, uint64_t b, uint64_t c, int n)
{
  for (int i = 0; i < n; ++i)
{
  a[i * 2] += b;
  a[i * 2 + 1] += c;
}
}

ASM:
...
vid.v   v0
vand.vi v0,v0,1
vmseq.vi v0,v0,1  ===> mask = { 0, 1, 0, 1, ... }
vdecompress:
viota.m v3,v0   
vrgather.vv v2,v1,v3,v0.t
...
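
For readers less familiar with the synthesized decompress, here is a scalar sketch in C of the viota.m + masked vrgather.vv pair, following the example in section 16.5.1 of the RVV spec. The function names and the fixed 64-lane limit are assumptions made for this illustration; this is not the code in riscv-v.cc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of viota.m: out[i] is the number of set mask bits strictly
   below lane i.  */
static void
viota (uint32_t *out, const uint8_t *mask, size_t n)
{
  uint32_t count = 0;
  for (size_t i = 0; i < n; ++i)
    {
      out[i] = count;
      count += mask[i] != 0;
    }
}

/* Model of a masked, mask-undisturbed vrgather.vv: active lanes read
   src[idx[i]] (0 if the index is out of range); inactive lanes keep
   their old value.  */
static void
vrgather_masked (char *dest, const char *src, size_t src_len,
                 const uint32_t *idx, const uint8_t *mask, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    if (mask[i])
      dest[i] = idx[i] < src_len ? src[idx[i]] : 0;
}

/* Hypothetical wrapper combining both steps.  The 64-lane limit is
   arbitrary.  */
static void
vdecompress (char *dest, const char *packed, size_t packed_len,
             const uint8_t *mask, size_t n)
{
  uint32_t idx[64];
  assert (n <= 64);
  viota (idx, mask, n);
  vrgather_masked (dest, packed, packed_len, idx, mask, n);
}

/* The { 0, 1, 0, 1, ... } mask, as built by vid, vand 1, vmseq 1.  */
static void
alternating_mask (uint8_t *mask, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    mask[i] = (i & 1) == 1;
}
```

With the spec's example (lane 0 first: destination w v u t s r q p, packed a b c d e, mask 1 0 1 1 1 0 0 1), the destination becomes a v b c d r q e, which is the spec's "e q r d c b v a" read from lane 7 down to lane 0.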

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): New function.
(expand_const_vector): Enhance repeating sequence mask.
(shuffle_decompress_patterns): New function.
(expand_vec_perm_const_1): Add decompress optimization.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/partial/slp-8.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp-9.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-8.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-9.c: New test.

---
 gcc/config/riscv/riscv-v.cc   | 146 +-
 .../riscv/rvv/autovec/partial/slp-8.c |  30 
 .../riscv/rvv/autovec/partial/slp-9.c |  31 
 .../riscv/rvv/autovec/partial/slp_run-8.c |  30 
 .../riscv/rvv/autovec/partial/slp_run-9.c |  30 
 5 files changed, 260 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index e1b85a5af91..3cea6b25261 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -836,6 +836,46 @@ emit_vlmax_masked_gather_mu_insn (rtx target, rtx op, rtx sel, rtx mask)
   emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MU, ops);
 }
 
+/* According to RVV ISA spec (16.5.1. Synthesizing vdecompress):
+   https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
+
+  There is no inverse vdecompress provided, as this operation can be readily
+  synthesized using iota and a masked vrgather:
+
+  Desired functionality of 'vdecompress'
+   7 6 5 4 3 2 1 0 # vid
+
+ e d c b a # packed vector of 5 elements
+   1 0 0 1 1 1 0 1 # mask vector of 8 elements
+   p q r s t u v w # destination register before vdecompress
+
+   e q r d c b v a # result of vdecompress
+   # v0 holds mask
+   # v1 holds packed data
+   # v11 holds input expanded vector and result
+   viota.m v10, v0 # Calc iota from mask in v0
+   vrgather.vv v11, v1, v10, v0.t  # Expand into destination
+ p q r s t u v w  # v11 destination register
+  e d c b a  # v1 source vector
+ 1 0 0 1 1 1 0 1  # v0 mask vector
+
+ 4 4 4 3 2 1 1 0  # v10 result of viota.m
+ e q r d c b v a  # v11 destination after vrgather using viota.m under mask
+*/
+static void
+emit_vlmax_decompress_insn (rtx target, rtx op, rtx mask)
+{
+  machine_mode data_mode = GET_MODE (target);
+  machine_mode sel_mode = related_int_vector_mode (data_mode).require ();
+  if (GET_MODE_INNER (data_mode) == QImode)
+    sel_mode = get_vector_mode (HImode, GET_MODE_NUNITS (data_mode)).require ();
+
+  rtx sel = gen_reg_rtx (sel_mode);
+  rtx iota_ops[] = {sel, mask};
+  emit_vlmax_insn (code_for_pred_iota (sel_mode), RVV_UNOP, iota_ops);
+  emit_vlmax_masked_gather_mu_insn (target, op, sel, mask);
+}
+
 /* Emit merge instruction.  */
 
 static machine_mode
@@ -934,14 +974,41 @@ expand_const_vector (rtx target, rtx src)
 {
   machine_mode mode = GET_MODE (target);
   scalar_mode elt_mode = GET_MODE_INNER (mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (mode);
+  unsigned int nelts_per_pattern = CONST_VECTOR_NELTS_PER_PATTERN (src);
+  unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
   if (GET_MODE_CLASS (mode) == MODE_VECTOR_BOOL)
 {
   rtx elt;
-  gcc_assert (
-   const_vec_duplicate_p