Re: [PATCH 2/2] RISC-V: Add cmpmemsi expansion

2024-05-15 Thread Christoph Müllner
On Thu, May 9, 2024 at 4:50 PM Jeff Law  wrote:
>
>
>
> On 5/7/24 11:52 PM, Christoph Müllner wrote:
> > GCC has a generic cmpmemsi expansion via the by-pieces framework,
> > which shows some room for target-specific optimizations.
> > E.g. for comparing two aligned memory blocks of 15 bytes
> > we get the following sequence:
> >
> > my_mem_cmp_aligned_15:
> >  li      a4,0
> >  j       .L2
> > .L8:
> >  bgeu    a4,a7,.L7
> > .L2:
> >  add     a2,a0,a4
> >  add     a3,a1,a4
> >  lbu     a5,0(a2)
> >  lbu     a6,0(a3)
> >  addi    a4,a4,1
> >  li      a7,15    // missed hoisting
> >  subw    a5,a5,a6
> >  andi    a5,a5,0xff // useless
> >  beq     a5,zero,.L8
> >  lbu     a0,0(a2) // loading again!
> >  lbu     a5,0(a3) // loading again!
> >  subw    a0,a0,a5
> >  ret
> > .L7:
> >  li      a0,0
> >  ret
> >
> > Diff first byte: 15 insns
> > Diff second byte: 25 insns
> > No diff: 25 insns
> >
> > Possible improvements:
> > * unroll the loop and use load-with-displacement to avoid offset increments
> > * load and compare multiple (aligned) bytes at once
> > * Use the bitmanip/strcmp result calculation (reverse words and
> >   synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence)
> >
> > When applying these improvements we get the following sequence:
> >
> > my_mem_cmp_aligned_15:
> >  ld      a5,0(a0)
> >  ld      a4,0(a1)
> >  bne     a5,a4,.L2
> >  ld      a5,8(a0)
> >  ld      a4,8(a1)
> >  slli    a5,a5,8
> >  slli    a4,a4,8
> >  bne     a5,a4,.L2
> >  li      a0,0
> > .L3:
> >  sext.w  a0,a0
> >  ret
> > .L2:
> >  rev8    a5,a5
> >  rev8    a4,a4
> >  sltu    a5,a5,a4
> >  neg     a5,a5
> >  ori     a0,a5,1
> >  j       .L3
> >
> > Diff first byte: 11 insns
> > Diff second byte: 16 insns
> > No diff: 11 insns
> >
> > This patch implements these improvements.
> >
> > The tests consist of an execution test (similar to
> > gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
> > that test the expansion conditions (known length and alignment).
> >
> > Similar to the cpymemsi expansion this patch does not introduce any
> > gating for the cmpmemsi expansion (on top of requiring the known length,
> > alignment and Zbb).
> >
> > Bootstrapped and SPEC CPU 2017 tested.
> >
> > gcc/ChangeLog:
> >
> >   * config/riscv/riscv-protos.h (riscv_expand_block_compare): New
> >   prototype.
> >   * config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper.
> >   (do_load_from_addr): Add support for HI and SI/64 modes.
> >   (emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
> >   (emit_memcmp_scalar_result_calculation): Likewise.
> >   (riscv_expand_block_compare_scalar): Likewise.
> >   (riscv_expand_block_compare): New RISC-V expander for memory compare.
> >   * config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/riscv/cmpmemsi-1.c: New test.
> >   * gcc.target/riscv/cmpmemsi-2.c: New test.
> >   * gcc.target/riscv/cmpmemsi-3.c: New test.
> >   * gcc.target/riscv/cmpmemsi.c: New test.
> >
> > Signed-off-by: Christoph Müllner 
> > ---
> >   gcc/config/riscv/riscv-protos.h             |   1 +
> >   gcc/config/riscv/riscv-string.cc            | 161
> >   gcc/config/riscv/riscv.md                   |  15 ++
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c |   6 +
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c |  42 +
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c |  43 ++
> >   gcc/testsuite/gcc.target/riscv/cmpmemsi.c   |  22 +++
> >   7 files changed, 290 insertions(+)
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c
> >   create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi.c
> >
> > diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> > index e5aebf3fc3d..30ffe30be1d 100644
> > --- a/gcc/config/riscv/riscv-protos.h
> > +++ b/gcc/config/riscv/riscv-protos.h
> > @@ -188,6 +188,7 @@ rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
> >   rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
> >
> >   /* Routines implemented in riscv-string.c.  */
> > +extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
> >   extern bool riscv_expand_block_move (rtx, rtx, rtx);
> >
> >   /* Information about one CPU we know about.  */
> > diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
> > index b09b51d7526..9d4dc0cb827 100644
> > --- a/gcc/config/riscv/riscv-string.cc
> > +++ b/gcc/config/riscv/riscv-string.cc
> > @@ -86,6 +86,7 @@ 

Re: [PATCH 2/2] RISC-V: Add cmpmemsi expansion

2024-05-09 Thread Jeff Law
On 5/7/24 11:52 PM, Christoph Müllner wrote:

GCC has a generic cmpmemsi expansion via the by-pieces framework,
which shows some room for target-specific optimizations.
E.g. for comparing two aligned memory blocks of 15 bytes
we get the following sequence:

my_mem_cmp_aligned_15:
 li      a4,0
 j       .L2
.L8:
 bgeu    a4,a7,.L7
.L2:
 add     a2,a0,a4
 add     a3,a1,a4
 lbu     a5,0(a2)
 lbu     a6,0(a3)
 addi    a4,a4,1
 li      a7,15    // missed hoisting
 subw    a5,a5,a6
 andi    a5,a5,0xff // useless
 beq     a5,zero,.L8
 lbu     a0,0(a2) // loading again!
 lbu     a5,0(a3) // loading again!
 subw    a0,a0,a5
 ret
.L7:
 li      a0,0
 ret

Diff first byte: 15 insns
Diff second byte: 25 insns
No diff: 25 insns

Possible improvements:
* unroll the loop and use load-with-displacement to avoid offset increments
* load and compare multiple (aligned) bytes at once
* Use the bitmanip/strcmp result calculation (reverse words and
   synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence)

When applying these improvements we get the following sequence:

my_mem_cmp_aligned_15:
 ld      a5,0(a0)
 ld      a4,0(a1)
 bne     a5,a4,.L2
 ld      a5,8(a0)
 ld      a4,8(a1)
 slli    a5,a5,8
 slli    a4,a4,8
 bne     a5,a4,.L2
 li      a0,0
.L3:
 sext.w  a0,a0
 ret
.L2:
 rev8    a5,a5
 rev8    a4,a4
 sltu    a5,a5,a4
 neg     a5,a5
 ori     a0,a5,1
 j       .L3

Diff first byte: 11 insns
Diff second byte: 16 insns
No diff: 11 insns

This patch implements these improvements.

The tests consist of an execution test (similar to
gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
that test the expansion conditions (known length and alignment).

Similar to the cpymemsi expansion this patch does not introduce any
gating for the cmpmemsi expansion (on top of requiring the known length,
alignment and Zbb).

Bootstrapped and SPEC CPU 2017 tested.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (riscv_expand_block_compare): New
prototype.
* config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper.
(do_load_from_addr): Add support for HI and SI/64 modes.
(emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
(emit_memcmp_scalar_result_calculation): Likewise.
(riscv_expand_block_compare_scalar): Likewise.
(riscv_expand_block_compare): New RISC-V expander for memory compare.
* config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/cmpmemsi-1.c: New test.
* gcc.target/riscv/cmpmemsi-2.c: New test.
* gcc.target/riscv/cmpmemsi-3.c: New test.
* gcc.target/riscv/cmpmemsi.c: New test.

Signed-off-by: Christoph Müllner 
---
  gcc/config/riscv/riscv-protos.h             |   1 +
  gcc/config/riscv/riscv-string.cc            | 161
  gcc/config/riscv/riscv.md                   |  15 ++
  gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c |   6 +
  gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c |  42 +
  gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c |  43 ++
  gcc/testsuite/gcc.target/riscv/cmpmemsi.c   |  22 +++
  7 files changed, 290 insertions(+)
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index e5aebf3fc3d..30ffe30be1d 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -188,6 +188,7 @@ rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
  rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
  
  /* Routines implemented in riscv-string.c.  */
+extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
  extern bool riscv_expand_block_move (rtx, rtx, rtx);
  
  /* Information about one CPU we know about.  */
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index b09b51d7526..9d4dc0cb827 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -86,6 +86,7 @@ GEN_EMIT_HELPER2(th_rev) /* do_th_rev2  */
  GEN_EMIT_HELPER2(th_tstnbz) /* do_th_tstnbz2  */
  GEN_EMIT_HELPER3(xor) /* do_xor3  */
  GEN_EMIT_HELPER2(zero_extendqi) /* do_zero_extendqi2  */
+GEN_EMIT_HELPER2(zero_extendhi) /* do_zero_extendhi2  */
  
  #undef GEN_EMIT_HELPER2
  #undef GEN_EMIT_HELPER3
@@ -109,6 +110,10 @@ do_load_from_addr (machine_mode mode, rtx dest, rtx addr_reg, rtx addr)
  
    if (mode == QImode)
      do_zero_extendqi2 (dest, mem);
+  else if (mode == HImode)
+    do_zero_extendhi2 (dest, mem);
+  

[PATCH 2/2] RISC-V: Add cmpmemsi expansion

2024-05-07 Thread Christoph Müllner
GCC has a generic cmpmemsi expansion via the by-pieces framework,
which shows some room for target-specific optimizations.
E.g. for comparing two aligned memory blocks of 15 bytes
we get the following sequence:

my_mem_cmp_aligned_15:
    li      a4,0
    j       .L2
.L8:
    bgeu    a4,a7,.L7
.L2:
    add     a2,a0,a4
    add     a3,a1,a4
    lbu     a5,0(a2)
    lbu     a6,0(a3)
    addi    a4,a4,1
    li      a7,15    // missed hoisting
    subw    a5,a5,a6
    andi    a5,a5,0xff // useless
    beq     a5,zero,.L8
    lbu     a0,0(a2) // loading again!
    lbu     a5,0(a3) // loading again!
    subw    a0,a0,a5
    ret
.L7:
    li      a0,0
    ret

Diff first byte: 15 insns
Diff second byte: 25 insns
No diff: 25 insns

Possible improvements:
* unroll the loop and use load-with-displacement to avoid offset increments
* load and compare multiple (aligned) bytes at once
* Use the bitmanip/strcmp result calculation (reverse words and
  synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence)
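
The result calculation in the last item can be modelled in C roughly as
follows. This is only an illustrative sketch, not code from the patch: the
helper name diff_result is made up, a little-endian host is assumed, and
__builtin_bswap64 stands in for rev8. The two words are assumed to differ
already; for equal words the helper would return 1.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the patch).  A and B are two 8-byte
   chunks loaded from memory on a little-endian machine, so the first
   byte in memory sits in the least significant bits.  */
static int
diff_result (uint64_t a, uint64_t b)
{
  /* rev8: move the first memory byte to the most significant position,
     so that an unsigned compare follows memcmp's byte order.  */
  uint64_t ra = __builtin_bswap64 (a);
  uint64_t rb = __builtin_bswap64 (b);

  /* sltu: 1 if A sorts before B, 0 otherwise.  */
  int64_t lt = (int64_t) (ra < rb);

  /* neg + ori: 1 becomes -1, 0 becomes 1, without a branch.  */
  return (int) (-lt | 1);
}
```

Comparing the raw little-endian words without the byte reversal would give
the wrong sign whenever the first differing byte is not the last byte of
the word, which is why the reversal is needed.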

When applying these improvements we get the following sequence:

my_mem_cmp_aligned_15:
    ld      a5,0(a0)
    ld      a4,0(a1)
    bne     a5,a4,.L2
    ld      a5,8(a0)
    ld      a4,8(a1)
    slli    a5,a5,8
    slli    a4,a4,8
    bne     a5,a4,.L2
    li      a0,0
.L3:
    sext.w  a0,a0
    ret
.L2:
    rev8    a5,a5
    rev8    a4,a4
    sltu    a5,a5,a4
    neg     a5,a5
    ori     a0,a5,1
    j       .L3

Diff first byte: 11 insns
Diff second byte: 16 insns
No diff: 11 insns
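
For readers who prefer C, the sequence above corresponds roughly to the
model below. It is a sketch under stated assumptions, not the expander
itself: the names load64 and memcmp15_model are hypothetical, a
little-endian host is assumed, and both buffers must have 16 readable
bytes because the second load covers bytes 8..15. The left shift by 8
drops byte 15 (the most significant byte of the second word), so only
the 15 requested bytes take part in the comparison.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 8-byte load without aliasing problems.  */
static uint64_t
load64 (const unsigned char *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof v);
  return v;
}

/* Hypothetical model of the 15-byte compare.  Requires 16 readable
   bytes at A and B and a little-endian host.  */
static int
memcmp15_model (const unsigned char *a, const unsigned char *b)
{
  /* Bytes 0..7.  */
  uint64_t x = load64 (a);
  uint64_t y = load64 (b);

  if (x == y)
    {
      /* Bytes 8..14: load 8 bytes and shift out byte 15 (slli by 8).  */
      x = load64 (a + 8) << 8;
      y = load64 (b + 8) << 8;
      if (x == y)
        return 0;
    }

  /* Words differ: byte-reverse and build -1 or 1 without a branch.  */
  x = __builtin_bswap64 (x);
  y = __builtin_bswap64 (y);
  return (int) (-(int64_t) (x < y) | 1);
}
```

Like the generated code, the model only returns -1, 0 or 1, which is
enough for memcmp's contract (only the sign of the result is specified).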

This patch implements these improvements.

The tests consist of an execution test (similar to
gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
that test the expansion conditions (known length and alignment).

Similar to the cpymemsi expansion this patch does not introduce any
gating for the cmpmemsi expansion (on top of requiring the known length,
alignment and Zbb).
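
As an illustration of the "known length and alignment" conditions, a
function of roughly the following shape gives the compiler both pieces
of information at compile time. This is a hypothetical example and not
copied from the new tests; the 8-byte alignment is an assumption made
for the sketch, and the exact conditions checked by the expander are
the ones in the patch itself.

```c
/* Hypothetical example, not taken from the patch's testsuite.  */
int
my_mem_cmp_aligned_15 (const char *s1, const char *s2)
{
  /* Tell the compiler the alignment (8 bytes is assumed here).  */
  s1 = __builtin_assume_aligned (s1, 8);
  s2 = __builtin_assume_aligned (s2, 8);

  /* The length is a compile-time constant.  */
  return __builtin_memcmp (s1, s2, 15);
}
```

Zbb additionally has to be enabled on the command line, for example
with something like -march=rv64gc_zbb.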

Bootstrapped and SPEC CPU 2017 tested.

gcc/ChangeLog:

* config/riscv/riscv-protos.h (riscv_expand_block_compare): New
prototype.
* config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper.
(do_load_from_addr): Add support for HI and SI/64 modes.
(emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
(emit_memcmp_scalar_result_calculation): Likewise.
(riscv_expand_block_compare_scalar): Likewise.
(riscv_expand_block_compare): New RISC-V expander for memory compare.
* config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/cmpmemsi-1.c: New test.
* gcc.target/riscv/cmpmemsi-2.c: New test.
* gcc.target/riscv/cmpmemsi-3.c: New test.
* gcc.target/riscv/cmpmemsi.c: New test.

Signed-off-by: Christoph Müllner 
---
 gcc/config/riscv/riscv-protos.h             |   1 +
 gcc/config/riscv/riscv-string.cc            | 161
 gcc/config/riscv/riscv.md                   |  15 ++
 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c |   6 +
 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c |  42 +
 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c |  43 ++
 gcc/testsuite/gcc.target/riscv/cmpmemsi.c   |  22 +++
 7 files changed, 290 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cmpmemsi.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index e5aebf3fc3d..30ffe30be1d 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -188,6 +188,7 @@ rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
 rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
 
 /* Routines implemented in riscv-string.c.  */
+extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
 extern bool riscv_expand_block_move (rtx, rtx, rtx);
 
 /* Information about one CPU we know about.  */
diff --git a/gcc/config/riscv/riscv-string.cc b/gcc/config/riscv/riscv-string.cc
index b09b51d7526..9d4dc0cb827 100644
--- a/gcc/config/riscv/riscv-string.cc
+++ b/gcc/config/riscv/riscv-string.cc
@@ -86,6 +86,7 @@ GEN_EMIT_HELPER2(th_rev) /* do_th_rev2  */
 GEN_EMIT_HELPER2(th_tstnbz) /* do_th_tstnbz2  */
 GEN_EMIT_HELPER3(xor) /* do_xor3  */
 GEN_EMIT_HELPER2(zero_extendqi) /* do_zero_extendqi2  */
+GEN_EMIT_HELPER2(zero_extendhi) /* do_zero_extendhi2  */
 
 #undef GEN_EMIT_HELPER2
 #undef GEN_EMIT_HELPER3
@@ -109,6 +110,10 @@ do_load_from_addr (machine_mode mode, rtx dest, rtx addr_reg, rtx addr)
 
   if (mode == QImode)
     do_zero_extendqi2 (dest, mem);
+  else if (mode == HImode)
+    do_zero_extendhi2 (dest, mem);
+  else if (mode == SImode && TARGET_64BIT)
+    emit_insn (gen_zero_extendsidi2 (dest, mem));
   else if (mode == Xmode)