Re: [committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P

2024-05-07 Thread Jeff Law




On 5/7/24 3:24 PM, Palmer Dabbelt wrote:


@@ -529,6 +536,7 @@ static const struct riscv_tune_param generic_ooo_tune_info 
= {
4,  /* fmv_cost */
false,  /* slow_unaligned_access */
false,  /* use_divmod_expansion */
+  false,   /* overlap_op_by_pieces */


IMO we should turn this on for the generic OOO tuning -- the benchmarks
say it's not faster for the T-Head OOO cores, but we were all so
surprised to find that I don't think we even fully trust the benchmarks.
I'd assume OOO cores are faster with the overlapping stores, so we
should just lean into it and let vendors say something if that's the
wrong assumption.
Several factors likely come into play (branch prediction, OOO 
properties, write combining, etc etc).


But sure, I don't think that'd be terribly controversial.  I can go 
ahead and make that change now given its triviality.


Jeff





Re: [committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P

2024-05-07 Thread Palmer Dabbelt
On Tue, 07 May 2024 14:18:36 PDT (-0700), Jeff Law wrote:
> This is almost exclusively work from the VRULL team.
>
> As we've discussed in the Tuesday meeting in the past, we'd like to have
> a knob in the tuning structure to indicate that overlapped stores during
> move_by_pieces expansion of memcpy & friends are acceptable.
>
> This patch adds the that capability in our tuning structure.  It's off
> for all the uarchs upstream, but we have been using it inside Ventana
> for our uarch with success.  So technically it's NFC upstream, but puts
> in the infrastructure multiple organizations likely need.
>
>
> Built and tested rv64gc.  Pushing to the trunk shortly.
> jeff
> commit 300393484dbfa9fd3891174ea47aa3fb41915abc
> Author: Christoph Müllner 
> Date:   Tue May 7 15:16:21 2024 -0600
>
> [committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P
>
> This is almost exclusively work from the VRULL team.
>
> As we've discussed in the Tuesday meeting in the past, we'd like to have 
> a knob
> in the tuning structure to indicate that overlapped stores during
> move_by_pieces expansion of memcpy & friends are acceptable.
>
> This patch adds the that capability in our tuning structure.  It's off 
> for all
> the uarchs upstream, but we have been using it inside Ventana for our 
> uarch
> with success.  So technically it's NFC upstream, but puts in the 
> infrastructure
> multiple organizations likely need.
>
> gcc/
>
> * config/riscv/riscv.cc (struct riscv_tune_param): Add new
> "overlap_op_by_pieces" field.
> (rocket_tune_info, sifive_7_tune_info): Set it.
> (sifive_p400_tune_info, sifive_p600_tune_info): Likewise.
> (thead_c906_tune_info, xiangshan_nanhu_tune_info): Likewise.
> (generic_ooo_tune_info, optimize_size_tune_info): Likewise.
> (riscv_overlap_op_by_pieces): New function.
> (TARGET_OVERLAP_OP_BY_PIECES_P): define.
>
> gcc/testsuite/
>
> * gcc.target/riscv/memcpy-nonoverlapping.c: New test.
> * gcc.target/riscv/memset-nonoverlapping.c: New test.
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index 545e68566dc..a9b57d41184 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -288,6 +288,7 @@ struct riscv_tune_param
>unsigned short fmv_cost;
>bool slow_unaligned_access;
>bool use_divmod_expansion;
> +  bool overlap_op_by_pieces;
>unsigned int fusible_ops;
>const struct cpu_vector_cost *vec_costs;
>  };
> @@ -427,6 +428,7 @@ static const struct riscv_tune_param rocket_tune_info = {
>8, /* fmv_cost */
>true,  /* 
> slow_unaligned_access */
>false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>RISCV_FUSE_NOTHING,   /* fusible_ops */
>NULL,  /* vector cost */
>  };
> @@ -444,6 +446,7 @@ static const struct riscv_tune_param sifive_7_tune_info = 
> {
>8, /* fmv_cost */
>true,  /* 
> slow_unaligned_access */
>false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>RISCV_FUSE_NOTHING,   /* fusible_ops */
>NULL,  /* vector cost */
>  };
> @@ -461,6 +464,7 @@ static const struct riscv_tune_param 
> sifive_p400_tune_info = {
>4, /* fmv_cost */
>true,  /* 
> slow_unaligned_access */
>false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>RISCV_FUSE_LUI_ADDI | RISCV_FUSE_AUIPC_ADDI,  /* fusible_ops */
>&generic_vector_cost,  /* vector cost */
>  };
> @@ -478,6 +482,7 @@ static const struct riscv_tune_param 
> sifive_p600_tune_info = {
>4, /* fmv_cost */
>true,  /* 
> slow_unaligned_access */
>false, /* use_divmod_expansion */
> +  false, /* overlap_op_by_pieces */
>RISC

[committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P

2024-05-07 Thread Jeff Law

This is almost exclusively work from the VRULL team.

As we've discussed in the Tuesday meeting in the past, we'd like to have 
a knob in the tuning structure to indicate that overlapped stores during 
move_by_pieces expansion of memcpy & friends are acceptable.


This patch adds the that capability in our tuning structure.  It's off 
for all the uarchs upstream, but we have been using it inside Ventana 
for our uarch with success.  So technically it's NFC upstream, but puts 
in the infrastructure multiple organizations likely need.



Built and tested rv64gc.  Pushing to the trunk shortly.
jeffcommit 300393484dbfa9fd3891174ea47aa3fb41915abc
Author: Christoph Müllner 
Date:   Tue May 7 15:16:21 2024 -0600

    [committed] [RISC-V] Allow uarchs to set TARGET_OVERLAP_OP_BY_PIECES_P

This is almost exclusively work from the VRULL team.

As we've discussed in the Tuesday meeting in the past, we'd like to have a 
knob
in the tuning structure to indicate that overlapped stores during
move_by_pieces expansion of memcpy & friends are acceptable.

This patch adds the that capability in our tuning structure.  It's off for 
all
the uarchs upstream, but we have been using it inside Ventana for our uarch
with success.  So technically it's NFC upstream, but puts in the 
infrastructure
multiple organizations likely need.

gcc/

* config/riscv/riscv.cc (struct riscv_tune_param): Add new
"overlap_op_by_pieces" field.
(rocket_tune_info, sifive_7_tune_info): Set it.
(sifive_p400_tune_info, sifive_p600_tune_info): Likewise.
(thead_c906_tune_info, xiangshan_nanhu_tune_info): Likewise.
(generic_ooo_tune_info, optimize_size_tune_info): Likewise.
(riscv_overlap_op_by_pieces): New function.
(TARGET_OVERLAP_OP_BY_PIECES_P): define.

gcc/testsuite/

* gcc.target/riscv/memcpy-nonoverlapping.c: New test.
* gcc.target/riscv/memset-nonoverlapping.c: New test.

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 545e68566dc..a9b57d41184 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -288,6 +288,7 @@ struct riscv_tune_param
   unsigned short fmv_cost;
   bool slow_unaligned_access;
   bool use_divmod_expansion;
+  bool overlap_op_by_pieces;
   unsigned int fusible_ops;
   const struct cpu_vector_cost *vec_costs;
 };
@@ -427,6 +428,7 @@ static const struct riscv_tune_param rocket_tune_info = {
   8,   /* fmv_cost */
   true,/* 
slow_unaligned_access */
   false,   /* use_divmod_expansion */
+  false,   /* overlap_op_by_pieces */
   RISCV_FUSE_NOTHING,   /* fusible_ops */
   NULL,/* vector cost */
 };
@@ -444,6 +446,7 @@ static const struct riscv_tune_param sifive_7_tune_info = {
   8,   /* fmv_cost */
   true,/* 
slow_unaligned_access */
   false,   /* use_divmod_expansion */
+  false,   /* overlap_op_by_pieces */
   RISCV_FUSE_NOTHING,   /* fusible_ops */
   NULL,/* vector cost */
 };
@@ -461,6 +464,7 @@ static const struct riscv_tune_param sifive_p400_tune_info 
= {
   4,   /* fmv_cost */
   true,/* 
slow_unaligned_access */
   false,   /* use_divmod_expansion */
+  false,   /* overlap_op_by_pieces */
   RISCV_FUSE_LUI_ADDI | RISCV_FUSE_AUIPC_ADDI,  /* fusible_ops */
   &generic_vector_cost,/* vector cost */
 };
@@ -478,6 +482,7 @@ static const struct riscv_tune_param sifive_p600_tune_info 
= {
   4,   /* fmv_cost */
   true,/* 
slow_unaligned_access */
   false,   /* use_divmod_expansion */
+  false,   /* overlap_op_by_pieces */
   RISCV_FUSE_LUI_ADDI | RISCV_FUSE_AUIPC_ADDI,  /* fusible_ops */
   &generic_vector_cost,/* vector cost */
 };
@@ -495,6 +500,7 @@ static const struct riscv_tune_param thead_c906_tune_info = 
{
   8,   /* fmv_cost */
   false,/* slow_unaligned_access */
   false,   /* use_divmod_expansion */
+  false,   /* overlap_op_by_pieces */
   RISCV_FUSE_NOTHING,   /* fusible_ops */
   NULL,