Hi Wang Yaduo,
> Add per-type reduction costs (i8/i16/i32/i64/f16/f32/f64) to the RISC-V
> vector cost model, distinguishing between ordered (fold-left) and
> unordered (tree) floating-point reductions. When a reduction is
> detected, the per-type cost replaces the default vec_to_scalar_cost,
> similar to AArch64. This causes _Float16 n=4 ordered reductions to no
> longer be vectorized in VLS mode due to the higher cost.
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-protos.h (common_vector_cost): Add per-type
> reduction cost fields: reduc_i8_cost, reduc_i16_cost,
> reduc_i32_cost, reduc_i64_cost, reduc_f16_cost, reduc_f32_cost,
> reduc_f64_cost for unordered reductions, and reduc_f16_ordered_cost,
> reduc_f32_ordered_cost, reduc_f64_ordered_cost for ordered
> (fold-left) reductions.
> * config/riscv/riscv.cc (rvv_vla_vector_cost): Initialize reduction
> cost fields with default values.
> (rvv_vls_vector_cost): Likewise.
> * config/riscv/riscv-vector-costs.cc (costs::adjust_stmt_cost): Add
> reduction detection in the vec_to_scalar case. When a reduction is
> detected, replace the default vec_to_scalar_cost with the
> appropriate per-type reduction cost based on element mode and
> reduction kind (ordered vs unordered).
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/reduc/reduc_cost-1.c: New test for
> VLA unordered reduction costs.
> * gcc.target/riscv/rvv/autovec/reduc/reduc_cost-2.c: New test for
> VLA ordered reduction costs.
> * gcc.target/riscv/rvv/autovec/vls/reduc_cost-1.c: New test for
> VLS reduction costs.
> * gcc.target/riscv/rvv/autovec/vls/reduc-19.c: Update expected
> vfredosum count from 9 to 8.
> * gcc.target/riscv/rvv/autovec/vls/wred-3.c: Update expected
> vfwredosum count from 17 to 16.
>
> Signed-off-by: Wang Yaduo <[email protected]>
> ---
> gcc/config/riscv/riscv-protos.h | 20 +++++-
> gcc/config/riscv/riscv-vector-costs.cc | 68 ++++++++++++++++++-
> gcc/config/riscv/riscv.cc | 20 ++++++
> .../riscv/rvv/autovec/reduc/reduc_cost-1.c | 34 ++++++++++
> .../riscv/rvv/autovec/reduc/reduc_cost-2.c | 34 ++++++++++
> .../riscv/rvv/autovec/vls/reduc-19.c | 4 +-
> .../riscv/rvv/autovec/vls/reduc_cost-1.c | 41 +++++++++++
> .../gcc.target/riscv/rvv/autovec/vls/wred-3.c | 4 +-
> 8 files changed, 219 insertions(+), 6 deletions(-)
> create mode 100644
> gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_cost-1.c
> create mode 100644
> gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_cost-2.c
> create mode 100644
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/reduc_cost-1.c
>
> diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> index dd029c704..5da5a6a21 100644
> --- a/gcc/config/riscv/riscv-protos.h
> +++ b/gcc/config/riscv/riscv-protos.h
> @@ -279,6 +279,24 @@ struct common_vector_cost
>
> /* Cost of an unaligned vector store. */
> const int unalign_store_cost;
> +
> + /* Cost of vector reduction operations (unordered / tree reduction).
> + Indexed by element type. */
> + const int reduc_i8_cost;
> + const int reduc_i16_cost;
> + const int reduc_i32_cost;
> + const int reduc_i64_cost;
> + const int reduc_f16_cost;
> + const int reduc_f32_cost;
> + const int reduc_f64_cost;
Do we need all of those? I'm not sure, but since these are supposed to be
implemented as tree reductions, the latency should not vary much with the
element size.
> +
> + /* Cost of ordered (fold-left / strict) floating-point reductions.
> + These are significantly more expensive than unordered (tree) reductions
> + because RVV ordered reduction instructions (e.g. vfredosum) process
> + elements sequentially. */
> + const int reduc_f16_ordered_cost;
> + const int reduc_f32_ordered_cost;
> + const int reduc_f64_ordered_cost;
Same here. I'm not entirely sure, and uarchs might vary (wildly), but generally
these should scale linearly with the number of elements, so perhaps one factor
is enough? Open for debate, though.
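
To illustrate what I mean by "one factor", here is a rough sketch. The struct,
field, and function names are made up for this mail and are not taken from the
patch; it only assumes that ordered reductions cost a fixed amount per element.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for common_vector_cost.  The field
   and function names are illustrative only, not from the patch.  */
struct reduc_cost_sketch
{
  int reduc_cost;             /* Unordered (tree) reduction, any type.  */
  int reduc_ordered_elt_cost; /* Ordered reduction, cost per element.  */
};

/* Ordered reductions handle elements one after another, so the total
   grows linearly with the element count.  */
static int
ordered_reduc_cost (const struct reduc_cost_sketch *costs, int nunits)
{
  return costs->reduc_ordered_elt_cost * nunits;
}
```

With something like this, the ordered cost follows from the element count of
the vector type rather than needing one field per element mode.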
> /* scalable vectorization (VLA) specific cost. */
> @@ -289,7 +307,7 @@ struct scalable_vector_cost : common_vector_cost
> {}
>
> /* TODO: We will need more other kinds of vector cost for VLA.
> - E.g. fold_left reduction cost, lanes load/store cost, ..., etc. */
> + E.g. lanes load/store cost, ..., etc. */
> };
We have lane cost, so this comment can be removed.
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -415,6 +415,16 @@ static const common_vector_cost rvv_vls_vector_cost = {
> 1, /* align_store_cost */
> 2, /* unalign_load_cost */
> 2, /* unalign_store_cost */
> + 2, /* reduc_i8_cost */
> + 2, /* reduc_i16_cost */
> + 2, /* reduc_i32_cost */
> + 2, /* reduc_i64_cost */
> + 2, /* reduc_f16_cost */
> + 2, /* reduc_f32_cost */
> + 2, /* reduc_f64_cost */
> + 6, /* reduc_f16_ordered_cost */
> + 4, /* reduc_f32_ordered_cost */
> + 2, /* reduc_f64_ordered_cost */
> };
Any reason why the scaling is +2 rather than *2? I'd have expected twice
the work (and thus, latency) for 2x elements. Also, even 2-6 seems rather low
compared to regular reductions. Looking at the published Ascalon X numbers,
it's more like 5, 10, 20.
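
As a sketch of the *2 scaling, assuming a fixed vector size so that halving
the element width doubles the element count (the helper name is hypothetical
and the base value is just a parameter, not a measured number):

```c
#include <assert.h>

/* Illustrative helper, not part of the patch: derive an ordered-reduction
   cost from the f64 cost, assuming a fixed vector size so that halving
   the element width doubles the element count.  */
static int
ordered_cost_for_width (int f64_cost, int elt_bits)
{
  return f64_cost * (64 / elt_bits);
}
```

With a base of 2 this gives 2, 4, 8 for f64, f32, f16 instead of 2, 4, 6.
With a base of 5 it gives 5, 10, 20, if one reads those figures as f64, f32,
f16 in that order.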
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_cost-1.c
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_cost-1.c
Distinct cost-model tests are better placed in the costmodel subdirectory.
> #include "wred-2.c"
> -/* { dg-final { scan-assembler-times {vfwredosum\.vs} 17 } } */
> +/* The _Float16->float n=4 case is not vectorized because the ordered
> + reduction cost makes it unprofitable for small trip counts. */
> +/* { dg-final { scan-assembler-times {vfwredosum\.vs} 16 } } */
This is supposed to test functionality, so I'd rather keep the expectation and
add -fno-vect-cost-model.
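
Roughly like the following (untested). The test's existing dg-options line is
not visible in this hunk, so I'm assuming that appending the flag via
dg-additional-options is enough:

```c
/* Assumption: the flag is appended to whatever options the test already
   uses; the original expectation of 17 is kept.  */
/* { dg-additional-options "-fno-vect-cost-model" } */

#include "wred-2.c"

/* { dg-final { scan-assembler-times {vfwredosum\.vs} 17 } } */
```

The same would apply to reduc-19.c with its original vfredosum count of 9.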
--
Regards
Robin