Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Jeff Law




On 3/25/24 2:57 PM, Palmer Dabbelt wrote:

On Mon, 25 Mar 2024 13:49:18 PDT (-0700), jeffreya...@gmail.com wrote:



On 3/25/24 2:31 PM, Palmer Dabbelt wrote:

On Mon, 25 Mar 2024 13:27:34 PDT (-0700), Jeff Law wrote:


I'd doubt it's worth the complexity.  Picking some reasonable value gets
you the vast majority of the benefit.  Something like
COSTS_N_INSNS(6) is enough to get CSE to trigger.  So what's left is a
reasonable cost, particularly for the division-by-constant case where we
need a ceiling for synth_mult.


Ya, makes sense.  I noticed our multi-word multiply costs are a bit odd
too (they really only work for 64-bit mul on 32-bit targets), but that's
probably not worth worrying about either.

We do have some changes locally that adjust various costs.  One of which is
highpart multiply.  One of the many things to start working through once
gcc-15 opens for development.  Hence my desire to help keep gcc-14 on
track for an on-time release.


Cool.  LMK if there's anything we can do to help on that front.
I think the RISC-V space is in pretty good shape.  Most of the issues
left are either generic or hitting other targets.  While the number of
P1s has been flat or rising, that's more an artifact of the ongoing bug
triage/reprioritization process.  I can only speak for
myself, but the progress in nailing down the slew of bugs thrown into
the P1 bucket over the last few weeks has been great IMHO.


Jeff


Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Palmer Dabbelt

On Mon, 25 Mar 2024 13:49:18 PDT (-0700), jeffreya...@gmail.com wrote:



On 3/25/24 2:31 PM, Palmer Dabbelt wrote:

On Mon, 25 Mar 2024 13:27:34 PDT (-0700), Jeff Law wrote:



I'd doubt it's worth the complexity.  Picking some reasonable value gets
you the vast majority of the benefit.   Something like
COSTS_N_INSNS(6) is enough to get CSE to trigger.  So what's left is a
reasonable cost, particularly for the division-by-constant case where we
need a ceiling for synth_mult.


Ya, makes sense.  I noticed our multi-word multiply costs are a bit odd
too (they really only work for 64-bit mul on 32-bit targets), but that's
probably not worth worrying about either.

We do have some changes locally that adjust various costs.  One of which is
highpart multiply.  One of the many things to start working through once
gcc-15 opens for development.  Hence my desire to help keep gcc-14 on
track for an on-time release.


Cool.  LMK if there's anything we can do to help on that front.



Jeff


Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Jeff Law




On 3/25/24 2:31 PM, Palmer Dabbelt wrote:

On Mon, 25 Mar 2024 13:27:34 PDT (-0700), Jeff Law wrote:



I'd doubt it's worth the complexity.  Picking some reasonable value gets
you the vast majority of the benefit.   Something like
COSTS_N_INSNS(6) is enough to get CSE to trigger.  So what's left is a
reasonable cost, particularly for the division-by-constant case where we
need a ceiling for synth_mult.


Ya, makes sense.  I noticed our multi-word multiply costs are a bit odd 
too (they really only work for 64-bit mul on 32-bit targets), but that's 
probably not worth worrying about either.
We do have some changes locally that adjust various costs.  One of which is 
highpart multiply.  One of the many things to start working through once 
gcc-15 opens for development.  Hence my desire to help keep gcc-14 on 
track for an on-time release.


Jeff


Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Palmer Dabbelt

On Mon, 25 Mar 2024 13:27:34 PDT (-0700), Jeff Law wrote:



On 3/25/24 2:13 PM, Palmer Dabbelt wrote:

On Mon, 25 Mar 2024 12:59:14 PDT (-0700), Jeff Law wrote:



On 3/25/24 1:48 PM, Xi Ruoyao wrote:

On Mon, 2024-03-18 at 20:54 -0600, Jeff Law wrote:

+/* Costs to use when optimizing for xiangshan nanhu.  */
+static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},    /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},    /* fp_mul */
+  {COSTS_N_INSNS (10), COSTS_N_INSNS (20)},    /* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},    /* int_mul */
+  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},    /* int_div */
+  6,    /* issue_rate */
+  3,    /* branch_cost */
+  3,    /* memory_cost */
+  3,    /* fmv_cost */
+  true,    /* slow_unaligned_access */
+  false,    /* use_divmod_expansion */
+  RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH,  /* fusible_ops */
+  NULL,    /* vector cost */



Is your integer division really that fast?  The table above essentially
says that your CPU can do integer division in 6 cycles.


Hmm, I just noticed I've coded some even smaller values for LoongArch CPUs,
so forgive me for "hijacking" this thread...

The problem is that integer division may take a different number of cycles
for different inputs: on LoongArch LA664 I've observed 5 cycles for some
inputs and 39 cycles for others.

So should we use the minimal value, the maximum value, or something
in-between for TARGET_RTX_COSTS and pipeline descriptions?

Yea, early outs are relatively common in the actual hardware
implementation.

The biggest reason to refine the cost of a division is so that we've got
a reasonably accurate cost for division by a constant -- which can often
be done with multiplication by reciprocal sequence.  The multiplication
by reciprocal sequence will use mult, add, sub, shadd insns and you need
a reasonable cost model for those so you can compare against the cost of
a hardware division.

So to answer your question: choose something sensible; you probably
don't want the fastest case and you may not want the slowest case.


Maybe we should have some sort of per-bit-set cost hook for mul/div?
Without that we're kind of just guessing at whether the implementation
has early outs based on heuristics used to implicitly generate the cost
models.

Not sure that's really worth the complexity, though...

I'd doubt it's worth the complexity.  Picking some reasonable value gets
you the vast majority of the benefit.   Something like
COSTS_N_INSNS(6) is enough to get CSE to trigger.  So what's left is a
reasonable cost, particularly for the division-by-constant case where we
need a ceiling for synth_mult.


Ya, makes sense.  I noticed our multi-word multiply costs are a bit odd 
too (they really only work for 64-bit mul on 32-bit targets), but that's 
probably not worth worrying about either.




Jeff


Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Jeff Law




On 3/25/24 2:13 PM, Palmer Dabbelt wrote:

On Mon, 25 Mar 2024 12:59:14 PDT (-0700), Jeff Law wrote:



On 3/25/24 1:48 PM, Xi Ruoyao wrote:

On Mon, 2024-03-18 at 20:54 -0600, Jeff Law wrote:

+/* Costs to use when optimizing for xiangshan nanhu.  */
+static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},    /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},    /* fp_mul */
+  {COSTS_N_INSNS (10), COSTS_N_INSNS (20)},    /* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},    /* int_mul */
+  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},    /* int_div */
+  6,    /* issue_rate */
+  3,    /* branch_cost */
+  3,    /* memory_cost */
+  3,    /* fmv_cost */
+  true,    /* slow_unaligned_access */
+  false,    /* use_divmod_expansion */
+  RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH,  /* fusible_ops */
+  NULL,    /* vector cost */



Is your integer division really that fast?  The table above essentially
says that your CPU can do integer division in 6 cycles.


Hmm, I just noticed I've coded some even smaller values for LoongArch CPUs,
so forgive me for "hijacking" this thread...

The problem is that integer division may take a different number of cycles
for different inputs: on LoongArch LA664 I've observed 5 cycles for some
inputs and 39 cycles for others.

So should we use the minimal value, the maximum value, or something
in-between for TARGET_RTX_COSTS and pipeline descriptions?

Yea, early outs are relatively common in the actual hardware
implementation.

The biggest reason to refine the cost of a division is so that we've got
a reasonably accurate cost for division by a constant -- which can often
be done with multiplication by reciprocal sequence.  The multiplication
by reciprocal sequence will use mult, add, sub, shadd insns and you need
a reasonable cost model for those so you can compare against the cost of
a hardware division.

So to answer your question: choose something sensible; you probably
don't want the fastest case and you may not want the slowest case.


Maybe we should have some sort of per-bit-set cost hook for mul/div? 
Without that we're kind of just guessing at whether the implementation
has early outs based on heuristics used to implicitly generate the cost
models.


Not sure that's really worth the complexity, though...
I'd doubt it's worth the complexity.  Picking some reasonable value gets 
you the vast majority of the benefit.   Something like
COSTS_N_INSNS(6) is enough to get CSE to trigger.  So what's left is a 
reasonable cost, particularly for the division-by-constant case where we 
need a ceiling for synth_mult.
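As a concrete sketch of the units in play: the COSTS_N_INSNS definition below matches GCC's rtl.h, but the comparison helper and its insn counts are hypothetical, just to illustrate the trade-off synth_mult is making between a hardware divide and a synthesized multiply/shift/add sequence.

```c
#include <assert.h>

/* GCC's cost unit: COSTS_N_INSNS (N) scales N "simple insns" into the
   unit returned by TARGET_RTX_COSTS (definition taken from rtl.h).  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical synth_mult-style decision: replace a divide by a
   constant with a multiply/shift/add sequence only when the sequence
   is cheaper than the modeled hardware divide.  The counts here are
   illustrative, not from any real tuning table.  */
static int
use_synth_sequence (int div_cost, int seq_insns)
{
  return COSTS_N_INSNS (seq_insns) < div_cost;
}
```

With a division cost of COSTS_N_INSNS(6), a 4-insn reciprocal sequence wins, while a 7-insn sequence loses, which is why the div cost acts as a ceiling for synth_mult.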


Jeff


Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Palmer Dabbelt

On Mon, 25 Mar 2024 12:59:14 PDT (-0700), Jeff Law wrote:



On 3/25/24 1:48 PM, Xi Ruoyao wrote:

On Mon, 2024-03-18 at 20:54 -0600, Jeff Law wrote:

+/* Costs to use when optimizing for xiangshan nanhu.  */
+static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},  /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},  /* fp_mul */
+  {COSTS_N_INSNS (10), COSTS_N_INSNS (20)},/* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},  /* int_mul */
+  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},  /* int_div */
+  6,   /* issue_rate */
+  3,   /* branch_cost */
+  3,   /* memory_cost */
+  3,   /* fmv_cost */
+  true,   /* slow_unaligned_access */
+  false,   /* use_divmod_expansion */
+  RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH,  /* fusible_ops */
+  NULL,/* vector cost */



Is your integer division really that fast?  The table above essentially
says that your CPU can do integer division in 6 cycles.


Hmm, I just noticed I've coded some even smaller values for LoongArch CPUs,
so forgive me for "hijacking" this thread...

The problem is that integer division may take a different number of cycles
for different inputs: on LoongArch LA664 I've observed 5 cycles for some
inputs and 39 cycles for others.

So should we use the minimal value, the maximum value, or something
in-between for TARGET_RTX_COSTS and pipeline descriptions?

Yea, early outs are relatively common in the actual hardware
implementation.

The biggest reason to refine the cost of a division is so that we've got
a reasonably accurate cost for division by a constant -- which can often
be done with multiplication by reciprocal sequence.  The multiplication
by reciprocal sequence will use mult, add, sub, shadd insns and you need
a reasonable cost model for those so you can compare against the cost of
a hardware division.

So to answer your question: choose something sensible; you probably
don't want the fastest case and you may not want the slowest case.


Maybe we should have some sort of per-bit-set cost hook for mul/div?  
Without that we're kind of just guessing at whether the implementation
has early outs based on heuristics used to implicitly generate the cost
models.


Not sure that's really worth the complexity, though...
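For what it's worth, a per-operand latency model might look something like the back-of-the-envelope sketch below; the base latency and radix are invented for illustration and would need measurement on real hardware (e.g. the 5..39-cycle LA664 range quoted above).

```c
#include <stdint.h>

/* Hypothetical early-out divider model: latency grows with the number
   of significant quotient bits, assuming the divider retires a fixed
   number of bits per cycle.  Base latency of 3 cycles and radix-4
   (2 bits/cycle) are made-up parameters for illustration.  */
static int
div_latency (uint64_t dividend, uint64_t divisor)
{
  int bits = 0;
  uint64_t q = divisor ? dividend / divisor : 0;
  while (q)
    {
      bits++;
      q >>= 1;
    }
  return 3 + (bits + 1) / 2;
}
```

A static cost hook would still have to pick one number, since the compiler rarely knows operand magnitudes, which is Jeff's point below about the complexity not paying for itself.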


Jeff


Re: TARGET_RTX_COSTS and pipeline latency vs. variable-latency instructions (was Re: [PATCH] RISC-V: Add XiangShan Nanhu microarchitecture.)

2024-03-25 Thread Jeff Law




On 3/25/24 1:48 PM, Xi Ruoyao wrote:

On Mon, 2024-03-18 at 20:54 -0600, Jeff Law wrote:

+/* Costs to use when optimizing for xiangshan nanhu.  */
+static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},  /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},  /* fp_mul */
+  {COSTS_N_INSNS (10), COSTS_N_INSNS (20)},/* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},  /* int_mul */
+  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},  /* int_div */
+  6,   /* issue_rate */
+  3,   /* branch_cost */
+  3,   /* memory_cost */
+  3,   /* fmv_cost */
+  true,   /* slow_unaligned_access */
+  false,   /* use_divmod_expansion */
+  RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH,  /* fusible_ops */
+  NULL,/* vector cost */



Is your integer division really that fast?  The table above essentially
says that your CPU can do integer division in 6 cycles.


Hmm, I just noticed I've coded some even smaller values for LoongArch CPUs,
so forgive me for "hijacking" this thread...

The problem is that integer division may take a different number of cycles
for different inputs: on LoongArch LA664 I've observed 5 cycles for some
inputs and 39 cycles for others.

So should we use the minimal value, the maximum value, or something
in-between for TARGET_RTX_COSTS and pipeline descriptions?
Yea, early outs are relatively common in the actual hardware 
implementation.


The biggest reason to refine the cost of a division is so that we've got 
a reasonably accurate cost for division by a constant -- which can often 
be done with multiplication by reciprocal sequence.  The multiplication 
by reciprocal sequence will use mult, add, sub, shadd insns and you need 
a reasonable cost model for those so you can compare against the cost of 
a hardware division.


So to answer your question: choose something sensible; you probably
don't want the fastest case and you may not want the slowest case.
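The multiply-by-reciprocal sequence mentioned above can be shown concretely. This is a hand-written example for unsigned division by the constant 3 using the magic number ceil(2^33 / 3) = 0xAAAAAAAB; it is not the exact sequence GCC emits, just the underlying identity whose mul/shift cost gets compared against the hardware divide.

```c
#include <stdint.h>

/* Unsigned division by 3 via multiply-by-reciprocal: a 32x32->64
   multiply plus a shift replaces the hardware divide.  The magic
   constant is ceil(2^33 / 3), correct for all 32-bit inputs.  */
static uint32_t
div3 (uint32_t n)
{
  return (uint32_t) (((uint64_t) n * 0xAAAAAAABu) >> 33);
}
```

The sequence costs one multiply and one shift, so whether it beats a real `divu` depends entirely on how the tuning table prices int_div relative to int_mul.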


Jeff