On Mon, 25 Mar 2024 12:59:14 PDT (-0700), Jeff Law wrote:


On 3/25/24 1:48 PM, Xi Ruoyao wrote:
On Mon, 2024-03-18 at 20:54 -0600, Jeff Law wrote:
+/* Costs to use when optimizing for xiangshan nanhu.  */
+static const struct riscv_tune_param xiangshan_nanhu_tune_info = {
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* fp_mul */
+  {COSTS_N_INSNS (10), COSTS_N_INSNS (20)},    /* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* int_mul */
+  {COSTS_N_INSNS (6), COSTS_N_INSNS (6)},      /* int_div */
+  6,                                           /* issue_rate */
+  3,                                           /* branch_cost */
+  3,                                           /* memory_cost */
+  3,                                           /* fmv_cost */
+  true,                                       /* slow_unaligned_access */
+  false,                                       /* use_divmod_expansion */
+  RISCV_FUSE_ZEXTW | RISCV_FUSE_ZEXTH,          /* fusible_ops */
+  NULL,                                                /* vector cost */

Is your integer division really that fast?  The table above essentially
says that your CPU can do integer division in 6 cycles.

Hmm, I've just seen that I coded some even smaller values for LoongArch
CPUs, so forgive me for "hijacking" this thread...

The problem is that integer division may take a different number of
cycles for different inputs: on the LoongArch LA664 I've observed 5
cycles for some inputs and 39 cycles for others.

So should we use the minimal value, the maximum value, or something in
between for TARGET_RTX_COSTS and the pipeline descriptions?
Yea, early outs are relatively common in the actual hardware
implementation.

The biggest reason to refine the cost of a division is so that we've got
a reasonably accurate cost for division by a constant -- which can often
be done with multiplication by reciprocal sequence.  The multiplication
by reciprocal sequence will use mult, add, sub, shadd insns and you need
a reasonable cost model for those so you can compare against the cost of
a hardware division.

So to answer your question: choose something sensible.  You probably
don't want the fastest case, and you may not want the slowest case either.

Maybe we should have some sort of per-bit-set cost hook for mul/div?  Without that we're kind of just guessing at whether the implementation has early outs, based on the heuristics used to implicitly generate the cost models.

Not sure that's really worth the complexity, though...
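For concreteness, such a hook might look roughly like the sketch below.
Everything here is hypothetical: `early_out_div_cost`, `DIV_COST_BASE`
and `DIV_COST_PER_BIT` are invented names, not existing GCC interfaces,
and at compile time a hook like this could only apply when an operand
is a known constant.

```c
#include <stdint.h>

/* Made-up tuning knobs for a hypothetical early-out divider model:
   a fixed startup cost plus one cycle per significant quotient bit.
   These are NOT real GCC parameters.  */
#define DIV_COST_BASE     5
#define DIV_COST_PER_BIT  1

static int significant_bits(uint64_t v)
{
  int n = 0;
  while (v)
    {
      n++;
      v >>= 1;
    }
  return n;
}

/* Estimated cycles for DIVIDEND / DIVISOR on an early-out unit:
   latency grows with the width of the quotient.  */
static int early_out_div_cost(uint64_t dividend, uint64_t divisor)
{
  int qbits = significant_bits(dividend) - significant_bits(divisor);
  if (qbits < 0)
    qbits = 0;
  return DIV_COST_BASE + DIV_COST_PER_BIT * qbits;
}
```

Under this model a divide producing a tiny quotient costs near the
5-cycle floor, while a full-width quotient approaches the worst case,
matching the 5-vs-39-cycle spread observed on the LA664.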

Jeff
