Re: [PATCH] RISC-V: Add Tenstorrent Ascalon 8 wide architecture

Jeff Law Sat, 14 Dec 2024 07:59:55 -0800



On 12/14/24 1:09 AM, Anton Blanchard wrote:

This adds the Tenstorrent Ascalon 8 wide architecture (tt-ascalon-d8)
to the list of known cores.

gcc/ChangeLog:

        * config/riscv/riscv-cores.def: Add tt-ascalon-d8.
        * config/riscv/riscv.cc (tt_ascalon_d8_tune_info): New.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/mcpu-tt-ascalon-d8.c: New test.

Generally looks pretty good. You know the uarch far better than I, so the questions below are just that -- questions that might lead you to different conclusions about tuning.

+/* Costs to use when optimizing for Tenstorrent Ascalon 8 wide.  */
+static const struct riscv_tune_param tt_ascalon_d8_tune_info = {
+  {COSTS_N_INSNS (2), COSTS_N_INSNS (2)},      /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* fp_mul */
+  {COSTS_N_INSNS (9), COSTS_N_INSNS (16)},     /* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* int_mul */
+  {COSTS_N_INSNS (15), COSTS_N_INSNS (15)},    /* int_div */

If your integer divider has early exit paths you may want to reduce the int_div costs a bit. I found that using ~75% of the actual latency as the cost worked pretty well for our uarch. Obviously this is a heuristic and there's no perfect value.
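
To make the arithmetic concrete, here is a rough sketch of that scaling. COSTS_N_INSNS is defined as it is in GCC's rtl.h; scaled_div_latency is a hypothetical helper invented for illustration, and the 15-cycle input is only an assumed worst-case figure (the patch does not say how its 15 was derived).

```c
/* Same definition GCC uses in rtl.h: the cost of N "fast" instructions. */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper, not part of GCC: scale a worst-case divide latency
   (in cycles) to roughly 75%, rounding to the nearest cycle.  */
static int scaled_div_latency(int worst_case_cycles)
{
    return (worst_case_cycles * 3 + 2) / 4;
}
```

Under that assumption, a 15-cycle worst case would scale to 11, so the table entry would become COSTS_N_INSNS (11) rather than COSTS_N_INSNS (15). Whether that helps depends on how often the early exit paths are actually taken.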

+  false,                                       /* use_divmod_expansion */
+  true,                                                /* overlap_op_by_pieces */
+  RISCV_FUSE_NOTHING,                           /* fusible_ops */

So you've marked the core as not having any fusion capability. That would suggest to me quite strongly that you should be using divmod expansion.

Essentially divmod expansion exposes a pattern which produces the quotient & remainder outputs using a single div + mult + sub which is almost always going to be faster than a div and a mod instruction.

In the case where you don't need the remainder the mult/sub will get trivially removed as dead code. In the case where you don't need the quotient the sequence will be transformed back into a single rem instruction later in the RTL passes (probably combine).
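
At the source level, the shape of that expansion looks roughly like the sketch below. This is only an illustration in C, not the RTL pattern GCC emits; the divmod_result type and the divmod_expanded name are made up for the example.

```c
/* Hypothetical illustration of the div + mult + sub shape; not GCC source. */
typedef struct {
    long quot;
    long rem;
} divmod_result;

static divmod_result divmod_expanded(long a, long b)
{
    divmod_result r;
    r.quot = a / b;          /* the single divide */
    r.rem = a - r.quot * b;  /* multiply + subtract replaces a second rem */
    return r;
}
```

If a caller only reads quot, the multiply and subtract are dead and drop out. If it only reads rem, the three operations compute exactly a % b, which is what allows them to be folded back into one rem instruction.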

If your processor has fusion capabilities, you might want to check whether they map to the ones currently supported and if so set the right bits for fusible ops. If there are cases missing that your processor supports, then we should probably work together as I've got an engineer that's expanded the set of fusible cases in our internal gcc tree that I can make available (just haven't had the time to work through the internal review process yet).


jeff
