On 12/14/24 1:09 AM, Anton Blanchard wrote:
This adds the Tenstorrent Ascalon 8 wide architecture (tt-ascalon-d8)
to the list of known cores.

gcc/ChangeLog:

        * config/riscv/riscv-cores.def: Add tt-ascalon-d8.
        * config/riscv/riscv.cc (tt_ascalon_d8_tune_info): New.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/mcpu-tt-ascalon-d8.c: New test.
Generally looks pretty good. You know the uarch far better than I, so the questions below are just that -- questions that might lead you to different conclusions about tuning.


+/* Costs to use when optimizing for Tenstorrent Ascalon 8 wide.  */
+static const struct riscv_tune_param tt_ascalon_d8_tune_info = {
+  {COSTS_N_INSNS (2), COSTS_N_INSNS (2)},      /* fp_add */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* fp_mul */
+  {COSTS_N_INSNS (9), COSTS_N_INSNS (16)},     /* fp_div */
+  {COSTS_N_INSNS (3), COSTS_N_INSNS (3)},      /* int_mul */
+  {COSTS_N_INSNS (15), COSTS_N_INSNS (15)},    /* int_div */
If your integer divider has early exit paths you may want to reduce the int_div costs a bit. I found that using ~75% of the actual latency as the cost worked pretty well for our uarch. Obviously this is a heuristic and there's no perfect value.
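
A rough sketch of that heuristic, for illustration only. It assumes the 15 in the patch is the full divider latency in cycles, which the patch does not state. The macro below is a local stand-in mirroring how GCC scales instruction counts into cost units, and scaled_div_latency is a hypothetical helper, not something in riscv.cc.

```c
#include <assert.h>

/* Local stand-in for GCC's cost-unit scaling (one insn = 4 units). */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: take roughly 75% of the measured latency,
   rounding down, to account for early-exit paths in the divider. */
static int scaled_div_latency(int full_latency)
{
    assert(full_latency >= 0);
    return (full_latency * 3) / 4;
}
```

Under that assumption, a 15 cycle divider would be entered as COSTS_N_INSNS (11) rather than COSTS_N_INSNS (15). The right value still depends on how often the early exits actually trigger.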




+  false,                                       /* use_divmod_expansion */
+  true,                                        /* overlap_op_by_pieces */
+  RISCV_FUSE_NOTHING,                           /* fusible_ops */
So you've marked the core as not having any fusion capability. That suggests to me quite strongly that you should be using divmod expansion.

Essentially divmod expansion exposes a pattern which produces the quotient & remainder outputs using a single div + mult + sub which is almost always going to be faster than a div and a mod instruction.

In the case where you don't need the remainder the mult/sub will get trivially removed as dead code. In the case where you don't need the quotient the sequence will be transformed back into a single rem instruction later in the RTL passes (probably combine).
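
At the source level, the shape of the expanded sequence looks roughly like the sketch below. This is only an illustration in plain C, not the RTL GCC generates; the struct and function names are made up for the example.

```c
#include <assert.h>

/* Hypothetical types and names, for illustration only. */
struct divmod_result {
    long quot;
    long rem;
};

/* One divide, then the remainder is recovered with a multiply and a
   subtract instead of issuing a separate rem instruction. */
static struct divmod_result divmod_expanded(long a, long b)
{
    struct divmod_result r;
    assert(b != 0);            /* division by zero is undefined in C */
    r.quot = a / b;            /* the only divide */
    r.rem = a - r.quot * b;    /* mul + sub replaces the rem */
    return r;
}
```

If a caller only reads quot, the multiply and subtract are dead and go away. If it only reads rem, the three operations are what gets folded back into a single rem.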

If your processor has fusion capabilities, you might want to check whether they map to the ones currently supported and, if so, set the right bits for fusible_ops. If there are cases your processor supports that are missing, then we should probably work together, as I've got an engineer who has expanded the set of fusible cases in our internal gcc tree that I can make available (just haven't had the time to work through the internal review process yet).

jeff
