On 12/14/24 1:09 AM, Anton Blanchard wrote:
This adds the Tenstorrent Ascalon 8 wide architecture (tt-ascalon-d8)
to the list of known cores.
gcc/ChangeLog:
* config/riscv/riscv-cores.def: Add tt-ascalon-d8.
* config/riscv/riscv.cc (tt_ascalon_d8_tune_info): New.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/mcpu-tt-ascalon-d8.c: New test.
Generally looks pretty good. You know the uarch far better than I, so
the questions below are just that -- questions that might lead you to
different conclusions about tuning.
+/* Costs to use when optimizing for Tenstorrent Ascalon 8 wide. */
+static const struct riscv_tune_param tt_ascalon_d8_tune_info = {
+ {COSTS_N_INSNS (2), COSTS_N_INSNS (2)}, /* fp_add */
+ {COSTS_N_INSNS (3), COSTS_N_INSNS (3)}, /* fp_mul */
+ {COSTS_N_INSNS (9), COSTS_N_INSNS (16)}, /* fp_div */
+ {COSTS_N_INSNS (3), COSTS_N_INSNS (3)}, /* int_mul */
+ {COSTS_N_INSNS (15), COSTS_N_INSNS (15)}, /* int_div */
If your integer divider has early exit paths you may want to reduce the
int_div costs a bit. I found that ~75% of the actual latency as the
cost worked pretty well for our uarch. Obviously this is a heuristic
and there's no perfect value.
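
To make the heuristic concrete, here is a rough sketch of the arithmetic. It assumes the 15 in the patch is the real worst-case divide latency, which I don't know. The helper name is made up for illustration; only COSTS_N_INSNS mirrors the real macro from rtl.h.

```c
/* Mirrors GCC's COSTS_N_INSNS from rtl.h: the cost of N fast instructions.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper, not part of GCC: scale a measured latency to
   roughly 75%, rounding to the nearest cycle.  */
static int
scaled_div_latency (int latency_cycles)
{
  return (latency_cycles * 3 + 2) / 4;
}
```

With a 15 cycle latency this gives 11, so the int_div entry would become COSTS_N_INSNS (11) for both modes. Treat that as a starting point for benchmarking rather than a recommendation.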
+ false, /* use_divmod_expansion */
+ true, /* overlap_op_by_pieces */
+ RISCV_FUSE_NOTHING, /* fusible_ops */
So you've marked the core as not having any fusion capability. That would
suggest to me quite strongly that you should be using divmod expansion.
Essentially divmod expansion exposes a pattern which produces the
quotient & remainder outputs using a single div + mult + sub which is
almost always going to be faster than a div and a mod instruction.
In the case where you don't need the remainder the mult/sub will get
trivially removed as dead code. In the case where you don't need the
quotient the sequence will be transformed back into a single rem
instruction later in the RTL passes (probably combine).
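
In C terms, the two shapes look roughly like the sketch below. The struct and function names are invented for illustration, and this is source-level C rather than the RTL the expander actually emits. Callers are assumed to avoid b == 0 and INT64_MIN / -1.

```c
#include <stdint.h>

/* Quotient/remainder pair; the name is made up for this sketch.  */
struct divmod_result
{
  int64_t quot;
  int64_t rem;
};

/* Without divmod expansion: two long-latency operations, div and rem.  */
static struct divmod_result
divmod_separate (int64_t a, int64_t b)
{
  struct divmod_result r = { a / b, a % b };
  return r;
}

/* With divmod expansion: one div, then the remainder is recovered
   with a multiply and a subtract.  */
static struct divmod_result
divmod_expanded (int64_t a, int64_t b)
{
  struct divmod_result r;
  r.quot = a / b;
  r.rem = a - r.quot * b;
  return r;
}
```

Both forms give the same results because C division truncates toward zero, so a == (a / b) * b + a % b holds for any valid operands.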
If your processor has fusion capabilities, you might want to check whether
they map to the ones currently supported and, if so, set the right bits
for fusible ops. If there are cases missing that your processor supports,
then we should probably work together as I've got an engineer that's
expanded the set of fusible cases in our internal gcc tree that I can
make available (just haven't had the time to work through the internal
review process yet).
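
As a rough sketch of what setting the bits means: fusible_ops is a bitmask, so supported pairs get ORed together. The enum below is a local stand-in, not the real definition. The two non-zero names are as I recall them from riscv.cc and the bit positions are illustrative, so please check both against the tree. Whether Ascalon fuses these particular pairs is purely an assumption for the example.

```c
/* Local stand-in for the fusion enum in riscv.cc.  Names other than
   RISCV_FUSE_NOTHING are recalled from memory; bit values are illustrative.  */
enum riscv_fusion_pairs_sketch
{
  RISCV_FUSE_NOTHING = 0,
  RISCV_FUSE_LUI_ADDI = 1 << 4,
  RISCV_FUSE_AUIPC_ADDI = 1 << 5
};

/* Hypothetical helper: the mask a core fusing lui+addi and auipc+addi
   would place in the fusible_ops slot.  */
static unsigned int
example_fusible_ops (void)
{
  return RISCV_FUSE_LUI_ADDI | RISCV_FUSE_AUIPC_ADDI;
}
```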
jeff