On 8/21/25 11:48 PM, Anton Blanchard wrote:
Add pipeline description for the Tenstorrent Ascalon 8 wide CPU.

gcc/ChangeLog:
        * config/riscv/riscv-cores.def (RISCV_TUNE): Update.
        * config/riscv/riscv-opts.h (enum riscv_microarchitecture_type):
          Add tt_ascalon_d8.
        * config/riscv/riscv.md: Update tune attribute and include
          tt-ascalon-d8.md.
        * config/riscv/tt-ascalon-d8.md: New file.

Signed-off-by: Anton Blanchard <ant...@tenstorrent.com>
---
  gcc/config/riscv/riscv-cores.def  |   2 +-
  gcc/config/riscv/riscv-opts.h     |   1 +
  gcc/config/riscv/riscv.md         |   3 +-
  gcc/config/riscv/tt-ascalon-d8.md | 374 ++++++++++++++++++++++++++++++
  4 files changed, 378 insertions(+), 2 deletions(-)
  create mode 100644 gcc/config/riscv/tt-ascalon-d8.md

A few notes:

I modelled decode, since the aggregate issue bandwidth (for the right sequence
of instructions) is well above it. Not sure if it's necessary, but it felt like
the right thing to do.
There's no 100% right answer every time on this kind of stuff. If it seems to be working well, then consider it the right approach. For OOO cores it's often just getting the basics right (load-use stalls, for example) that matters the most, and the deeper the OOO reordering, the less the DFA matters.



There is no LMUL cost adjustment (riscv_sched_adjust_cost() checks for
generic-vector-ooo tuning). Each target is likely to have different
adjustments, so do we need an optional cost adjustment function in
riscv_tune_param?
I think Robin (internally, but we can pass it along if it'd be helpful) did LMUL scaling in adjust_stmt_cost. While each target may have different adjustments, I wouldn't be surprised if the basics scale linearly across designs. After all, for a lot of stuff LMUL2 just means we double-pump the datapath, and similarly quad/octa-pump for LMUL4/LMUL8.

Where things might get interesting is whether designs have multiple sequencers or share one (and if shared, the precise sharing configuration). I.e., with a single sequencer shared by two vector ALUs, you likely can't have two LMUL>1 ALU ops in flight, as they compete for the sequencer. Anyway, that's just one of the things to think about. In general I don't think we've got a good handle on LMUL>1 handling throughout the compiler.
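That double/quad/octa-pumping intuition reduces to linear scaling, which can be sketched as a trivial helper (a sketch only; the function name and clamping are assumptions, not GCC's actual adjust_stmt_cost hook):

```c
/* Hypothetical sketch of linear LMUL cost scaling: if LMUL2 double-pumps
   the datapath and LMUL4/LMUL8 quad/octa-pump it, a statement cost scales
   linearly with LMUL.  Fractional LMUL (mf2/mf4/mf8) still occupies the
   unit for at least one pass, so it is not scaled below the base cost.  */
int
scale_cost_by_lmul (int base_cost, int lmul)
{
  return lmul > 1 ? base_cost * lmul : base_cost;
}
```

A per-target hook could apply something like this uniformly, with outliers (e.g. shared-sequencer contention) handled case by case.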


On tt-ascalon, instructions that take a GPR for input incur an extra fixed
cost. I'm not sure how best to model that; it seems like we'd need an awful lot
of extra types if we wanted to split almost every instruction in two (one
for FP/VEC sources only, one for a GPR source). Can we adjust the cost in the
per-target cost adjustment function suggested above?
Do you mean _vector_ instructions that take a GPR input? I think that can be modeled as an anti-bypass or something similar.
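A hedged sketch of what that might look like (an ordinary bypass lowers the latency seen by a particular producer/consumer pair, while an anti-bypass raises it above the producer's default; the reservation names here are illustrative, not from the patch):

```
;; Hypothetical anti-bypass: a vector op consuming a GPR produced by an
;; integer ALU op sees a longer latency than the ALU op's default.
(define_bypass 5 "tt_ascalon_d8_int_alu" "tt_ascalon_d8_vec_alu")
```

This keeps the fixed GPR-source penalty out of the type attribute space entirely.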



The complicated vector load/store instructions vary depending on the inputs and
for now I just put in median values.
Yea. You run into similar problems with division units that have early-exit paths. Modeling a common cost likely gets you the vast majority of the benefit; further refinement is certainly possible in response to real-world code.
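For example, an indexed vector load whose latency depends on the index pattern and element count might be modeled at a median value (the latency and reservation name below are assumed placeholders, not measured numbers from the patch):

```
;; Indexed vector loads vary widely with the inputs; model a median
;; latency rather than the worst case.
(define_insn_reservation "tt_ascalon_d8_vec_load_idx" 9
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "vldux,vldox"))
  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
```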



+
+;; Integer load/store
+(define_insn_reservation "tt_ascalon_d8_int_load" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "load"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+(define_insn_reservation "tt_ascalon_d8_int_store" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "store"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+;; Float load/store
+(define_insn_reservation "tt_ascalon_d8_float_load" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "fpload"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+(define_insn_reservation "tt_ascalon_d8_float_store" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "fpstore"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
All look quite sensible. You could do these as a single reservation by using (eq_attr "type" "load,store,fpload,fpstore"), but that's certainly not necessary and whether or not it's cleaner is a personal preference.
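For reference, the combined form would look something like this (same units and latency as the patch's four reservations; the reservation name is made up):

```
(define_insn_reservation "tt_ascalon_d8_mem" 4
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "load,store,fpload,fpstore"))
  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
```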



+
+;; Integer division is not pipelined.  Do not block the unit for more than
+;; three cycles so the DFA does not get too large.  Similar for other
+;; non-pipelined instructions.  Division takes a variable number of cycles,
+;; so pick a value in the middle.
+(define_insn_reservation "tt_ascalon_d8_idiv" 15
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "idiv"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_div,tt_ascalon_d8_div*3")
Yup. Quite sensible. You've stumbled across very common issues. It's virtually impossible to hide all the latency of div/sqrt kinds of operations, so folks just clamp at some value that keeps the DFA from blowing up. The incorrect modeling rarely, if ever, matters in practice.

Similarly with early outs. There's obviously no single value that's right. I think we're modeling ours at 75% worst case latency, but I suspect anywhere from 25%-75% is all good.

All in all it looks good. I see that for the cases where you have mode selectors, you're covering HF/SF/DF. You might consider covering BF to future-proof, but I wouldn't consider that a hard requirement.

Did you run Kito's pipeline checker script? I suspect it'll tell you you need a few more cases in your model. Essentially we have checking asserts that require every insn to have a mapping to a reservation, so someone could ask for code generation using extensions you haven't modeled while tuning for tt-ascalon-d8, which would likely trigger an ICE.

Most (if not all) DFAs for RISC-V include some kind of dummy/fake reservation into which we throw everything not otherwise handled.
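A typical catch-all looks something like this (the type list here is illustrative; a real one enumerates every type attribute value not claimed by another reservation):

```
;; Dummy reservation so every insn maps to something and the scheduler's
;; checking asserts don't fire for types we don't otherwise model.
(define_insn_reservation "tt_ascalon_d8_unknown" 1
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "unknown,multi,ghost"))
  "tt_ascalon_d8_decode")
```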

Jeff
