On 8/21/25 11:48 PM, Anton Blanchard wrote:
Add pipeline description for the Tenstorrent Ascalon 8 wide CPU.

gcc/ChangeLog:
        * config/riscv/riscv-cores.def (RISCV_TUNE): Update.
        * config/riscv/riscv-opts.h (enum riscv_microarchitecture_type):
          Add tt_ascalon_d8.
        * config/riscv/riscv.md: Update tune attribute and include
          tt-ascalon-d8.md.
        * config/riscv/tt-ascalon-d8.md: New file.

Signed-off-by: Anton Blanchard <ant...@tenstorrent.com>
---
  gcc/config/riscv/riscv-cores.def  |   2 +-
  gcc/config/riscv/riscv-opts.h     |   1 +
  gcc/config/riscv/riscv.md         |   3 +-
  gcc/config/riscv/tt-ascalon-d8.md | 374 ++++++++++++++++++++++++++++++
  4 files changed, 378 insertions(+), 2 deletions(-)
  create mode 100644 gcc/config/riscv/tt-ascalon-d8.md

A few notes:

I modelled decode, since the aggregate issue bandwidth (for the right sequence
of instructions) is well above it. Not sure if it's necessary, but it felt like
the right thing to do.
There's no 100% right answer every time on this kind of stuff. If it seems to be working well, then consider it the right approach. For OOO cores it's often just getting the basics right (load-use stalls, for example) that matters the most, and the deeper the OOO reordering, the less the DFA matters.



There is no LMUL cost adjustment (riscv_sched_adjust_cost() checks for
generic-vector-ooo tuning). Each target is likely to have different
adjustments, so do we need an optional cost adjustment function in
riscv_tune_param?
I think Robin (internally, but we can pass it along if it'd be helpful) did LMUL scaling in adjust_stmt_cost. While each target may have different adjustments, I wouldn't be surprised if the basics scale linearly across designs. After all, for a lot of stuff LMUL2 just means we double-pump the datapath, and similarly quad/octa-pump for LMUL4/LMUL8.

Where things might get interesting is whether designs have multiple sequencers or share one (and if shared, the precise sharing configuration). I.e., with a single sequencer shared by two vector ALUs, you likely can't have two LMUL>1 ALU ops in flight, as they compete for the sequencer. Anyway, that's just one of the things to think about. In general I don't think we've got a good handle on LMUL>1 handling throughout the compiler.
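That double/quad/octa-pumping intuition reduces to linear scaling, which can be sketched as a trivial helper (a sketch only; the function name and clamping are assumptions, not GCC's actual adjust_stmt_cost hook):

```c
/* Hypothetical sketch of linear LMUL cost scaling: if LMUL2 double-pumps
   the datapath and LMUL4/LMUL8 quad/octa-pump it, a statement cost scales
   linearly with LMUL.  Fractional LMUL (mf2/mf4/mf8) still occupies the
   unit for at least one pass, so it is not scaled below the base cost.  */
int
scale_cost_by_lmul (int base_cost, int lmul)
{
  return lmul > 1 ? base_cost * lmul : base_cost;
}
```

A per-target hook could apply something like this uniformly, with outliers (e.g. shared-sequencer contention) handled case by case.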


On tt-ascalon, instructions that take a GPR for input incur an extra fixed
cost. I'm not sure how best to model that; it seems like we'd need an awful lot
of extra types if we wanted to split almost every instruction in two (one
for FP/VEC sources only, one for a GPR source). Can we adjust the cost in the
per-target cost adjustment function suggested above?
Do you mean _vector_ instructions that take a GPR input? I think that can be modeled as an anti-bypass or something similar.
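A hedged sketch of what that might look like (an ordinary bypass lowers the latency seen by a particular producer/consumer pair, while an anti-bypass raises it above the producer's default; the reservation names here are illustrative, not from the patch):

```
;; Hypothetical anti-bypass: a vector op consuming a GPR produced by an
;; integer ALU op sees a longer latency than the ALU op's default.
(define_bypass 5 "tt_ascalon_d8_int_alu" "tt_ascalon_d8_vec_alu")
```

This keeps the fixed GPR-source penalty out of the type attribute space entirely.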



The complicated vector load/store instructions vary depending on the inputs and
for now I just put in median values.
Yea. You run into similar problems with division units that have early-exit paths. Modeling a common cost likely gets you the vast majority of the benefit; further refinement is certainly possible in response to real-world code.
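For example, an indexed vector load whose latency depends on the index pattern and element count might be modeled at a median value (the latency and reservation name below are assumed placeholders, not measured numbers from the patch):

```
;; Indexed vector loads vary widely with the inputs; model a median
;; latency rather than the worst case.
(define_insn_reservation "tt_ascalon_d8_vec_load_idx" 9
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "vldux,vldox"))
  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
```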



+
+;; Integer load/store
+(define_insn_reservation "tt_ascalon_d8_int_load" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "load"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+(define_insn_reservation "tt_ascalon_d8_int_store" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "store"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+;; Float load/store
+(define_insn_reservation "tt_ascalon_d8_float_load" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "fpload"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+(define_insn_reservation "tt_ascalon_d8_float_store" 4
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "fpstore"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
All look quite sensible. You could do these as a single reservation by using (eq_attr "type" "load,store,fpload,fpstore"), but that's certainly not necessary and whether or not it's cleaner is a personal preference.
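For reference, the combined form would look something like this (same units and latency as the patch's four reservations; the reservation name is made up):

```
(define_insn_reservation "tt_ascalon_d8_mem" 4
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "load,store,fpload,fpstore"))
  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
```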



+
+;; Integer division is not pipelined.  Do not block the unit for more than
+;; three cycles so the DFA does not get too large.  Similar for other
+;; non-pipelined instructions.  Division takes a variable number of cycles,
+;; so pick a value in the middle.
+(define_insn_reservation "tt_ascalon_d8_idiv" 15
+  (and (eq_attr "tune" "tt_ascalon_d8")
+       (eq_attr "type" "idiv"))
+  "tt_ascalon_d8_decode,tt_ascalon_d8_div,tt_ascalon_d8_div*3")
Yup. Quite sensible. You've stumbled across very common issues. It's virtually impossible to hide all the latency of div/sqrt kinds of operations, so folks just clamp at some value that keeps the DFA from blowing up. The incorrect modeling rarely, if ever, matters in practice.

Similarly with early outs. There's obviously no single value that's right. I think we're modeling ours at 75% worst case latency, but I suspect anywhere from 25%-75% is all good.

All in all it looks good. I see that for the cases where you have mode selectors, you're covering HF/SF/DF. You might consider covering BF to future-proof, but I wouldn't consider that a hard requirement.

Did you run Kito's pipeline checker script? I suspect it'll tell you you need a few more cases in your model. Essentially we have checking asserts that require every insn to have a mapping to a reservation, so someone could ask for code generation using extensions you haven't modeled while tuning for tt-ascalon-d8, which would likely trigger an ICE.

Most (if not all) DFAs for RISC-V include some kind of dummy/fake reservation into which we throw everything not otherwise handled.
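A typical catch-all looks something like this (the type list here is illustrative; a real one enumerates every type attribute value not claimed by another reservation):

```
;; Dummy reservation so every insn maps to something and the scheduler's
;; checking asserts don't fire for types we don't otherwise model.
(define_insn_reservation "tt_ascalon_d8_unknown" 1
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "unknown,multi,ghost"))
  "tt_ascalon_d8_decode")
```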

Jeff
