On 8/21/25 11:48 PM, Anton Blanchard wrote:
Add pipeline description for the Tenstorrent Ascalon 8 wide CPU.
gcc/ChangeLog:
* config/riscv/riscv-cores.def (RISCV_TUNE): Update.
* config/riscv/riscv-opts.h (enum riscv_microarchitecture_type):
Add tt_ascalon_d8.
* config/riscv/riscv.md: Update tune attribute and include
tt-ascalon-d8.md.
* config/riscv/tt-ascalon-d8.md: New file.
Signed-off-by: Anton Blanchard <ant...@tenstorrent.com>
---
gcc/config/riscv/riscv-cores.def | 2 +-
gcc/config/riscv/riscv-opts.h | 1 +
gcc/config/riscv/riscv.md | 3 +-
gcc/config/riscv/tt-ascalon-d8.md | 374 ++++++++++++++++++++++++++++++
4 files changed, 378 insertions(+), 2 deletions(-)
create mode 100644 gcc/config/riscv/tt-ascalon-d8.md
A few notes:
I modelled decode since the aggregate issue bandwidth across the execution
units (for the right sequence of instructions) is well above the 8-wide
decode limit. Not sure if it's necessary, but it felt like the right thing to do.
There's no 100% right answer for this kind of thing. If it seems to be
working well, then consider it the right approach. For OOO cores it's
often just getting the basics right (load-use stalls, for example) that
matters the most, and the deeper the OOO reordering, the less the DFA
matters.
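FWIW, here's a minimal sketch of one way to model an 8-wide decode stage,
assuming eight interchangeable decode slots. The unit and reservation names
are illustrative, not necessarily what the patch uses:

(define_cpu_unit "ttad8_dec0,ttad8_dec1,ttad8_dec2,ttad8_dec3" "tt_ascalon_d8")
(define_cpu_unit "ttad8_dec4,ttad8_dec5,ttad8_dec6,ttad8_dec7" "tt_ascalon_d8")
;; Any free slot satisfies the decode reservation, so up to eight
;; instructions can decode per cycle.
(define_reservation "tt_ascalon_d8_decode"
  "ttad8_dec0|ttad8_dec1|ttad8_dec2|ttad8_dec3|ttad8_dec4|ttad8_dec5|ttad8_dec6|ttad8_dec7")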
There is no LMUL cost adjustment (riscv_sched_adjust_cost() checks for
generic-vector-ooo tuning). Each target is likely to have different
adjustments, so do we need an optional cost adjustment function in
riscv_tune_param?
I think Robin did LMUL scaling in adjust_stmt_cost (internally, but we
can pass it along if it'd be helpful). While each target may have
different adjustments, I wouldn't be surprised if the basics scale
linearly across designs. After all, for a lot of operations LMUL2 just
means we double-pump the datapath, and similarly LMUL4/LMUL8 mean
quad/octa-pumping.
Where things might get interesting is whether designs have multiple
sequencers or share one (and if shared, the precise sharing
configuration). I.e., with a single sequencer and two vector ALUs
sharing it, you likely can't have two LMUL>1 ALU ops in flight, as they
compete for the sequencer. Anyway, that's just one of the things to
think about. In general I don't think we've got a good handle on LMUL>1
handling throughout the compiler.
On tt-ascalon, instructions that take a GPR input incur an extra fixed
cost. I'm not sure how best to model that - it seems like we'd need an awful
lot of extra types if we wanted to split almost every instruction in two (one
for FP/VEC sources only, one for a GPR source). Can we adjust the cost in the
per-target cost adjustment function suggested above?
Do you mean _vector_ instructions that take a GPR input? I think that
can be modeled as an anti-bypass or something similar.
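FWIW, a rough sketch of the anti-bypass idea, assuming an integer ALU
reservation with a default latency of 1 and some vector reservations; all
of these names are hypothetical, not taken from the patch:

;; Vector instructions reading a GPR produced by an integer ALU op see
;; two extra cycles of latency relative to the normal forwarding path.
(define_bypass 3 "tt_ascalon_d8_alu"
  "tt_ascalon_d8_vec_alu,tt_ascalon_d8_vec_mul")

A guard function (the optional fourth argument to define_bypass) could
restrict this further if only some operand positions incur the penalty.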
The latencies of the complicated vector load/store instructions vary depending
on the inputs; for now I just put in median values.
Yea. You run into similar problems with division units that have early
exit paths. Modeling a common cost likely gets you the vast majority of
the benefit; further refinement is certainly possible in response to
real-world code.
+
+;; Integer load/store
+(define_insn_reservation "tt_ascalon_d8_int_load" 4
+ (and (eq_attr "tune" "tt_ascalon_d8")
+ (eq_attr "type" "load"))
+ "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+(define_insn_reservation "tt_ascalon_d8_int_store" 4
+ (and (eq_attr "tune" "tt_ascalon_d8")
+ (eq_attr "type" "store"))
+ "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+;; Float load/store
+(define_insn_reservation "tt_ascalon_d8_float_load" 4
+ (and (eq_attr "tune" "tt_ascalon_d8")
+ (eq_attr "type" "fpload"))
+ "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
+
+(define_insn_reservation "tt_ascalon_d8_float_store" 4
+ (and (eq_attr "tune" "tt_ascalon_d8")
+ (eq_attr "type" "fpstore"))
+ "tt_ascalon_d8_decode,tt_ascalon_d8_ls")
All look quite sensible. You could do these as a single reservation by
using (eq_attr "type" "load,store,fpload,fpstore"), but that's certainly
not necessary, and whether or not it's cleaner is a matter of personal
preference.
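Concretely, the four reservations above would collapse to something like
(reservation name is just a suggestion):

(define_insn_reservation "tt_ascalon_d8_ldst" 4
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "load,store,fpload,fpstore"))
  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")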
+
+;; Integer division is not pipelined. Do not block the unit for more than
+;; three cycles so the DFA does not get too large. Similarly for other
+;; non-pipelined instructions. Division takes a variable number of cycles,
+;; so pick a value in the middle.
+(define_insn_reservation "tt_ascalon_d8_idiv" 15
+ (and (eq_attr "tune" "tt_ascalon_d8")
+ (eq_attr "type" "idiv"))
+ "tt_ascalon_d8_decode,tt_ascalon_d8_div,tt_ascalon_d8_div*3")
Yup. Quite sensible. You've stumbled across very common issues. It's
virtually impossible to hide all the latency of div/sqrt kinds of
operations, so folks just clamp at some value that keeps the DFA from
blowing up. The incorrect modeling rarely, if ever, matters in practice.
Similarly with early outs. There's obviously no single value that's
right. I think we're modeling ours at 75% worst case latency, but I
suspect anywhere from 25%-75% is all good.
In all it looks good. I see that for the cases where you have mode
selectors, you're covering HF/SF/DF. You might consider covering BF as
well to future-proof, but I wouldn't consider that a hard requirement.
Did you run Kito's pipeline checker script? I suspect it'll tell you
that you need a few more cases in your model. Essentially we have
checking asserts that require every insn to map to a reservation. So
someone could ask for code generation for extensions you don't have, but
with tuning for ascalon-d8, and that would likely trigger an ICE.
Most (if not all) DFAs for RISC-V include some kind of dummy/fake
reservation into which we throw everything not otherwise handled.
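Something along these lines (the type list here is just an illustration;
the checker script will tell you which types are actually missing):

;; Catch-all for anything without a more specific reservation, so the
;; scheduler's checking asserts never fire.
(define_insn_reservation "tt_ascalon_d8_unknown" 1
  (and (eq_attr "tune" "tt_ascalon_d8")
       (eq_attr "type" "ghost,atomic,condmove"))
  "tt_ascalon_d8_decode,tt_ascalon_d8_ls")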
Jeff