On 2/5/2026 10:21 AM, Nikola Ratkovac wrote:
Hi, thank you for the feedback.

I’ve updated the model to clamp all reservations to 7 cycles, while keeping
the full latencies. I wasn’t aware that large reservation values could
significantly impact DFA build times, so thank you for pointing that out.
Yeah, it's well known to those who have been bitten by problems in this space, but otherwise it's pretty obscure.
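
To make the clamping concrete, here is a toy sketch. It is not GCC code, and `InsnTiming`, `clamp_reservation`, and `kMaxReservationCycles` are made-up names for illustration only. The idea is that the latency seen by dependent instructions stays at its full value, while the number of cycles the unit is modeled as busy is capped at 7:

```cpp
// Toy model only: none of these names exist in GCC.
// The cap of 7 cycles is the value mentioned in this thread.
constexpr int kMaxReservationCycles = 7;

struct InsnTiming {
  int latency;      // cycles until a dependent insn can use the result (kept in full)
  int reservation;  // cycles the functional unit is modeled as busy (clamped)
};

// Keep the measured latency, but never reserve the unit for more than the cap.
constexpr InsnTiming clamp_reservation(int latency, int busy_cycles) {
  return InsnTiming{latency,
                    busy_cycles > kMaxReservationCycles ? kMaxReservationCycles
                                                        : busy_cycles};
}
```

So a hypothetical 24-cycle operation would still report a latency of 24, but would only hold its unit for 7 cycles in the automaton.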

Regarding -madjust-lmul-cost, using it for Spacemit-X60 would require
adjusting the condition in riscv.cc in the function riscv_sched_adjust_cost
on line 11829:

-  if (!TARGET_VECTOR || riscv_microarchitecture != generic_ooo)
+  if (!TARGET_VECTOR || (riscv_microarchitecture != generic_ooo
+                         && riscv_microarchitecture != spacemit_x60))

Would that be preferable to explicit LMUL handling?
It might.  I could easily see a model where we funnel all the LMUL adjustments through that routine rather than having them in the scheduler description itself, where they are a bit awkward due to the duplication.

On the other hand, a large LMUL operation is almost certainly going to double, quad or octa pump the datapath, which keeps those units busy, and I don't think that can be handled in the adjust_cost hook.  But it's also the case that large LMUL ops are likely going to need to be clamped to avoid blowing up the DFA, so maybe describing the functional unit hazard beyond LMUL1 doesn't make much sense.
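
A rough sketch of that trade-off, under loud assumptions: this is not the actual riscv_sched_adjust_cost logic, the linear scaling with LMUL is a simplification, and `Lmul`, `adjusted_latency`, and `modeled_busy_cycles` are hypothetical names. It only illustrates that once busy cycles are clamped, the larger LMULs all look the same to the automaton, while an adjust_cost-style latency can still differ:

```cpp
// Toy model only: not GCC code, and linear scaling with LMUL is an assumption.
enum class Lmul { M1 = 1, M2 = 2, M4 = 4, M8 = 8 };

// Number of passes through the datapath (1x, double, quad, octa pumping).
constexpr int pump_count(Lmul lmul) { return static_cast<int>(lmul); }

// adjust_cost-style view: scale the LMUL1 latency by the pump count.
constexpr int adjusted_latency(int lmul1_latency, Lmul lmul) {
  return lmul1_latency * pump_count(lmul);
}

// Pipeline-description view: unit occupancy grows with LMUL but is capped.
constexpr int modeled_busy_cycles(int lmul1_busy, Lmul lmul, int cap) {
  return lmul1_busy * pump_count(lmul) > cap ? cap
                                             : lmul1_busy * pump_count(lmul);
}
```

With a hypothetical 4-cycle LMUL1 occupancy and a cap of 7, LMUL2, LMUL4 and LMUL8 all end up modeled as 7 busy cycles, while the scaled latencies remain 8, 16 and 32.
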
Pipeline costs are based on llvm/test/tools/llvm-mca/RISCV/SpacemitX60/rvv
simulations using the existing Spacemit-X60 model, combined with RVV
instruction-level microbenchmarks from the camel-cdr/rvv-bench repo,
which I ran on the Banana Pi BPI-F3 board.
Understood.  Thanks.  The c908 user manual may be of some help as well (there are unconfirmed rumors that the K1 was derived from the c908 design).  It seems to be hard to find on the net these days, but I think I've got a copy here.  It certainly helped explain some weird behavior I saw with the scalar integer units.



In the patch I intentionally treat the VXU as a single logical unit.
I don’t have enough data to reliably classify all instruction types,
so I kept the model simple. The same applies to vector FP.
Understood.  That was the model I was starting to guide Austin towards when he had to set it aside: essentially, ignore the possibility of utilizing the split 128-bit units independently.  With that kind of model we'd be looking to identify those operations that only run in one of the units so that their latency can be adjusted.

Jeff
