On 2/4/2026 9:21 AM, Nikola Ratkovac wrote:
This patch introduces a vector cost model for the Spacemit-X60 core.

The model is LMUL-aware: measurements show that vector instruction
latency and throughput vary significantly with LMUL, so the cost
model distinguishes between the m1/m2/m4/m8 cases.

To keep the machine description manageable, a new 'vector_lmul'
attribute is introduced to map RVV modes to their corresponding LMUL
values. The costs are based on llvm-mca performance simulations
and microbenchmarks, with additional stress tests used to validate
and adjust individual instruction types.
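
As a sketch of the idea only -- the attribute name matches the patch, but
the mode list below is abbreviated and the classification is illustrative,
not the actual definition from the patch:

    ;; Sketch: classify RVV insns by LMUL based on their machine mode.
    ;; Abbreviated, illustrative mode list.
    (define_attr "vector_lmul" "m1,m2,m4,m8"
      (cond [(eq_attr "mode" "RVVM1QI,RVVM1HI,RVVM1SI,RVVM1DI")
               (const_string "m1")
             (eq_attr "mode" "RVVM2QI,RVVM2HI,RVVM2SI,RVVM2DI")
               (const_string "m2")
             (eq_attr "mode" "RVVM4QI,RVVM4HI,RVVM4SI,RVVM4DI")
               (const_string "m4")]
            (const_string "m8")))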

On selected numerical benchmarks this results in performance improvements
of ~3%, while instruction counts remain effectively unchanged (<0.1%).

| Benchmark        | Metric | Trunk            | Vector Cost Model | Δ (%)   |
|------------------|--------|------------------|-------------------|---------|
| SciMark2-C       | cycles | 311,538,498,801  | 300,670,104,666   | -3.49%  |
| tramp3d-v4       | cycles | 23,673,618,009   | 22,916,964,182    | -3.20%  |
| FreeBench/neural | cycles | 471,768,472      | 454,850,594       | -3.59%  |

Benchmarks were run from the LLVM test-suite
(MultiSource/Benchmarks) using:

taskset -c 0 perf stat -r 10 ./...

SciMark2-C, FreeBench/neural, and tramp3d-v4
were used as representative numerical workloads.

For tramp3d-v4, the workload parameters (--cartvis 1.0 0.0, --rhomin 1e-8,
-n 20) increase floating-point intensity and dependency pressure, placing
greater stress on the scheduler.

2026-02-04  Nikola Ratkovac  <[email protected]>

gcc/ChangeLog:

     * config/riscv/spacemit-x60.md: Add primary vector pipeline model
     for the Spacemit-X60 core.
     (vector_lmul): New attribute mapping machine modes to LMUL.
     (spacemit_x60_dummy): Rename from spacemi6_x60_dummy.
So at a high level, I would clamp all the reservations at 7 cycles -- beyond that the DFA blows up badly, which will significantly harm build times for GCC itself.  And in reality it's extremely difficult to find enough independent instructions to fill latencies of more than a few cycles, so the delta in code quality is minimal.  It's fine to have the latency higher, but clamp the number of cycles in the reservation.
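
Something along these lines (placeholder unit/insn names, just to illustrate the shape -- keep the measured latency, but cap the unit reservation at 7):

    ;; Full measured latency (say 12c), but the unit is only held for 7
    ;; cycles so the DFA stays a reasonable size.  Assumes a unit named
    ;; x60_vfpu is declared elsewhere; names here are placeholders.
    (define_insn_reservation "x60_vfdiv_m8" 12
      (and (eq_attr "tune" "spacemit_x60")
           (and (eq_attr "type" "vfdiv")
                (eq_attr "vector_lmul" "m8")))
      "x60_vfpu*7")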

Where are you getting pipeline information from?  LLVM?  The c908 manuals, something from SpacemIT?

I don't see that you try to handle the dual vector integer ALUs.  Essentially there are two 128-bit ALUs.  Some operations can go into both ALUs, some are restricted to either ALU0 or ALU1.  When an operation can be handled by both, a 256-bit VLEN instruction has an observed latency of 1c (i.e., something like a vadd with LMUL=1).  When an operation is restricted to just one of the ALUs, the observed latency is 2c because the data has to be double-pumped (shifts, I think, fall into this category).  At least that's my understanding of how the unit works.
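
In DFA terms that distinction could look roughly like this (again placeholder names, only sketching the either-ALU vs. single-ALU split):

    ;; Two 128-bit vector integer ALUs (illustrative names).
    (define_automaton "spacemit_x60_vec")
    (define_cpu_unit "x60_valu0,x60_valu1" "spacemit_x60_vec")

    ;; Ops that can issue to either ALU: an LMUL=1 (256-bit) op in 1c.
    (define_insn_reservation "x60_vialu_m1" 1
      (and (eq_attr "tune" "spacemit_x60")
           (and (eq_attr "type" "vialu")
                (eq_attr "vector_lmul" "m1")))
      "x60_valu0|x60_valu1")

    ;; Ops restricted to one ALU (shifts?): double-pumped, 2c observed.
    (define_insn_reservation "x60_vshift_m1" 2
      (and (eq_attr "tune" "spacemit_x60")
           (and (eq_attr "type" "vshift")
                (eq_attr "vector_lmul" "m1")))
      "x60_valu1*2")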

I have no clue how the vector FP unit works, but I'd be somewhat surprised if under the hood it's also a 128-bit data path with dual units giving the illusion of a 256-bit data path with shorter latencies.

Jeff
