On 2/4/2026 9:21 AM, Nikola Ratkovac wrote:
This patch introduces a vector cost model for the Spacemit-X60 core.
The model is LMUL-aware: measurements show that vector instruction
latency and throughput vary significantly with LMUL, so the model
distinguishes between the m1/m2/m4/m8 cases.
To keep the machine description manageable, a new 'vector_lmul'
attribute is introduced to map RVV modes to their corresponding LMUL
values. The costs are based on llvm-mca performance simulations
and microbenchmarks, with additional stress tests used to validate
and adjust individual instruction types.
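To illustrate the idea, a minimal sketch of what such an attribute could look like in the machine description (the attribute value set, mode lists, and default here are illustrative, not the patch's actual definition):

```lisp
;; Hypothetical sketch: derive LMUL from the insn's mode attribute.
;; The mode lists below are abbreviated examples, not exhaustive.
(define_attr "vector_lmul" "m1,m2,m4,m8,unknown"
  (cond [(eq_attr "mode" "RVVM1SI,RVVM1DI,RVVM1SF,RVVM1DF")
           (const_string "m1")
         (eq_attr "mode" "RVVM2SI,RVVM2DI,RVVM2SF,RVVM2DF")
           (const_string "m2")
         (eq_attr "mode" "RVVM4SI,RVVM4DI,RVVM4SF,RVVM4DF")
           (const_string "m4")
         (eq_attr "mode" "RVVM8SI,RVVM8DI,RVVM8SF,RVVM8DF")
           (const_string "m8")]
        (const_string "unknown")))
```

Insn reservations can then test `(eq_attr "vector_lmul" "m8")` and so on, instead of enumerating modes in every reservation condition.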
On selected numerical benchmarks this results in performance improvements
of ~3%, while instruction counts remain effectively unchanged (<0.1%).
| Benchmark        | Metric | Trunk            | Vector Cost Model | Δ (%)  |
|------------------|--------|------------------|-------------------|--------|
| SciMark2-C       | cycles | 311,538,498,801  | 300,670,104,666   | -3.49% |
| tramp3d-v4       | cycles | 23,673,618,009   | 22,916,964,182    | -3.20% |
| FreeBench/neural | cycles | 471,768,472      | 454,850,594       | -3.59% |
Benchmarks were run from the LLVM test-suite
(MultiSource/Benchmarks) using:
taskset -c 0 perf stat -r 10 ./...
SciMark2-C, FreeBench/neural, and tramp3d-v4
were used as representative numerical workloads.
For tramp3d-v4, the workload parameters (--cartvis 1.0 0.0, --rhomin 1e-8,
-n 20) increase floating-point intensity and dependency pressure, placing
greater stress on the scheduler.
2026-02-04  Nikola Ratkovac  <[email protected]>

gcc/ChangeLog:

	* config/riscv/spacemit-x60.md: Add primary vector pipeline model
	for the Spacemit-X60 core.
	(vector_lmul): New attribute mapping machine modes to LMUL.
	(spacemit_x60_dummy): Rename from spacemi6_x60_dummy.
So at a high level, I would clamp all the reservations at 7 cycles --
beyond that the DFA blows up badly, which will significantly harm build
times for GCC itself. And in reality it's extremely difficult to find
enough independent instructions to fill latencies of more than a few
cycles, so the delta in code quality is minimal. It's fine to keep the
latency higher, but clamp the number of cycles in the reservation.
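Concretely, the decoupling would look something like this sketch (automaton, unit, and reservation names are illustrative, and the 16-cycle latency is a made-up stand-in for a measured value):

```lisp
;; Sketch only: the published latency stays at the measured value,
;; but the functional unit is reserved for at most 7 cycles, keeping
;; the DFA small.
(define_automaton "x60_vec")
(define_cpu_unit "x60_vpu" "x60_vec")

(define_insn_reservation "x60_vimul_m8" 16   ; latency seen by the scheduler
  (and (eq_attr "tune" "spacemit_x60")
       (and (eq_attr "type" "vimul")
            (eq_attr "vector_lmul" "m8")))
  "x60_vpu*7")                               ; unit busy for only 7 cycles
```

The scheduler still tries to cover the full 16-cycle latency, but the automaton only has to track 7 cycles of unit occupancy.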
Where are you getting the pipeline information from? LLVM? The C908
manuals? Something from SpacemIT?
I don't see that you try to handle the dual vector integer ALUs.
Essentially there are two 128-bit ALUs. Some operations can go into both
ALUs, some are restricted to either ALU0 or ALU1. When an operation is
handled in both, a 256-bit VLEN instruction has an observed latency of 1c
(i.e., something like a vadd with lmul1). When an operation is restricted
to just one of the ALUs, the observed latency is 2c because the data has
to be double-pumped (shifts I think fall into this category). At least
that's my understanding of how the unit works.
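If that understanding is right, the DFA could model it roughly like this (a sketch under those assumptions; unit names, the choice of ALU1 for shifts, and the type values used are illustrative):

```lisp
;; Sketch: two 128-bit integer vector ALUs.
(define_automaton "x60_vint")
(define_cpu_unit "x60_valu0,x60_valu1" "x60_vint")

;; Ops that can issue to either unit: 1c observed latency at lmul1.
(define_insn_reservation "x60_vialu" 1
  (and (eq_attr "tune" "spacemit_x60")
       (eq_attr "type" "vialu"))
  "x60_valu0|x60_valu1")

;; Ops restricted to one unit hold it for two cycles (double-pumped
;; halves of the 256-bit VLEN), giving the 2c observed latency.
(define_insn_reservation "x60_vshift" 2
  (and (eq_attr "tune" "spacemit_x60")
       (eq_attr "type" "vshift"))
  "x60_valu1*2")
```

The `|` alternative lets the automaton pick whichever ALU is free, while the `*2` reservation captures the structural hazard from double-pumping.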
I have no clue how the vector FP unit works, but I wouldn't be
surprised if under the hood it's also a 128-bit data path with dual
units giving the illusion of a 256-bit data path with shorter latencies.
Jeff