Hi Robin,

> > Good point. I understand "cost" is an abstract concept, so I wasn't sure if
> > it should scale with LMUL the same way hardware instruction's 
> > latency/throughput does.

> Yeah, generally "cost" is abstract and means different things at different 
> points.  For vector costing we want to compare scalar to vector, though, and 
> latency/throughput directly matter when compared to scalar.  Scalar costs are 
> normally simple, i.e. everything costs "1".  With a superscalar uarch we 
> would 
> multiply the vector costs by e.g. 2 to account for 4 scalar ALUs vs 2 vector 
> ALUs.  Once this scaling is out of the way we more or less directly compare 
> latency.  So if a vector op takes 4 cycles at LMUL1 it might take 8 cycles at 
> LMUL2.  This also means that in those 8 cycles 2x the number of scalar ops 
> could execute.

> We don't have this "scalar" scaling part right now, though.  Mostly because 
> we're unsure about which machine to target.  With better hardware 
> availability 
> this should change soon, though.

> Do you have a specific uarch in mind?

Thanks for the detailed feedback.

For high-performance OoO uarchs, the number of ALUs is usually reflected
directly in instruction throughput. So in theory we could derive the scalar
vs vector scaling factor from the CPU scheduling model.
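To make that concrete, here is a rough sketch (hypothetical names and structure, not actual GCC cost-model code) of how a scalar-vs-vector scaling factor derived from ALU counts could combine with the LMUL scaling described above:

```cpp
#include <cassert>

// Hypothetical parameters a uarch scheduling model could provide.
struct uarch_info
{
  int scalar_alus;   /* e.g. 4 scalar ALUs */
  int vector_alus;   /* e.g. 2 vector ALUs */
};

/* Cost of a vector op expressed in "scalar op" units: the op takes
   BASE_LATENCY * LMUL cycles, and in those cycles
   scalar_alus / vector_alus times as many scalar ops could have
   executed, so scale the cost by that ratio.  */
static int
vector_op_cost (const uarch_info &u, int base_latency, int lmul)
{
  int cycles = base_latency * lmul;
  return cycles * u.scalar_alus / u.vector_alus;
}
```

With the 4-vs-2 ALU example, a 4-cycle op costs 8 at LMUL1 and 16 at LMUL2, which matches the "2x the scalar ops could execute" reasoning quoted above.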

I looked at the GCC RISC-V scheduling models (spacemit-x60.md, xiangshan.md,
sifive-p600.md); they do define ALU unit counts. I also noticed that LLVM's
SchedMachineModel carries similar information (throughput, latency, and
resource units per instruction class). I'm not sure whether LLVM already uses
it for vector cost scaling, though.

Also, if this "scalar" scaling is hardcoded as a fixed value, it will always
be unfriendly to some uarchs: different CPUs have very different scalar/vector
ALU ratios. Ideally the factor should come from the CPU model.

Another concern: even the 4-scalar-ALUs-vs-2-vector-ALUs ratio may not be
sufficient for scaling. VLEN also matters: a vector op with VLEN=512 and one
with VLEN=128 shouldn't get the same cost scaling, since what we really want
to compare is the cost of processing the same amount of data. I know RVV is
VLA, but maybe we could start with a default of VLEN=128 and let users adjust
the scaling via an option.
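As a sketch of what I mean (again hypothetical, with an assumed baseline of VLEN=128 that an option could override), the cost could be normalized to a fixed amount of data:

```cpp
#include <cassert>

/* Hypothetical: normalize a cost to a fixed amount of data, so that a
   VLEN=512 op, which processes 4x the elements of a VLEN=128 op, is not
   charged the same per element.  BASELINE_VLEN is an assumed default
   that could be made adjustable via a command-line option.  */
static int
normalized_cost (int raw_cost, int vlen, int baseline_vlen = 128)
{
  /* A wider VLEN processes proportionally more data per op, so its
     per-data cost is proportionally lower.  */
  return raw_cost * baseline_vlen / vlen;
}
```

So an op costing 8 at VLEN=128 would be charged 2 at VLEN=512 for the same amount of data.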

As for available hardware: I don't think the current RISC-V chips on the
market are suitable for this kind of tuning; there's no server-class
superscalar OoO uarch comparable to ARM Neoverse or Intel/AMD parts yet.

That said, I'm actually tuning GCC for a high-performance OoO uarch via
FPGA emulation. If you have ideas to validate, feel free to send me the patch;
I'd be happy to run SPEC CPU2017 benchmarks and share the results.
This might help move things forward while we wait for real hardware.


Regards
Zhongyao
