On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
> I experimented with this patch, which allows removing a vfmv when a
> floating-point operand can be loaded directly from memory with a
> zero-stride vlse.
> In terms of benchmarks, I measured the following reductions in icount:
> * 503.bwaves: -4.0%
> * 538.imagick: -3.3%
> * 549.fotonik3d: -0.34%
> However, the icount for 507.cactuBSSN increased by 0.43%. In addition,
> measurements on the BPI board show that the patch actually increases
> execution times by 5 to 11%.
> This may still be beneficial for some uarchs but would have to be
> tunable, wouldn't it?
> Is it worth proceeding with this?
It's probably worth investigating. Do you happen to still have the A/B
binaries handy? I could throw them onto our design.
Austin and I tested the BPI for the zero-strided load idiom, but only on
the integer side, and it looked like it likely optimized those into a
single load plus an internal broadcast across the vector. So it's a bit
of a surprise to see it not performing well at all for FP.
Note there is an entry in the riscv_tune_param structure controlling the
zero-stride idiom. So you could test that quite easily and, assuming the
port has things defined properly, it would just work.
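
As a rough illustration of what gating this on a per-core tunable could
look like, here is a minimal sketch in plain C. Everything in it is
hypothetical: struct tune_sketch, its use_zero_stride_load field and
choose_broadcast() are made-up names for illustration and are not the
actual riscv_tune_param definition or any existing GCC function. The
strided_load_f32() helper only models what a strided load does, so that
the stride-0 case visibly acts as a broadcast.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-core tuning table entry.
   This is NOT GCC's riscv_tune_param; the field name is assumed.  */
struct tune_sketch
{
  bool use_zero_stride_load;
};

/* The two ways of splatting a scalar FP value across a vector.  */
enum broadcast_kind
{
  BROADCAST_VLSE_ZERO_STRIDE, /* single vlse with stride 0 */
  BROADCAST_SCALAR_LOAD_VFMV  /* scalar FP load followed by vfmv */
};

/* Pick the zero-stride form only when the operand lives in memory
   and the (hypothetical) tuning flag says the core handles it well.  */
static enum broadcast_kind
choose_broadcast (const struct tune_sketch *tune, bool operand_in_memory)
{
  if (operand_in_memory && tune->use_zero_stride_load)
    return BROADCAST_VLSE_ZERO_STRIDE;
  return BROADCAST_SCALAR_LOAD_VFMV;
}

/* Reference model of a strided load: element i is read from
   base + i * stride_bytes.  With a stride of 0 every element reads
   the same address, which is the broadcast idiom.  */
static void
strided_load_f32 (float *dst, const float *base,
                  ptrdiff_t stride_bytes, size_t vl)
{
  const char *p = (const char *) base;
  for (size_t i = 0; i < vl; i++)
    dst[i] = *(const float *) (p + (ptrdiff_t) i * stride_bytes);
}
```

With a flag like this, a core where the zero-stride form is slow would
simply leave it off and keep the scalar load + vfmv sequence.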
Jeff