On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
I experimented with this patch, which allows removing a vfmv when a floating-point operand can be loaded directly from memory with a zero-stride vlse.

In terms of benchmarks, I measured the following reductions in icount:
* 503.bwaves: -4.0%
* 538.imagick: -3.3%
* 549.fotonik3d: -0.34%

However, the icount for 507.cactuBSSN increased by 0.43%. In addition, measurements on the BPI board show that the patch actually increases execution times by 5 to 11%.

This may still be beneficial for some uarchs but would have to be tunable, wouldn't it?
Is it worth proceeding with this?

It's probably worth investigating. Do you happen to have A/B binaries handy still? I could throw them onto our design.

Austin and I tested the BPI for the zero-strided load idiom, but only on the integer side, and it looked like it supported optimizing those into a single load plus an internal broadcast across the vector. So it's a bit of a surprise to see it not performing well at all for FP.
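
For anyone following along who hasn't looked at the idiom closely, here is a rough scalar model in C of the two sequences being compared. It is only a sketch of the semantics, not code from the patch; the function names model_vlse32 and model_flw_vfmv are made up for illustration, and it assumes 32-bit floats.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scalar model of a strided FP vector load (roughly what vlse32.v
   computes): element i is read from base + i * stride_bytes.
   With stride_bytes == 0 every element comes from the same address,
   which is the zero-stride broadcast idiom.  */
void
model_vlse32 (float *dst, const void *base, ptrdiff_t stride_bytes, size_t vl)
{
  assert (vl == 0 || (dst != NULL && base != NULL));
  const char *p = (const char *) base;
  for (size_t i = 0; i < vl; i++)
    memcpy (&dst[i], p + (ptrdiff_t) i * stride_bytes, sizeof (float));
}

/* Scalar model of the two-instruction alternative: a scalar FP load
   followed by a splat of that register (the vfmv being removed).  */
void
model_flw_vfmv (float *dst, const float *src, size_t vl)
{
  assert (vl == 0 || (dst != NULL && src != NULL));
  float f = *src;               /* scalar load */
  for (size_t i = 0; i < vl; i++)
    dst[i] = f;                 /* broadcast to every element */
}
```

Both functions produce the same vector when the stride is zero; the open question is purely which form a given uarch executes faster.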

Note there is an entry in the riscv_tune_param structure controlling the zero-stride idiom. So you could test that quite easily and, assuming the port has things defined properly, it would just work.
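
To illustrate the kind of gating I mean, here is a minimal sketch in C. Everything in it is hypothetical: struct tune_model, its field zero_stride_broadcast_ok, and choose_fp_broadcast are stand-ins invented for this example and are not the actual names in riscv_tune_param or anywhere else in the port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU tuning record.  The field name is
   illustrative only and is NOT copied from GCC's riscv_tune_param.  */
struct tune_model
{
  bool zero_stride_broadcast_ok;
};

enum broadcast_seq
{
  SEQ_FLW_VFMV,          /* scalar load + vfmv splat */
  SEQ_ZERO_STRIDE_VLSE   /* single zero-stride vlse */
};

/* Pick the broadcast sequence for an FP operand.  The zero-stride form
   is chosen only when the operand is in memory and the tuning record
   says the target handles the idiom well.  */
enum broadcast_seq
choose_fp_broadcast (const struct tune_model *tune, bool operand_in_memory)
{
  assert (tune != NULL);
  if (operand_in_memory && tune->zero_stride_broadcast_ok)
    return SEQ_ZERO_STRIDE_VLSE;
  return SEQ_FLW_VFMV;
}
```

With something along these lines, a uarch that regresses (as the BPI numbers suggest) would simply leave the flag off and keep the existing sequence.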

Jeff

