On 9/23/25 13:39, Jeff Law wrote:
>
> On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
>> I experimented with this patch which allows removing a vfmv when a
>> floating-point op can be loaded directly from memory with a zero-stride
>> vlse.
>>
>> In terms of benchmarks, I measured the following reductions in icount:
>> * 503.bwaves: -4.0%
>> * 538.imagick: -3.3%
>> * 549.fotonik3d: -0.34%
>>
>> However, the icount for 507.cactuBSSN increased by 0.43%. In addition,
>> measurements on the BPI board show that the patch actually increases
>> execution times by 5 to 11%.
>>
>> This may still be beneficial for some uarchs but would have to be
>> tunable, wouldn't it?
>> Is it worth proceeding with this?
> It's probably worth investigating. Do you happen to have A/B binaries
> handy still? I could throw them onto our design.

FWIW they will perform poorly on our design: similar to integer
zero-stride loads for broadcasts.

> Austin and I tested the BPI for the zero-strided load idiom, but just on
> the integer side and it looked like it likely supported optimizing those
> into a single load + an internal broadcast across the vector. So it's a
> bit of a surprise to see it not performing well at all for FP.
>
> Note there is an entry in the riscv_tune_param structure controlling the
> zero-stride idiom. So you could test that quite easily and assuming the
> port had things defined properly it would just work.

Thx,
-Vineet
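
For readers following the thread, here is a rough sketch of why the two idioms are interchangeable in terms of results and differ only in cost. It is a plain Python model of element semantics, not GCC code and not a timing model. The helper names `vlse32` and `scalar_load_then_broadcast` are made up for illustration, and 32-bit little-endian floats are assumed.

```python
import struct


def vlse32(memory: bytes, base: int, stride: int, vl: int) -> list[float]:
    """Model a strided vector load: element i is read from base + i * stride."""
    return [struct.unpack_from("<f", memory, base + i * stride)[0] for i in range(vl)]


def scalar_load_then_broadcast(memory: bytes, base: int, vl: int) -> list[float]:
    """Model a scalar FP load followed by a vfmv-style splat into vl elements."""
    scalar = struct.unpack_from("<f", memory, base)[0]
    return [scalar] * vl


# Four floats in memory; all values are exactly representable in binary32.
memory = struct.pack("<4f", 1.5, 2.5, 3.5, 4.5)

# Broadcast the float at byte offset 4 using both idioms.
splat_a = scalar_load_then_broadcast(memory, base=4, vl=4)
splat_b = vlse32(memory, base=4, stride=0, vl=4)  # stride 0: every element reads the same address

# For contrast, a stride of 4 bytes walks consecutive elements.
unit = vlse32(memory, base=0, stride=4, vl=4)
```

Both broadcasts yield the same vector, so choosing between them is purely a performance question. Whether the zero-stride form is handled as one load plus an internal broadcast or as `vl` separate accesses depends on the microarchitecture, which is why a per-tune setting makes sense here.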
