On 9/25/25 4:05 AM, Paul-Antoine Arras wrote:
Hi Jeff,

On 23/09/2025 22:39, Jeff Law wrote:
On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
I experimented with this patch, which allows removing a vfmv when a floating-point operand can be loaded directly from memory with a zero-stride vlse.
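
For readers following along, here is a minimal sketch of the kind of loop where the choice between a vfmv broadcast and a zero-stride vlse could come up. The function name `scale_add` and the kernel itself are hypothetical, and the assembly in the comment is only illustrative; none of it is taken from the patch, the benchmarks, or actual compiler output.

```c
#include <stddef.h>

/*
 * Hypothetical kernel (not taken from the patch or the benchmarks):
 * the scalar *k is read from memory and applied to every element.
 *
 * When vectorized for RVV, the scalar can reach a vector register
 * in two ways (illustrative assembly, not actual compiler output):
 *
 *   fld       fa5, 0(a2)        # scalar load ...
 *   vfmv.v.f  v1, fa5           # ... then broadcast to all elements
 *
 *   vlse64.v  v1, (a2), zero    # zero-stride load, no vfmv needed
 */
void scale_add(double *restrict y, const double *restrict x,
               const double *k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += *k * x[i];
}
```

The second form saves one instruction per broadcast, which is where an icount reduction would come from. Whether it is also faster depends on how the target handles zero-stride loads.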

In terms of benchmarks, I measured the following reductions in icount:
* 503.bwaves: -4.0%
* 538.imagick: -3.3%
* 549.fotonik3d: -0.34%

However, the icount for 507.cactuBSSN increased by 0.43%. In addition, measurements on the BPI board show that the patch actually increases execution times by 5 to 11%.

This may still be beneficial for some uarchs but would have to be tunable, wouldn't it?
Is it worth proceeding with this?
It's probably worth investigating.  Do you still happen to have the A/B binaries handy?  I could throw them onto our design.

Yes, you'll find attached the two binaries I built and tested on the BPI.
I built A/B binaries for bwaves and just ran input #1 on our design. The results roughly match yours: about a 5% regression in performance with a 5% improvement in icount.
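
As a rough worked check of what those two numbers imply together, the sketch below computes the relative cycles per instruction. It assumes that "5% regression" means about 5% longer run time and that both binaries run at the same clock frequency; the helper name `relative_cpi` is hypothetical.

```c
/*
 * Ratio of cycles per instruction (binary B relative to binary A),
 * assuming both run at the same clock frequency so that run time
 * scales with cycle count.
 *
 *   time_ratio   = time_B / time_A
 *   icount_ratio = icount_B / icount_A
 */
double relative_cpi(double time_ratio, double icount_ratio)
{
    return time_ratio / icount_ratio;
}
```

Under those assumptions, `relative_cpi(1.05, 0.95)` is about 1.105, so each instruction costs roughly 10% more cycles on average with the patch.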

We do have recognition of the zero-stride load idiom in our design and it works for integer sources. The fact that FP performs so poorly is quite a surprise, though this top-line behavior does match what we're seeing on the BPI as well.

I'm getting some data with perf record to see if there's perhaps something goofy going on that can be easily spotted. What doesn't make much sense here is that our LSU shouldn't really care about the underlying data types.

Jeff
