On 9/25/25 4:05 AM, Paul-Antoine Arras wrote:
Hi Jeff,
On 23/09/2025 22:39, Jeff Law wrote:
On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
I experimented with this patch, which allows removing a vfmv when a
floating-point operand can be loaded directly from memory with a
zero-stride vlse.
In terms of benchmarks, I measured the following reductions in icount:
* 503.bwaves: -4.0%
* 538.imagick: -3.3%
* 549.fotonik3d: -0.34%
However, the icount for 507.cactuBSSN increased by 0.43%. In
addition, measurements on the BPI board show that the patch actually
increases execution times by 5 to 11%.
This may still be beneficial for some uarchs but would have to be
tunable, wouldn't it?
Is it worth proceeding with this?
It's probably worth investigating. Do you happen to still have A/B
binaries handy? I could throw them onto our design.
Yes, you'll find attached the two binaries I built and tested on the BPI.
I built A/B binaries for bwaves and just ran input #1 on our design. The
results roughly match yours: about a 5% regression in performance with a
5% improvement in icount.
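To put those two numbers together, here is a rough back-of-the-envelope
sketch. It assumes a fixed clock, so execution time is proportional to
cycles, and it uses the rounded 5% figures above rather than exact
measurements. The function name is purely illustrative.

```python
def ipc_ratio(icount_ratio: float, time_ratio: float) -> float:
    """Relative IPC (patched / baseline), assuming a fixed clock.

    cycles = icount / IPC, and time is proportional to cycles, so
    IPC_new / IPC_old = (icount_new / icount_old) / (time_new / time_old).
    """
    return icount_ratio / time_ratio


# Rounded figures from the bwaves run: 5% fewer instructions, 5% slower.
ratio = ipc_ratio(icount_ratio=0.95, time_ratio=1.05)
print(f"IPC ratio (patched / baseline): {ratio:.3f}")
print(f"Implied IPC change: {(ratio - 1.0) * 100:.1f}%")
```

Under those assumptions, IPC drops by roughly 9 to 10%, so the patched
code loses more per instruction than the icount reduction gains back.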
We do have recognition of the zero stride load idiom in our design and
it works for integer sources. The fact that FP performs so poorly is
quite a surprise. That said, this top-line behavior does match what
we're seeing on the BPI as well.
I'm getting some data with perf record to see if there's perhaps
something goofy going on that can be easily spotted. What doesn't make
much sense here is that our LSU shouldn't really care about the
underlying data types.
Jeff