> I experimented with this patch which allows to remove a vfmv when a
> floating-point op can be loaded directly from memory with a zero-stride
> vlse.
>
> In terms of benchmarks, I measured the following reductions in icount:
> * 503.bwaves: -4.0%
> * 538.imagick: -3.3%
> * 549.fotonik3d: -0.34%
>
> However, the icount for 507.cactuBSSN increased by 0.43%. In addition,
> measurements on the BPI board show that the patch actually increases
> execution times by 5 to 11%.
>
> This may still be beneficial for some uarchs but would have to be
> tunable, wouldn't it?
> Is worth proceeding with this?

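For anyone following along without the patch at hand, here is a minimal sketch of the kind of loop in question. The function name scale_add and the instruction sequences named in the comments are only an illustration and an assumption on my part; they are not taken from the patch or from actual compiler output.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] += (*a) * x[i].
 *
 * The scalar *a is loop-invariant and lives in memory.  When this loop is
 * vectorized for RVV, the scalar can (as an assumption about the general
 * shape of the code, not verified output) reach the vector unit in two ways:
 *
 *   1. scalar load plus broadcast:   flw      fa5, 0(a2)
 *                                    vfmv.v.f v1, fa5
 *   2. zero-stride broadcast load:   vlse32.v v1, (a2), zero
 *
 * Form 2 saves one instruction (lower icount), but whether it is faster
 * depends on how the uarch implements zero-stride loads and on whether the
 * broadcast still ends up outside the loop.
 */
void scale_add(float *restrict y, const float *restrict x,
               const float *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += *a * x[i];
}
```

The point of the sketch is only that the broadcast is loop-invariant: if it is done once before the loop, its cost is negligible either way, whereas a broadcast that stays inside the loop body is paid on every iteration.
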
As we discussed before, icount can be treacherous, in particular with
"clever" patterns like these. That is the reason why we made the
zero-strided-load idiom tunable and didn't try to use it everywhere.
Such a big negative swing in real performance is still surprising, and
my gut feeling is that we stop hoisting something out of a loop.

I kind of agree that the unconditional mem handling contradicts the
design goals. It was the most straightforward way, though, to only use
strided broadcasts where "absolutely" necessary.

I guess an argument can be made to have mem operands "strided
broadcastable" instead of broadcastable, but then of course for both
integer and float. Consequently, the !strided_load_broadcast fallback
would need to be adjusted to cover not only HFmode but all modes.

Also, vv -> vx and strided broadcast oppose each other to some degree.
If we keep the mem (which helps IRA) until late we cannot propagate; if
we split early we don't go back to a vlse, and so on. That's all
manageable but requires a bit of balancing, and I'm not sure how useful
it is from a performance perspective.

My mental model is that for most uarchs strided load broadcast is at
best a nop performance-wise and at worst a degradation. Andrew mentioned
there are some that heavily favor the strided form, but we'd need
silicon to actually test that.

-- 
Regards
 Robin
