On 11/20/25 9:37 AM, Richard Biener wrote:
On 11/20/25 4:53 PM, Jeff Law <[email protected]> wrote:
On 11/19/25 9:42 AM, Andi Kleen wrote:
I know I was pushing for it to be enabled more widely as it's painfully hard
to forward from a narrow store to a wider load. But based on earlier
discussions I've backed off that position.
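For concreteness, a minimal C sketch of the narrow-store/wide-load pattern (the type and function names here are purely illustrative, not anything from the patch):

  /* Two 16-bit stores followed by a 32-bit load of the same bytes: the
     load needs data from both stores, so hardware store-to-load
     forwarding typically cannot satisfy it and the load waits for the
     stores to drain.  */
  union pair { unsigned short h[2]; unsigned int w; };

  unsigned int
  narrow_to_wide (union pair *p, unsigned short a, unsigned short b)
  {
    p->h[0] = a;   /* narrow store */
    p->h[1] = b;   /* narrow store */
    return p->w;   /* wider load covering both stores */
  }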
FWIW I would expect any slightly better OOO core aimed at general
purpose code to have some form of hardware support for a subset of the
cases.
The narrow-store-to-wide-load case is the problem space, even for OOO cores. I fully
expect any modern performance core to forward when the load can get all of its
data from a single prior store.
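For contrast, a load that gets all of its bytes from one prior store (again just an illustrative sketch):

  #include <string.h>

  /* The 4-byte store covers the 2-byte load completely, so the store
     buffer can forward the data and there is no stall.  */
  unsigned short
  contained_load (void *p, unsigned int v)
  {
    unsigned short r;
    memcpy (p, &v, sizeof v);   /* wide store */
    memcpy (&r, p, sizeof r);   /* narrow load fully inside the store */
    return r;
  }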
The rules can be very complicated. As an example, see the diagram
in https://chipsandcheese.com/p/a-peek-at-sapphire-rapids
https://substackcdn.com/image/fetch/$s_!rESw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b17f38-0631-424d-8e05-7988f9b174f6_2559x1214.png
They don't look significantly more complex than I expected. Essentially, if the
load is fully contained within the store, it's forwarded, with a possible
penalty if the start addresses don't match exactly, but it's still forwarded.
If there's only a partial overlap, no store-to-load forwarding occurs and you
take the full 19-cycle penalty.
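I.e. something like this made-up sketch would hit the no-forwarding case:

  #include <string.h>

  /* The 4-byte load overlaps only bytes 2-3 of the 2-byte store, so the
     core cannot forward at all and the load takes the full penalty.  */
  unsigned int
  partial_overlap (unsigned char *p, unsigned short v)
  {
    unsigned int r;
    memcpy (p + 2, &v, sizeof v);   /* 2-byte store at p+2 */
    memcpy (&r, p, sizeof r);       /* 4-byte load at p, partially overlapping */
    return r;
  }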
There’s also the strategy of increasing the issue distance between the store and the load.
Some OOO implementations now try to anticipate the conflict and delay the load. The compiler
could do its own thing here during scheduling (which usually works toward the contrary goal of
delaying stores and issuing loads as early as possible).
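As a rough illustration of what the scheduler could do (hypothetical names again), placing independent work between the stores and the dependent load:

  /* If independent work can be scheduled between the narrow stores and
     the wide load, the stores may have drained from the store buffer by
     the time the load issues, hiding the failed-forwarding penalty.  */
  union pair2 { unsigned short h[2]; unsigned int w; };

  unsigned int
  spaced (union pair2 *p, unsigned short a, unsigned short b, unsigned int x)
  {
    unsigned int t = 0;
    p->h[0] = a;                    /* narrow stores */
    p->h[1] = b;
    for (int i = 0; i < 8; i++)     /* independent work grows the issue distance */
      t += x + (unsigned int) i;
    return p->w + t;                /* the dependent wide load issues later */
  }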
Yup. Memory dependence predictors based on TAGE should be commonplace
going forward.
Jeff