> I know I was pushing for it to be enabled more widely as it's painfully hard
> to forward from a narrow store to a wider load.  But based on earlier
> discussions I've backed off that position.

FWIW I would expect any reasonably capable out-of-order (OOO) core
aimed at general-purpose code to have some form of hardware support
for a subset of the store-to-load forwarding cases.

So unless you're talking about very simple CPUs like microcontrollers
or DSPs, there will be a non-trivial cost function.

> 
> Or we could have a target hook indicating the cost of a narrow store
> followed by a wide load and base decisions off that.

The rules can be very complicated. As an example, see the diagram
in https://chipsandcheese.com/p/a-peek-at-sapphire-rapids

https://substackcdn.com/image/fetch/$s_!rESw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b17f38-0631-424d-8e05-7988f9b174f6_2559x1214.png

It's probably possible to write a tool that automatically generates a
cost matrix for each core using timing tests.
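
To make the timing-test idea concrete, here is a rough sketch of one
row of such a matrix: a dependent chain that stores N bytes and then
loads 8 bytes from the same address. All names (union slot,
DEFINE_PROBE, probe_st8_ld64, STFWD_DEMO_MAIN, ...) are made up for
illustration and nothing here exists in GCC. It only covers aligned
accesses at offset 0 and reports wall-clock nanoseconds, so a real tool
would also need to vary offsets and alignment, pin the CPU, and convert
to cycles.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std= modes */
#endif
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper type: all members start at offset 0, so every
   store/load pair below overlaps in memory. */
union slot { uint8_t w8; uint16_t w16; uint32_t w32; uint64_t w64; };

/* Defines NAME(iters, sink): runs a dependent chain of
   "store member ST, then load member LD" and returns the average
   nanoseconds per store+load round trip. The loaded value feeds the
   next store, so each iteration waits for the previous one. */
#define DEFINE_PROBE(NAME, ST, LD)                                        \
double NAME(long iters, uint64_t *sink)                                   \
{                                                                         \
    volatile union slot s;   /* volatile: keep every access in memory */  \
    struct timespec t0, t1;                                               \
    uint64_t v = 0;                                                       \
    s.w64 = 0;                                                            \
    clock_gettime(CLOCK_MONOTONIC, &t0);                                  \
    for (long i = 0; i < iters; i++) {                                    \
        s.ST = v + 1;        /* store depends on the previous load */     \
        v = s.LD;            /* load overlaps the bytes just stored */    \
    }                                                                     \
    clock_gettime(CLOCK_MONOTONIC, &t1);                                  \
    *sink = v;               /* hand the final value back to the caller */\
    return ((t1.tv_sec - t0.tv_sec) * 1e9                                 \
            + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;                 \
}

/* One matrix row: narrow-to-wide cases plus a same-size baseline. */
DEFINE_PROBE(probe_st8_ld64,  w8,  w64)
DEFINE_PROBE(probe_st16_ld64, w16, w64)
DEFINE_PROBE(probe_st32_ld64, w32, w64)
DEFINE_PROBE(probe_st64_ld64, w64, w64)

#ifdef STFWD_DEMO_MAIN
int main(void)
{
    const long n = 10000000;
    uint64_t sink;

    printf("st8  -> ld64: %.2f ns\n", probe_st8_ld64(n, &sink));
    printf("st16 -> ld64: %.2f ns\n", probe_st16_ld64(n, &sink));
    printf("st32 -> ld64: %.2f ns\n", probe_st32_ld64(n, &sink));
    printf("st64 -> ld64: %.2f ns\n", probe_st64_ld64(n, &sink));
    return 0;
}
#endif
```

Building with -O2 -DSTFWD_DEMO_MAIN gives a standalone program. The
interesting number is the difference between the narrow-store rows and
the st64 -> ld64 baseline, not the absolute values, which also include
the add and loop overhead.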

Another option would be to drive it from AutoFDO profile data, which
could even cover cases that go through indirect pointer references.

-Andi
