> I know I was pushing for it to be enabled more widely as it's painfully hard
> to forward from a narrow store to a wider load. But based on earlier
> discussions I've backed off that position.
FWIW I would expect any OOO core beyond the most basic, aimed at general
purpose code, to have some form of hardware support for a subset of the
cases. So unless you're talking about very simple CPUs like microcontrollers
or DSPs, there will be a non-trivial cost function.

>
> Or we could have a target hook indicating the cost of a narrow store
> followed by a wide load and base decisions off that.

The rules can be very complicated. As an example, see the diagram in

https://chipsandcheese.com/p/a-peek-at-sapphire-rapids
https://substackcdn.com/image/fetch/$s_!rESw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b17f38-0631-424d-8e05-7988f9b174f6_2559x1214.png

It's probably possible to write a tool that auto-generates a cost matrix for
each core using timing tests.

Another option would be to drive it from autofdo; that could even cover
indirect pointer reference cases.

-Andi
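
To make the timing-test idea a bit more concrete, here is a rough sketch of a single cell of such a cost matrix: a store of a given width into an 8-byte slot, followed by a full 8-byte load of the same slot. It is only an illustration, not a proposed tool. The helper name `store_then_wide_load_ns`, the `slot` union and the use of POSIX `clock_gettime` are all assumptions of the sketch, and a real tool would need CPU pinning, warm-up, cycle counters and loads that cross cache lines.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime (POSIX assumed) */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* One 8-byte slot that can be written at several widths.  Reading a
   different union member than the one last written is permitted in C. */
union slot {
    uint64_t q;
    uint32_t d[2];
    uint16_t w[4];
    uint8_t  b[8];
};

/* volatile keeps the compiler from merging or dropping the accesses. */
static volatile union slot slot;

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Hypothetical helper: average ns per iteration of a `width`-byte store at
   byte offset `off`, followed by an 8-byte load of the whole slot. */
static double store_then_wide_load_ns(unsigned width, unsigned off, long iters)
{
    uint64_t v = 0;

    assert(width == 1 || width == 2 || width == 4 || width == 8);
    assert(off % width == 0 && off + width <= 8);
    assert(iters > 0);

    slot.q = 0;
    double start = now_ns();
    for (long i = 0; i < iters; i++) {
        /* The stored value depends on the previous load, so iterations
           form a chain and the loop time roughly tracks the
           store-to-load latency (plus a well-predicted branch). */
        switch (width) {
        case 1:  slot.b[off]     = (uint8_t)(v + 1);  break;
        case 2:  slot.w[off / 2] = (uint16_t)(v + 1); break;
        case 4:  slot.d[off / 4] = (uint32_t)(v + 1); break;
        default: slot.q          = v + 1;             break;
        }
        v = slot.q;  /* wide load overlapping the store just issued */
    }
    double end = now_ns();
    (void)v;
    return (end - start) / (double)iters;
}
```

Calling this for each width in {1, 2, 4, 8} and each aligned offset, and comparing against the width-8 case (where forwarding is expected to work), would give one row of the matrix per load width. The absolute numbers are only meaningful relative to each other on the same core.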
