On Fri, Oct 01, 2021 at 11:29:43AM -0700, Dan Williams wrote:
> My read is that the guest gets virtual #MC on an access to that page.
> When the guest tries to do set_memory_uc() and instructs cpa_flush()
> to clean caches, that results in taking another fault / exception
> perhaps because the VMM unmapped the page from the guest? If the guest
> had flipped the page to NP then cpa_flush() says "oh, no caching
> change, skip the clflush() loop".

... and the CLFLUSH is the insn which caused the second MCE because it
"appeared that the guest was accessing the bad page."

Uuf, that could be. Nasty.
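To make the two cases concrete, here is a toy userspace model of the
flush decision described above. It is only a sketch and not the kernel's
cpa_flush(): struct toy_pte and toy_pages_to_clflush() are made-up names,
and the "only flush present mappings" check is my assumption about how
the real loop behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a page-table entry. */
struct toy_pte {
	bool present;	/* models the present bit of a mapping */
};

/*
 * Toy model of the flush decision, NOT the kernel's cpa_flush().
 * Returns how many pages would get a CLFLUSH pass.
 *
 * cache_change: true when the attribute change alters caching
 *               (e.g. WB -> UC), false when it does not (e.g. the
 *               page was only flipped to not-present).
 */
static size_t toy_pages_to_clflush(const struct toy_pte *ptes, size_t n,
				   bool cache_change)
{
	size_t flushed = 0;

	assert(ptes != NULL || n == 0);

	/* "oh, no caching change, skip the clflush() loop" */
	if (!cache_change)
		return 0;

	for (size_t i = 0; i < n; i++) {
		/* Assumption: only present mappings are touched. */
		if (ptes[i].present)
			flushed++;
	}

	return flushed;
}
```

In this model the UC case walks the range and touches every present
page, which is where a second fault could come from, while the NP case
never reaches the loop at all.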
> Yeah, I thought UC would make the PMEM driver's life easier, but if it
> has to contend with an NP case at all, might as well make it handle
> that case all the time.
>
> Safe to say this patch of mine is woefully insufficient and I need to
> go look at how to make the guarantees needed by the PMEM driver so it
> can handle NP and set up alias maps.
>
> This was a useful discussion.

Oh yeah, thanks for taking the time!

> It proves that my commit:
>
> 284ce4011ba6 x86/memory_failure: Introduce {set, clear}_mce_nospec()
>
> ...was broken from the outset.

Well, the problem with hw errors is that it is always very hard to test
the code. But I hear injection works now soo... :-)

Thx.

--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette