On 9/21/23 19:41, Yazen Ghannam wrote:
On 9/20/23 7:13 AM, Joao Martins wrote:
On 18/09/2023 23:00, William Roche wrote:
So it looks like the mechanism works fine... unless the VM has migrated
between the SRAO error and the first time it really touches the poisoned
page to get an SRAR error ! In this case, its new address space
(created on the migration destination) will have a zero-page where we
had a poisoned page, and the AMD VM Kernel (that never dealt with the
SRAO) doesn't know about the poisoned page and will access the page
finding only zeros... We have memory corruption!
I don't understand this. Why would the page be zero? Even so, why would
that affect poison?
The migration of a VM moves the memory content from a source platform to
a destination. This is mainly the qemu processes reading the data and
replicating it on the destination. The source qemu, where a memory page
is poisoned, is (will be[*]) able to skip the poisoned pages it knows
about, indicating to the destination machine that it should populate the
associated page(s) with zeros, as there is no "poison destination page"
mechanism in place for this migration transfer.
Also, during page migration, does the data flow through the CPU core?
Sorry for the basic question. I haven't done a lot with virtualization.
Yes, in most cases (with the exception of RDMA) the data flows through
the CPU cores, because migration checks whether the areas to transfer
contain empty pages.
Please note that current AMD systems use an internal poison marker on
memory. This cannot be cleared through normal memory operations. The
only exception, I think, is to use the CLZERO instruction. This will
completely wipe a cacheline including metadata like poison, etc.
So the hardware should not (by design) lose track of poisoned data.
This would be better, but virtualization migration currently loses
track of this.
Which is not a problem for VMs where the kernel took note of the poison
and keeps track of it: this kernel will handle the poison locations it
knows about, signaling when these poisoned locations are touched.
It is a very rare window, but in order to fix it the most reasonable
course of action would be to make the AMD emulation deal with SRAO
errors, instead of ignoring them.
Do you agree with my analysis ?
In the case that SRAO errors aren't handled well in the kernel today[*] for
AMD, we could always add a migration blocker when we hit an AO SIGBUS, in case
ignoring is our only option. But this would be less ideal than propagating the
SRAO into the guest.
[*] Meaning knowing that handling the SRAO would generate a crash in the guest
Perhaps as an improvement, allow qemu to choose to propagate, should this
limitation be lifted, via a new -action value letting it ignore or propagate:
-action mce=none # default on Intel to propagate all MCE events to the guest
-action mce=ignore-optional # Ignore SRAO
Yes we may need to create something like that, but missing SRAO has
technical consequences too.
I suppose the second is also useful for ARM64 considering they currently ignore
SRAO events too.
Would an AMD platform generate an SRAO signal to a process
(SIGBUS/BUS_MCEERR_AO) in case of a real hardware error?
This would be useful to confirm.
There is no SRAO signal on AMD. The closest equivalent may be a
"Deferred" error interrupt. This is an x86 APIC LVT interrupt, and it's
sent when a deferred (uncorrectable, non-urgent) error is detected by an
MCA bank. In this case, the CPU will get the interrupt and log the error
(in the MCA registers).
An enhancement would be to take the MCA error information collected
during the interrupt and extract useful data. For example, we'll need to
translate the reported address to a system physical address that can be
mapped to a page.
This would be great, as it would mean that a kernel running in a VM can
get notified too.
Once we have the page, then we can decide how we want to signal the
process(es). We could get a deferred/AO error in the host, and signal the
guest with an AR. So the guest handling could be the same in both cases.
Would this be okay? Or is it important that the guest can distinguish
between the AO/AR cases?
SIGBUS/BUS_MCEERR_AO and BUS_MCEERR_AR are not interchangeable; it is
important to distinguish them.
AO is an asynchronous signal that is only generated when the process
asked for it -- indicating that an error has been detected in its
address space but hasn't been touched yet.
Most processes don't care about that (and don't get notified); they
just continue to run, and if the poisoned area is never touched, all
is well.
Otherwise a BUS_MCEERR_AR signal is generated when the area is touched,
indicating that the execution thread can't access the location.
IOW, will guests have their own policies on
when to take action? Or is it more about allowing the guest to handle
the error less urgently?
Yes to both questions. Any process can indicate whether it wants to be
"early killed on MCE" or not. See the proc(5) man page about
/proc/sys/vm/memory_failure_early_kill, and prctl(2) about
PR_MCE_KILL/PR_MCE_KILL_GET. Such a process can take action before it's
too late, that is, before it actually needs the poisoned data.
Now if an AMD system doesn't warn a process when a deferred error
occurs, and only generates SIGBUS/BUS_MCEERR_AR errors when the poison
is touched, it means that its processes don't benefit from an "early
kill" and can't take action to anticipate a synchronous error.
In such a case, ignoring BUS_MCEERR_AO would just help qemu not to
crash on "fake/software/injected" signals. And reading the entire
memory (as a migration does) would need to be extra careful about the
more probable SIGBUS/BUS_MCEERR_AR signal, which makes the mechanism
more complicated but would make more sense for AMD and ARM64 too.
(Note that there are still cases where a BUS_MCEERR_AO-capable system
can miss an error that is revealed when reading the entire memory; in
this case we currently crash.)
[*] See my patch proposal for:
"Qemu crashes on VM migration after an handled memory error"
In other words, having the AMD kernel generate SIGBUS/BUS_MCEERR_AO
signals and making qemu on AMD able to relay them to the VM kernel
would make things better for AMD platforms ;)