Hi Mauro,

On 11/18/25 8:54 PM, Mauro Carvalho Chehab wrote:
On Tue, Nov 18, 2025 at 10:47:55AM +0000, Jonathan Cameron wrote:
On Thu, 13 Nov 2025 03:25:27 +1000
Gavin Shan <[email protected]> wrote:

In the combination of 64KiB host and 4KiB guest, a problematic host
page affects 16x guest pages. Those 16x guest pages are most likely
owned by separate threads and accessed by the threads in parallel.
It means 16x memory errors can be raised at once. However, we're
unable to handle this situation because the only error source has
a single read acknowledgement register in the current design. QEMU
crashes in the following path because the previously delivered error
hasn't been acknowledged by the guest when another error is delivered.

   kvm_vcpu_thread_fn
     kvm_cpu_exec
       kvm_arch_on_sigbus_vcpu
         kvm_cpu_synchronize_state
         acpi_ghes_memory_errors
         abort

This series fixes the issue by sending 16x consecutive CPER errors
which are contained in a single GHES error block.
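
To make the 16x figure concrete, here is a minimal sketch (not code from
the series) that enumerates the 4KiB guest pages covered by one poisoned
64KiB host page. The function name and the sample address are hypothetical,
and it assumes the host page is naturally aligned in the guest physical
address space.

```python
HOST_PAGE_SIZE = 64 * 1024   # 64KiB host page, as in the scenario above
GUEST_PAGE_SIZE = 4 * 1024   # 4KiB guest page

def guest_pages_in_host_page(fault_addr,
                             host_page_size=HOST_PAGE_SIZE,
                             guest_page_size=GUEST_PAGE_SIZE):
    """Return the base address of every guest-sized page inside the
    host-sized page that contains fault_addr (hypothetical helper)."""
    # Round the faulting address down to the start of its host page.
    host_base = fault_addr & ~(host_page_size - 1)
    # Step through the host page one guest page at a time.
    return [host_base + off
            for off in range(0, host_page_size, guest_page_size)]

# Illustrative address only; each entry would need its own error record.
pages = guest_pages_in_host_page(0x40012345)
```

Each returned address corresponds to one guest page that may be touched by
a separate thread, which is why up to 16 records can be pending at once.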

PATCH[1-4] Increases GHES raw data maximal length from 1KiB to 4KiB
PATCH[5]   Supports multiple error records in a single error block
PATCH[6-7] Improves the error handling in the error delivery path
PATCH[8]   Sends 16x consecutive CPERs in a single block if needed


Hi Gavin,

Just a quick heads-up to say we've had some internal discussions around the
kernel handling of broader address masks in CPER and think it is probably
broken. Rectifying that may at least simplify what is needed on the QEMU side
of things and maybe even handle much larger blocks (2M and larger).

Btw, I just added logic to rasdaemon to catch SIGBUS errors:
https://github.com/mchehab/rasdaemon/pull/199

But so far, I haven't found a proper way to test that code.

Jonathan/Gavin,

Do you know a good way for us to check how the mm SEA notification
is handled with QEMU?


Sorry, I'm not familiar with rasdaemon. Could you please provide more
context about your question?

Thanks,
Gavin

