On Sun, 19 Oct 2025 10:36:16 +1000 Gavin Shan <[email protected]> wrote:
> Hi Igor,
> 
> On 10/18/25 12:27 AM, Igor Mammedov wrote:
> > On Tue, 7 Oct 2025 16:08:10 +1000
> > Gavin Shan <[email protected]> wrote:
> > 
> >> In the combination of 64KB host and 4KB guest, a problematic host page
> >> affects 16x guest pages. In this specific case, it's reasonable to
> >> push 16 consecutive memory CPERs. Otherwise, QEMU can run into a core
> >> dump because the current error can't be delivered while the previous
> >> error isn't acknowledged. It's caused by the fact that the host page
> >> can be accessed in parallel due to the mismatched host and guest page
> >> sizes.
> > 
> > can you explain a bit more what goes wrong?
> > 
> > I'm especially interested in the parallel access you've mentioned,
> > and why batch adding error records is needed
> > as opposed to adding records every time an invalid access happens?
> > 
> > PS:
> > Assume I don't remember the details of how HEST works;
> > answering it in this format should also improve the commit message,
> > making it more digestible for the uninitiated.
> > 
> 
> Thanks for your review. I'm trying to answer your question below. Please
> let me know if there are more questions.
> 
> There are two signals (BUS_MCEERR_AR and BUS_MCEERR_AO), and BUS_MCEERR_AR
> is the one of concern here. BUS_MCEERR_AR is sent by the host's stage-2
> page fault handler when the resolved host page has been marked as
> poisoned. The stage-2 page fault handler is invoked on every access to
> the host page.
> 
> In the combination where the host and guest have 64KB and 4KB pages
> respectively, a 64KB host page corresponds to 16x consecutive 4KB guest
> pages. It means we're accessing the 64KB host page when any of those
> 16x consecutive 4KB guest pages is accessed. In other words, a
> problematic 64KB host page affects the accesses to 16x 4KB guest pages.
> Those 16x 4KB guest pages can be owned by different guest threads
> running in parallel, which may access them in parallel, potentially
> leading to 16x BUS_MCEERR_AR signals at one point.
> 
> In the current implementation, the error record is built as the
> following calltrace indicates. There are 16 error records in the extreme
> case (parallel accesses to 16x 4KB guest pages, mapped to one 64KB host
> page). However, we can't handle multiple error records at once due to
> the acknowledgement mechanism in ghes_record_cper_errors(): when the
> first error record has been sent but not yet consumed by the guest, we
> fail to send the second error record.
> 
>   kvm_arch_on_sigbus_vcpu
>     acpi_ghes_memory_errors
>       ghes_gen_err_data_uncorrectable_recoverable   // Generic Error Data Entry
>       acpi_ghes_build_append_mem_cper               // Memory Error
>       ghes_record_cper_errors
> 
> So this series improves this situation by simply sending 16x error
> records in one shot for the combination of 64KB host + 4KB guest.

1) What I'm concerned about is that it targets one specific case only.
Imagine the 1st CPU gets an error on page1 and another CPU on
page2 = (page1 + host_page_size), and so on for other CPUs.
Then we are back where we were before this series.
Also, in an abstract future when ARM gets 1GB pages, that won't scale well.

Can we, instead of making up CPERs to cover the whole host page,
create one GHES source per vCPU?
That way, when a vCPU trips over a bad page, it would have
its own error status block to put errors in.
That would address [1] and deterministically scale
(well, assuming that multiple SEA error sources are possible in theory).
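To make that concrete, a rough standalone sketch of the idea (all names
and the layout are made up for illustration; it models only the
per-source read-ack gating, not the real GHESv2 tables):

/*
 * Hypothetical model: one error status block per vCPU, each guarded
 * by its own read-ack flag.  A pending (unacknowledged) block on one
 * vCPU no longer blocks reporting on another vCPU.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <inttypes.h>

#define NR_VCPUS 4

struct ghes_block {
    bool     pending;     /* read-ack register not yet written by guest */
    uint64_t error_paddr; /* address recorded in the block */
};

static struct ghes_block blocks[NR_VCPUS]; /* one source per vCPU */

static int report_error(int vcpu, uint64_t paddr)
{
    struct ghes_block *b = &blocks[vcpu];

    if (b->pending) {
        /* collides only with an earlier error on the same vCPU */
        return -1;
    }
    b->pending = true;
    b->error_paddr = paddr;
    printf("vcpu%d: recorded error at 0x%" PRIx64 "\n", vcpu, paddr);
    return 0;
}

static void guest_ack(int vcpu)
{
    blocks[vcpu].pending = false; /* guest wrote the read-ack register */
}

int main(void)
{
    /* two vCPUs fault on the same poisoned 64KB host page */
    report_error(0, 0x40000000ULL);
    report_error(1, 0x40001000ULL); /* ok: separate status block */
    guest_ack(0);
    report_error(0, 0x40002000ULL); /* ok after acknowledgement */
    return 0;
}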
PS:
I also wonder what real HW does when it gets into a similar situation
(i.e. the error status block is not yet acknowledged but another async
error arrives for the same error source)?

> 
> Thanks,
> Gavin
> 
> 
> >> Improve push_ghes_memory_errors() to push 16x consecutive memory CPERs
> >> for this specific case. The maximal error block size is bumped to 4KB,
> >> providing enough storage space for those 16x memory CPERs.
> >> 
> >> Signed-off-by: Gavin Shan <[email protected]>
> >> ---
> >>  hw/acpi/ghes.c   |  2 +-
> >>  target/arm/kvm.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 46 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >> index 045b77715f..5c87b3a027 100644
> >> --- a/hw/acpi/ghes.c
> >> +++ b/hw/acpi/ghes.c
> >> @@ -33,7 +33,7 @@
> >>  #define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
> >>  
> >>  /* The max size in bytes for one error block */
> >> -#define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
> >> +#define ACPI_GHES_MAX_RAW_DATA_LENGTH (4 * KiB)
> >>  
> >>  /* Generic Hardware Error Source version 2 */
> >>  #define ACPI_GHES_SOURCE_GENERIC_ERROR_V2 10
> >> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> >> index c5d5b3b16e..3ecb85e4b7 100644
> >> --- a/target/arm/kvm.c
> >> +++ b/target/arm/kvm.c
> >> @@ -11,6 +11,7 @@
> >>   */
> >>  
> >>  #include "qemu/osdep.h"
> >> +#include "qemu/units.h"
> >>  #include <sys/ioctl.h>
> >>  
> >>  #include <linux/kvm.h>
> >> @@ -2433,10 +2434,53 @@ static void push_ghes_memory_errors(CPUState *c, AcpiGhesState *ags,
> >>                                      uint64_t paddr)
> >>  {
> >>      GArray *addresses = g_array_new(false, false, sizeof(paddr));
> >> +    uint64_t val, start, end, guest_pgsz, host_pgsz;
> >>      int ret;
> >>  
> >>      kvm_cpu_synchronize_state(c);
> >> -    g_array_append_vals(addresses, &paddr, 1);
> >> +
> >> +    /*
> >> +     * Sort out the guest page size from TCR_EL1, which can be modified
> >> +     * by the guest from time to time. So we have to sort it out
> >> +     * dynamically.
> >> +     */
> >> +    ret = read_sys_reg64(c->kvm_fd, &val, ARM64_SYS_REG(3, 0, 2, 0, 2));
> >> +    if (ret) {
> >> +        goto error;
> >> +    }
> >> +
> >> +    switch (extract64(val, 14, 2)) {
> >> +    case 0:
> >> +        guest_pgsz = 4 * KiB;
> >> +        break;
> >> +    case 1:
> >> +        guest_pgsz = 64 * KiB;
> >> +        break;
> >> +    case 2:
> >> +        guest_pgsz = 16 * KiB;
> >> +        break;
> >> +    default:
> >> +        error_report("unknown page size from TCR_EL1 (0x%" PRIx64 ")", val);
> >> +        goto error;
> >> +    }
> >> +
> >> +    host_pgsz = qemu_real_host_page_size();
> >> +    start = paddr & ~(host_pgsz - 1);
> >> +    end = start + host_pgsz;
> >> +    while (start < end) {
> >> +        /*
> >> +         * The precise physical address is provided for the affected
> >> +         * guest page that contains @paddr. Otherwise, the starting
> >> +         * address of the guest page is provided.
> >> +         */
> >> +        if (paddr >= start && paddr < (start + guest_pgsz)) {
> >> +            g_array_append_vals(addresses, &paddr, 1);
> >> +        } else {
> >> +            g_array_append_vals(addresses, &start, 1);
> >> +        }
> >> +
> >> +        start += guest_pgsz;
> >> +    }
> >> +
> >>      ret = acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC, addresses);
> >>      if (ret) {
> >>          goto error;
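For reference, the expansion loop from the patch above can be exercised
outside QEMU with a standalone sketch like the following (example
address and page sizes hard-coded here; in the patch, guest_pgsz comes
from TCR_EL1 and host_pgsz from qemu_real_host_page_size()):

/* Standalone model of the host-page -> guest-pages expansion. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static void expand_host_page(uint64_t paddr, uint64_t host_pgsz,
                             uint64_t guest_pgsz)
{
    uint64_t start = paddr & ~(host_pgsz - 1);
    uint64_t end = start + host_pgsz;

    while (start < end) {
        if (paddr >= start && paddr < start + guest_pgsz) {
            /* precise address for the guest page that faulted */
            printf("CPER address: 0x%" PRIx64 " (precise)\n", paddr);
        } else {
            /* page-aligned address for the other guest pages */
            printf("CPER address: 0x%" PRIx64 "\n", start);
        }
        start += guest_pgsz;
    }
}

int main(void)
{
    /* 64KB host page, 4KB guest pages, fault at 0x40003210 */
    expand_host_page(0x40003210ULL, 64 * 1024, 4 * 1024);
    return 0;
}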
