On Mon, 3 Nov 2025 09:02:54 +1000 Gavin Shan <[email protected]> wrote:
> On 10/31/25 11:55 PM, Igor Mammedov wrote:
> > On Sun, 19 Oct 2025 10:36:16 +1000
> > Gavin Shan <[email protected]> wrote:
> >> On 10/18/25 12:27 AM, Igor Mammedov wrote:
> >>> On Tue, 7 Oct 2025 16:08:10 +1000
> >>> Gavin Shan <[email protected]> wrote:
> >>>
> >>>> In the combination of 64KB host and 4KB guest, a problematic host page
> >>>> affects 16x guest pages. In this specific case, it's reasonable to
> >>>> push 16 consecutive memory CPERs. Otherwise, QEMU can run into a core
> >>>> dump because the current error can't be delivered while the previous
> >>>> error isn't acknowledged. This is caused by the fact that the host
> >>>> page can be accessed in parallel due to the mismatched host and guest
> >>>> page sizes.
> >>>
> >>> can you explain a bit more what goes wrong?
> >>>
> >>> I'm especially interested in the parallel access you've mentioned
> >>> and why batch adding error records is needed
> >>> as opposed to adding records every time an invalid access happens?
> >>>
> >>> PS:
> >>> Assume I don't remember the details of how HEST works.
> >>> Answering it in this format also should improve the commit message,
> >>> making it more digestible for the uninitiated.
> >>>
> >>
> >> Thanks for your review and I'm trying to answer your question below.
> >> Please let me know if there are more questions.
> >>
> >> There are two signals (BUS_MCEERR_AR and BUS_MCEERR_AO) and BUS_MCEERR_AR
> >> is the one concerned here. The signal BUS_MCEERR_AR is sent by the host's
> >> stage2 page fault handler when the resolved host page has been marked as
> >> poisoned. The stage2 page fault handler is invoked on every access to the
> >> host page.
> >>
> >> In the combination where the host and guest have 64KB and 4KB page sizes
> >> respectively, a 64KB host page corresponds to 16x consecutive 4KB guest
> >> pages. It means we're accessing the 64KB host page when any of those 16x
> >> consecutive 4KB guest pages is accessed. In other words, a problematic
> >> 64KB host page affects the accesses to 16x 4KB guest pages. Those 16x 4KB
> >> guest pages can be owned by different threads in the guest running in
> >> parallel, which can access those 16x 4KB guest pages in parallel,
> >> potentially leading to 16x BUS_MCEERR_AR signals at the same time.
> >>
> >> In the current implementation, the error record is built as the following
> >> calltrace indicates. There are 16 error records in the extreme case
> >> (parallel accesses to 16x 4KB guest pages, mapped to one 64KB host page).
> >> However, we can't handle multiple error records at once due to the
> >> acknowledgement mechanism in ghes_record_cper_errors(). For example, once
> >> the first error record has been sent but not yet consumed by the guest,
> >> we fail to send the second error record.
> >>
> >>   kvm_arch_on_sigbus_vcpu
> >>     acpi_ghes_memory_errors
> >>       ghes_gen_err_data_uncorrectable_recoverable  // Generic Error Data Entry
> >>       acpi_ghes_build_append_mem_cper              // Memory Error
> >>       ghes_record_cper_errors
> >>
> >> So this series improves this situation by simply sending 16x error records
> >> in one shot for the combination of 64KB host + 4KB guest.
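(For the uninitiated following along: the acknowledgement mechanism referred
to above is the per-source GHESv2 Read Ack register. Roughly, and very much
simplified compared to the real hw/acpi/ghes.c code, with read_ack_reg and
error_block standing in for the actual guest-visible fields, the gating that
makes the second record fail looks like this:)

  #include <errno.h>
  #include <stdint.h>
  #include <string.h>

  /* Simplified illustration only, not the actual ghes_record_cper_errors(). */
  static int record_cper_sketch(uint64_t *read_ack_reg,
                                void *error_block, size_t block_size,
                                const void *cper, size_t cper_len)
  {
      if (cper_len > block_size) {
          return -ENOSPC;
      }
      if (*read_ack_reg == 0) {
          /*
           * The guest hasn't acknowledged the previous error status block
           * yet, so a new record can't be delivered. With 16 racing
           * BUS_MCEERR_AR signals, records 2..16 would be rejected here.
           */
          return -EBUSY;
      }
      *read_ack_reg = 0;                    /* re-arm; the guest writes it back when done */
      memcpy(error_block, cper, cper_len);  /* hand the CPER to the guest */
      return 0;
  }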
> >
> > 1) What I'm concerned about is that it targets one specific case only.
> > Imagine the 1st CPU gets an error on page1 and another on
> > page2 = (page1 + host_page_size), and so on for other CPUs.
> > Then we are back where we were before this series.
> >
> > Also, in some abstract future when ARM gets 1GB pages, that won't scale well.
> >
> > Can we, instead of making up CPERs to cover the whole host page,
> > create 1 GHES source per vCPU?
> > That way, when a vCPU trips over a bad page, it would have its own
> > error status block to put errors in.
> > That would address [1] and scale deterministically
> > (well, assuming that multiple SEA error sources are possible in theory).
> >
>
> I think it's a good idea to have an individual error source for each vCPU.
> In this way, the read_ack_reg won't be a limitation. I hope Jonathan is ok
> with this scheme.
>
> Currently, we have two (fixed) error sources like below. I assume the index
> of the error source per vCPU will start from (ACPI_HEST_SRC_ID_QMP + 1),
> based on CPUState::cpu_index.

I'd suggest to ditch cpu_index and use arch_id instead.

> enum AcpiGhesSourceID {
>     ACPI_HEST_SRC_ID_SYNC,
>     ACPI_HEST_SRC_ID_QMP,    /* Use it only for QMP injected errors */
> };
>
> > PS:
> > I also wonder what real HW does when it gets in a similar situation
> > (i.e. the error status block is not yet acknowledged but another async
> > error arrives for the same error source)?
> >
>
> I believe real HW also has this specific issue. As for ARM64, I did some
> google searching and was told that the error follows the firmware-first
> policy and is handled by a component of trusted-firmware-a. However, I was
> unable to get the source code. So it's hard for me to know how this
> specific issue is handled there.

Perhaps Jonathan can help with finding out how real hw works around it?

My idea of using a per-CPU source is just speculation, based on the spec,
about how to work around the problem; I don't really know if the guest OS
will be able to handle it (i.e. whether it's viable still needs to be tested).

That also probably was a reason, in the previous review, why we should have
waited for multiple-sources support to be merged first, before this series.
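Something along these lines is what I have in mind, purely as a sketch: the
ACPI_HEST_SRC_ID_VCPU_BASE name and the helper below are invented here for
illustration and don't exist in QEMU, and a raw arch_id would likely need
compressing to a dense index first.

  #include <stdint.h>

  /* Speculative sketch only; none of this exists in QEMU today. */
  enum AcpiGhesSourceID {
      ACPI_HEST_SRC_ID_SYNC,
      ACPI_HEST_SRC_ID_QMP,        /* Use it only for QMP injected errors */
      ACPI_HEST_SRC_ID_VCPU_BASE,  /* first of the per-vCPU error sources */
  };

  static inline uint16_t ghes_source_id_for_vcpu(uint16_t dense_vcpu_index)
  {
      /*
       * One GHESv2 source, hence one error status block and one Read Ack
       * register, per vCPU. dense_vcpu_index would be derived from the
       * vCPU's arch_id (MPIDR-based on ARM) rather than used raw, since
       * arch_id values can be sparse.
       */
      return ACPI_HEST_SRC_ID_VCPU_BASE + dense_vcpu_index;
  }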
> Thanks,
> Gavin
>
> >>>> Improve push_ghes_memory_errors() to push 16x consecutive memory CPERs
> >>>> for this specific case. The maximal error block size is bumped to 4KB,
> >>>> providing enough storage space for those 16x memory CPERs.
> >>>>
> >>>> Signed-off-by: Gavin Shan <[email protected]>
> >>>> ---
> >>>>  hw/acpi/ghes.c   |  2 +-
> >>>>  target/arm/kvm.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> >>>>  2 files changed, 46 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >>>> index 045b77715f..5c87b3a027 100644
> >>>> --- a/hw/acpi/ghes.c
> >>>> +++ b/hw/acpi/ghes.c
> >>>> @@ -33,7 +33,7 @@
> >>>>  #define ACPI_HEST_ADDR_FW_CFG_FILE "etc/acpi_table_hest_addr"
> >>>>
> >>>>  /* The max size in bytes for one error block */
> >>>> -#define ACPI_GHES_MAX_RAW_DATA_LENGTH (1 * KiB)
> >>>> +#define ACPI_GHES_MAX_RAW_DATA_LENGTH (4 * KiB)
> >>>>
> >>>>  /* Generic Hardware Error Source version 2 */
> >>>>  #define ACPI_GHES_SOURCE_GENERIC_ERROR_V2 10
> >>>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> >>>> index c5d5b3b16e..3ecb85e4b7 100644
> >>>> --- a/target/arm/kvm.c
> >>>> +++ b/target/arm/kvm.c
> >>>> @@ -11,6 +11,7 @@
> >>>>   */
> >>>>
> >>>>  #include "qemu/osdep.h"
> >>>> +#include "qemu/units.h"
> >>>>  #include <sys/ioctl.h>
> >>>>
> >>>>  #include <linux/kvm.h>
> >>>> @@ -2433,10 +2434,53 @@ static void push_ghes_memory_errors(CPUState *c, AcpiGhesState *ags,
> >>>>                                      uint64_t paddr)
> >>>>  {
> >>>>      GArray *addresses = g_array_new(false, false, sizeof(paddr));
> >>>> +    uint64_t val, start, end, guest_pgsz, host_pgsz;
> >>>>      int ret;
> >>>>
> >>>>      kvm_cpu_synchronize_state(c);
> >>>> -    g_array_append_vals(addresses, &paddr, 1);
> >>>> +
> >>>> +    /*
> >>>> +     * Sort out the guest page size from TCR_EL1, which can be modified
> >>>> +     * by the guest from time to time. So we have to sort it out
> >>>> +     * dynamically.
> >>>> +     */
> >>>> +    ret = read_sys_reg64(c->kvm_fd, &val, ARM64_SYS_REG(3, 0, 2, 0, 2));
> >>>> +    if (ret) {
> >>>> +        goto error;
> >>>> +    }
> >>>> +
> >>>> +    switch (extract64(val, 14, 2)) {
> >>>> +    case 0:
> >>>> +        guest_pgsz = 4 * KiB;
> >>>> +        break;
> >>>> +    case 1:
> >>>> +        guest_pgsz = 64 * KiB;
> >>>> +        break;
> >>>> +    case 2:
> >>>> +        guest_pgsz = 16 * KiB;
> >>>> +        break;
> >>>> +    default:
> >>>> +        error_report("unknown page size from TCR_EL1 (0x%" PRIx64 ")", val);
> >>>> +        goto error;
> >>>> +    }
> >>>> +
> >>>> +    host_pgsz = qemu_real_host_page_size();
> >>>> +    start = paddr & ~(host_pgsz - 1);
> >>>> +    end = start + host_pgsz;
> >>>> +    while (start < end) {
> >>>> +        /*
> >>>> +         * The precise physical address is provided for the affected
> >>>> +         * guest page that contains @paddr. Otherwise, the starting
> >>>> +         * address of the guest page is provided.
> >>>> +         */
> >>>> +        if (paddr >= start && paddr < (start + guest_pgsz)) {
> >>>> +            g_array_append_vals(addresses, &paddr, 1);
> >>>> +        } else {
> >>>> +            g_array_append_vals(addresses, &start, 1);
> >>>> +        }
> >>>> +
> >>>> +        start += guest_pgsz;
> >>>> +    }
> >>>> +
> >>>>      ret = acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC, addresses);
> >>>>      if (ret) {
> >>>>          goto error;
> >>>
> >>
> >
>
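To make the loop above concrete with illustrative numbers: on a 64KB host
page starting at guest physical address 0x10000, a poisoned access at 0x13200
would produce 16 entries in the array: 0x10000, 0x11000, 0x12000, 0x13200
(the precise faulting address, since it falls inside that guest page),
0x14000, 0x15000, ..., 0x1f000.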
