On Mon, 3 Nov 2025 10:52:16 +0100 Igor Mammedov <[email protected]> wrote:
> On Mon, 3 Nov 2025 09:02:54 +1000
> Gavin Shan <[email protected]> wrote:
> 
> > On 10/31/25 11:55 PM, Igor Mammedov wrote:
> > > On Sun, 19 Oct 2025 10:36:16 +1000
> > > Gavin Shan <[email protected]> wrote:
> > >> On 10/18/25 12:27 AM, Igor Mammedov wrote:
> > >>> On Tue, 7 Oct 2025 16:08:10 +1000
> > >>> Gavin Shan <[email protected]> wrote:
> > >>> 
> > >>>> In the combination of a 64KB host and a 4KB guest, a problematic host
> > >>>> page affects 16x guest pages. In this specific case, it's reasonable
> > >>>> to push 16 consecutive memory CPERs. Otherwise, QEMU can run into a
> > >>>> core dump because the current error can't be delivered while the
> > >>>> previous error isn't acknowledged. It's caused by the fact that the
> > >>>> host page can be accessed in parallel due to the mismatched host and
> > >>>> guest page sizes.
> > >>>>
> > >>>
> > >>> can you explain a bit more what goes wrong?
> > >>>
> > >>> I'm especially interested in the parallel access you've mentioned
> > >>> and why batch adding error records is needed
> > >>> as opposed to adding records every time an invalid access happens?
> > >>>
> > >>> PS:
> > >>> Assume I don't remember the details of how HEST works.
> > >>> Answering it in this format should also improve the commit message,
> > >>> making it more digestible for the uninitiated.
> > >>>
> > >>
> > >> Thanks for your review; I'm trying to answer your question below.
> > >> Please let me know if there are more questions.
> > >>
> > >> There are two signals (BUS_MCEERR_AR and BUS_MCEERR_AO), and
> > >> BUS_MCEERR_AR is the one of concern here. This signal is sent by the
> > >> host's stage2 page fault handler when the resolved host page has been
> > >> marked as poisoned. The stage2 page fault handler is invoked on every
> > >> access to the host page.
> > >>
> > >> In the combination where the host and guest have 64KB and 4KB pages
> > >> respectively, a 64KB host page corresponds to 16x consecutive 4KB
> > >> guest pages. It means we're accessing the 64KB host page when any of
> > >> those 16x consecutive 4KB guest pages is accessed. In other words, a
> > >> problematic 64KB host page affects the accesses to 16x 4KB guest
> > >> pages. Those 16x 4KB guest pages can be owned by different threads on
> > >> the guest, which run in parallel and can potentially access those 16x
> > >> 4KB guest pages in parallel. This can potentially lead to 16x
> > >> BUS_MCEERR_AR signals at one point.
> > >>
> > >> In the current implementation, the error record is built as the
> > >> following calltrace indicates. There are 16 error records in the
> > >> extreme case (parallel accesses to 16x 4KB guest pages, mapped to one
> > >> 64KB host page). However, we can't handle multiple error records at
> > >> once due to the acknowledgement mechanism in
> > >> ghes_record_cper_errors(). For example, the first error record has
> > >> been sent, but not consumed by the guest yet. We fail to send the
> > >> second error record.
> > >>
> > >>   kvm_arch_on_sigbus_vcpu
> > >>     acpi_ghes_memory_errors
> > >>       ghes_gen_err_data_uncorrectable_recoverable  // Generic Error Data Entry
> > >>       acpi_ghes_build_append_mem_cper              // Memory Error
> > >>       ghes_record_cper_errors
> > >>
> > >> So this series improves this situation by simply sending 16x error
> > >> records in one shot for the combination of 64KB host + 4KB guest.
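
[Aside: the acknowledgement gate described above boils down to one Read Ack
register per error source. Below is a simplified, self-contained model of
why the second record gets refused - an illustration only, not the exact
hw/acpi/ghes.c code:]

/*
 * Simplified model of the Read Ack gating in ghes_record_cper_errors()
 * (illustrative sketch, not the exact QEMU code). One error status block
 * per error source, guarded by a single Read Ack register: if the guest
 * has not acknowledged the previous record, a new one is refused.
 */
#include <stdbool.h>
#include <stdio.h>

static bool read_ack = true;    /* nothing pending at start */

static int record_cper(unsigned long paddr)
{
    if (!read_ack) {
        /* Previous error not yet consumed by the guest: refuse. */
        fprintf(stderr, "OSPM hasn't acknowledged previous error, "
                "dropping CPER for 0x%lx\n", paddr);
        return -1;
    }
    read_ack = false;           /* guest must ack before the next record */
    printf("recorded CPER for 0x%lx\n", paddr);
    return 0;
}

int main(void)
{
    /*
     * 16 guest pages in one 64KB host page fault back-to-back: only the
     * first record is accepted until the guest writes Read Ack back.
     */
    for (unsigned long paddr = 0x40000000; paddr < 0x40010000; paddr += 0x1000) {
        record_cper(paddr);
    }
    return 0;
}

[With one shared error status block, any burst of faults beyond the first
is dropped until the guest acknowledges; that is the failure mode the
batching in this series works around.]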
> > 
> > > 1) What I'm concerned about is that it targets one specific case only.
> > > Imagine the 1st cpu gets an error on page1 and another on
> > > page2=(page1+host_page_size),
> > > and so on for other CPUs. Then we are back where we were before this
> > > series.
> > > 
> > > Also, in an abstract future when ARM gets 1GB pages, that won't scale
> > > well.
> > > 
> > > Can we, instead of making up CPERs to cover the whole host page,
> > > create 1 GHES source per vcpu?
> > > That way, when a vcpu trips over a bad page, it would have its own
> > > error status block to put errors in.
> > > That would address [1] and scale deterministically
> > > (well, assuming that multiple SEA error sources are possible in theory).
> > > 
> > 
> > I think it's a good idea to have an individual error source for each
> > vCPU. In this way, the read_ack_reg won't be a limitation. I hope
> > Jonathan is ok with this scheme.
> > 
> > Currently, we have two (fixed) error sources, like below. I assume the
> > index of the error source per vCPU will start from
> > (ACPI_HEST_SRC_ID_QMP + 1), based on CPUState::cpu_index.
> 
> I'd suggest ditching cpu_index and using arch_id instead.
> 
> > 
> > enum AcpiGhesSourceID {
> >     ACPI_HEST_SRC_ID_SYNC,
> >     ACPI_HEST_SRC_ID_QMP,    /* Use it only for QMP injected errors */
> > };
> > 
> > > 
> > > PS:
> > > I also wonder what real HW does when it gets into a similar situation
> > > (i.e. the error status block is not yet acknowledged but another async
> > > error arrives for the same error source)?

On a real hardware platform it can just queue them in firmware, or block
until the second CPU to arrive can proceed. Once the first is processed,
the second is then exposed. The particular case seen here, of a sudden
burst of errors in memory due to corruption of a larger range, doesn't
have an obvious equivalent, as the CPER record carries the granularity,
and you don't have the problem of 16 contiguous PA pages mapping to
non-contiguous GPAs.

> > > 
> > 
> > I believe real HW also has this specific issue. As to ARM64, I did some
> > Google searching and was told the error follows the firmware-first
> > policy and is handled by a component of TrustedFirmware-A. However, I
> > was unable to get the source code, so it's hard for me to know how this
> > specific issue is handled there.

First thing is that I'd never make any assumptions about consistency in
how this is handled on different arm arch platforms. The only consistency
is at the ACPI / UEFI spec level. Ours, for example, often have management
controllers for RAS handling that are completely independent of what is
running on the CPUs (e.g. TF-A). All that actually matters, though, is
that there is considerable software involved in marshalling this data, and
that marshalling is not always per PE. We have far fewer GHESv2 entries
than we have CPUs, but there are multiple of them, with different types of
error routed through particular subsets. I'm not sure of the exact way
this is done, but we don't share GHESv2 for fatal and non-fatal, for
instance. The kernel doesn't care what comes down each pipe, so I've never
looked at it closely.

> 
> Perhaps Jonathan can help with finding out how real hw works around it?
> 
> My idea of using a per-cpu source is just speculation, based on the spec,
> on how to work around the problem.
> I don't really know if the guest OS will be able to handle it (aka, it
> needs to be tested whether it's viable). That also probably was a reason,
> in a previous review, why we should've waited for multiple-sources
> support to be merged first, before this series.
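
[Aside: to make the per-vCPU source idea above concrete, a rough sketch.
The first two enum members exist today; everything else below (the new
enum member, the helper, and the table) is hypothetical:]

/*
 * Hypothetical sketch of per-vCPU GHES source IDs, keyed off arch_id as
 * suggested above rather than cpu_index. Note HEST source IDs are 16-bit
 * while arch_id on ARM is a sparse MPIDR-style value, so a dense index
 * would be assigned at HEST build time and arch_id only used for lookup.
 */
#include <stdint.h>

enum AcpiGhesSourceID {
    ACPI_HEST_SRC_ID_SYNC,
    ACPI_HEST_SRC_ID_QMP,        /* Use it only for QMP injected errors */
    ACPI_HEST_SRC_ID_VCPU_BASE,  /* hypothetical: per-vCPU sources start here */
};

typedef struct GhesVcpuSource {
    uint64_t arch_id;            /* MPIDR-derived, sparse */
    uint16_t source_id;          /* dense, assigned at table build time */
} GhesVcpuSource;

/* Linear scan keeps the sketch simple; real code could sort + bsearch. */
static uint16_t ghes_source_id_for_arch_id(const GhesVcpuSource *table,
                                           int nr, uint64_t arch_id)
{
    for (int i = 0; i < nr; i++) {
        if (table[i].arch_id == arch_id) {
            return table[i].source_id;
        }
    }
    return ACPI_HEST_SRC_ID_SYNC;   /* fallback: shared sync source */
}

[Each vCPU then gets its own error status block and Read Ack register, so
concurrent faults never contend on one block.]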
Per vCPU should work fine, but I do like the approach here of reporting all
the related errors in one go, as they represent the underlying nature of
the error granularity tracking. If anyone ever poisons at the 1GiB level on
the host they are on their own - so I think it will only ever be the finest
granularity supported (so worst case 64KiB).

Jonathan

> > 
> > Thanks,
> > Gavin
> > 
> > >>>> Improve push_ghes_memory_errors() to push 16x consecutive memory CPERs
> > >>>> for this specific case. The maximal error block size is bumped to 4KB,
> > >>>> providing enough storage space for those 16x memory CPERs.
> > >>>>
> > >>>> Signed-off-by: Gavin Shan <[email protected]>
> > >>>> ---
> > >>>>  hw/acpi/ghes.c   |  2 +-
> > >>>>  target/arm/kvm.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> > >>>>  2 files changed, 46 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> > >>>> index 045b77715f..5c87b3a027 100644
> > >>>> --- a/hw/acpi/ghes.c
> > >>>> +++ b/hw/acpi/ghes.c
> > >>>> @@ -33,7 +33,7 @@
> > >>>>  #define ACPI_HEST_ADDR_FW_CFG_FILE          "etc/acpi_table_hest_addr"
> > >>>>  
> > >>>>  /* The max size in bytes for one error block */
> > >>>> -#define ACPI_GHES_MAX_RAW_DATA_LENGTH   (1 * KiB)
> > >>>> +#define ACPI_GHES_MAX_RAW_DATA_LENGTH   (4 * KiB)
> > >>>>  
> > >>>>  /* Generic Hardware Error Source version 2 */
> > >>>>  #define ACPI_GHES_SOURCE_GENERIC_ERROR_V2   10
> > >>>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> > >>>> index c5d5b3b16e..3ecb85e4b7 100644
> > >>>> --- a/target/arm/kvm.c
> > >>>> +++ b/target/arm/kvm.c
> > >>>> @@ -11,6 +11,7 @@
> > >>>>   */
> > >>>>  
> > >>>>  #include "qemu/osdep.h"
> > >>>> +#include "qemu/units.h"
> > >>>>  #include <sys/ioctl.h>
> > >>>>  
> > >>>>  #include <linux/kvm.h>
> > >>>> @@ -2433,10 +2434,53 @@ static void push_ghes_memory_errors(CPUState *c, AcpiGhesState *ags,
> > >>>>                                      uint64_t paddr)
> > >>>>  {
> > >>>>      GArray *addresses = g_array_new(false, false, sizeof(paddr));
> > >>>> +    uint64_t val, start, end, guest_pgsz, host_pgsz;
> > >>>>      int ret;
> > >>>>  
> > >>>>      kvm_cpu_synchronize_state(c);
> > >>>> -    g_array_append_vals(addresses, &paddr, 1);
> > >>>> +
> > >>>> +    /*
> > >>>> +     * Sort out the guest page size from TCR_EL1, which can be modified
> > >>>> +     * by the guest from time to time. So we have to sort it out dynamically.
> > >>>> +     */
> > >>>> +    ret = read_sys_reg64(c->kvm_fd, &val, ARM64_SYS_REG(3, 0, 2, 0, 2));
> > >>>> +    if (ret) {
> > >>>> +        goto error;
> > >>>> +    }
> > >>>> +
> > >>>> +    switch (extract64(val, 14, 2)) {
> > >>>> +    case 0:
> > >>>> +        guest_pgsz = 4 * KiB;
> > >>>> +        break;
> > >>>> +    case 1:
> > >>>> +        guest_pgsz = 64 * KiB;
> > >>>> +        break;
> > >>>> +    case 2:
> > >>>> +        guest_pgsz = 16 * KiB;
> > >>>> +        break;
> > >>>> +    default:
> > >>>> +        error_report("unknown page size from TCR_EL1 (0x%" PRIx64 ")", val);
> > >>>> +        goto error;
> > >>>> +    }
> > >>>> +
> > >>>> +    host_pgsz = qemu_real_host_page_size();
> > >>>> +    start = paddr & ~(host_pgsz - 1);
> > >>>> +    end = start + host_pgsz;
> > >>>> +    while (start < end) {
> > >>>> +        /*
> > >>>> +         * The precise physical address is provided for the affected
> > >>>> +         * guest page that contains @paddr. Otherwise, the starting
> > >>>> +         * address of the guest page is provided.
> > >>>> +         */
> > >>>> +        if (paddr >= start && paddr < (start + guest_pgsz)) {
> > >>>> +            g_array_append_vals(addresses, &paddr, 1);
> > >>>> +        } else {
> > >>>> +            g_array_append_vals(addresses, &start, 1);
> > >>>> +        }
> > >>>> +
> > >>>> +        start += guest_pgsz;
> > >>>> +    }
> > >>>> +
> > >>>>      ret = acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC, addresses);
> > >>>>      if (ret) {
> > >>>>          goto error;
> > >>>
> > >>
> > >
> > 
> 
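
[Aside: for anyone who wants to sanity-check the batching loop above
without building QEMU, here is the same address math as a standalone
program. The 64KB/4KB constants match the commit message; the example
paddr is made up:]

/*
 * Standalone illustration of the batching loop in the patch above: for a
 * poisoned address within one host page, emit one address per guest page,
 * using the precise paddr for the page that actually contains it.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t host_pgsz = 64 * 1024, guest_pgsz = 4 * 1024;
    uint64_t paddr = 0x40003a30;                 /* example poisoned address */
    uint64_t start = paddr & ~(host_pgsz - 1);   /* 0x40000000 */
    uint64_t end = start + host_pgsz;            /* 0x40010000 */

    while (start < end) {
        if (paddr >= start && paddr < start + guest_pgsz) {
            printf("CPER address: 0x%" PRIx64 " (precise)\n", paddr);
        } else {
            printf("CPER address: 0x%" PRIx64 "\n", start);
        }
        start += guest_pgsz;
    }
    return 0;
}

[This prints 16 addresses, one per 4KB guest page in the 64KB host page,
with the precise poisoned address reported only for the page containing
it.]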
