On Mon, Nov 3, 2025 at 10:17 AM Jose Marinho <[email protected]> wrote:
>
> Thank you for these patches.
Thanks for your comments, Jose!

>
> On 10/13/2025 7:59 PM, Jiaqi Yan wrote:
> > When APEI fails to handle a stage-2 synchronous external abort (SEA),
> > today KVM injects an asynchronous SError to the VCPU then resumes it,
> > which usually results in unpleasant guest kernel panic.
> >
> > One major situation of guest SEA is when vCPU consumes recoverable
> > uncorrected memory error (UER). Although SError and guest kernel panic
> > effectively stops the propagation of corrupted memory, guest may
> > re-use the corrupted memory if auto-rebooted; in worse case, guest
> > boot may run into poisoned memory. So there is room to recover from
> > an UER in a more graceful manner.
> >
> > Alternatively KVM can redirect the synchronous SEA event to VMM to
> > - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> >   KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> >   consumption or fault is not from guest kernel, blast radius can be
> >   limited to the triggering thread in guest userspace, so VM can
> >   keep running.
> > - Allow VMM to protect from future memory poison consumption by
> >   unmapping the page from stage-2, or to interrupt guest of the
> >   poisoned page so guest kernel can unmap it from stage-1 page table.
> > - Allow VMM to track SEA events that VM customers care about, to restart
> >   VM when certain number of distinct poison events have happened,
> >   to provide observability to customers in log management UI.
> >
> > Introduce an userspace-visible feature to enable VMM handle SEA:
> > - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> >   when host APEI fails to claim a SEA, userspace can opt in this new
> >   capability to let KVM exit to userspace during SEA if it is not
> >   owned by host.
> > - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> >   KVM fills kvm_run.arm_sea with as much as possible information about
> >   the SEA, enabling VMM to emulate SEA to guest by itself.
> >   - Sanitized ESR_EL2. The general rule is to keep only the bits
> >     useful for userspace and relevant to guest memory.
> >   - Flags indicating if faulting guest physical address is valid.
> >   - Faulting guest physical and virtual addresses if valid.
> >
> > Signed-off-by: Jiaqi Yan <[email protected]>
> > Co-developed-by: Oliver Upton <[email protected]>
> > Signed-off-by: Oliver Upton <[email protected]>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +
> >  arch/arm64/kvm/arm.c              |  5 +++
> >  arch/arm64/kvm/mmu.c              | 68 ++++++++++++++++++++++++++++++-
> >  include/uapi/linux/kvm.h          | 10 +++++
> >  4 files changed, 84 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h
> > b/arch/arm64/include/asm/kvm_host.h
> > index b763293281c88..e2c65b14e60c4 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -350,6 +350,8 @@ struct kvm_arch {
> >  #define KVM_ARCH_FLAG_GUEST_HAS_SVE	9
> >  	/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
> >  #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS	10
> > +	/* Unhandled SEAs are taken to userspace */
> > +#define KVM_ARCH_FLAG_EXIT_SEA	11
> >  	unsigned long flags;
> >
> >  	/* VM-wide vCPU feature set */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index f21d1b7f20f8e..888600df79c40 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >  		}
> >  		mutex_unlock(&kvm->lock);
> >  		break;
> > +	case KVM_CAP_ARM_SEA_TO_USER:
> > +		r = 0;
> > +		set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > +		break;
> >  	default:
> >  		break;
> >  	}
> > @@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long
> > ext)
> >  	case KVM_CAP_IRQFD_RESAMPLE:
> >  	case KVM_CAP_COUNTER_OFFSET:
> >  	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> > +	case KVM_CAP_ARM_SEA_TO_USER:
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_SET_GUEST_DEBUG2:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 7cc964af8d305..09210b6ab3907 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu
> > *vcpu, phys_addr_t fault_ipa)
> >  	read_unlock(&vcpu->kvm->mmu_lock);
> >  }
> >
> > +/*
> > + * Returns true if the SEA should be handled locally within KVM if the
> > abort
> > + * is caused by a kernel memory allocation (e.g. stage-2 table memory).
> > + */
> > +static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
> > +{
> > +	/*
> > +	 * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
> > +	 * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
> > +	 * stage-2 PTW).
> > +	 */
> > +	if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
> > +		return true;
> > +
> > +	/* KVM owns the VNCR when the vCPU isn't in a nested context. */
> > +	if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
> Is this check valid only for a "Data Abort"?

Yes, the VNCR bit is specific to a Data Abort (given that we can only
reach host_owns_sea() if kvm_vcpu_abt_issea() is true). I don't think we
need to explicitly skip the check here for an Instruction Abort.

> > +		return true;
> > +
> > +	/*
> > +	 * Determine if an external abort during a table walk happened at
> > +	 * stage-2 is only possible with S1PTW is set. Otherwise, since KVM
> nit: Is the first sentence correct?

Oh, it should be "Determining ...".

>
> > +	 * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the
> > +	 * PA of the stage-1 descriptor) can reach here and are reported
> > +	 * with a TTW ESR value.
> > +	 */
> > +	return (esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW));
> > +}
> > +
> >  int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> >  {
> > +	struct kvm *kvm = vcpu->kvm;
> > +	struct kvm_run *run = vcpu->run;
> > +	u64 esr = kvm_vcpu_get_esr(vcpu);
> > +	u64 esr_mask = ESR_ELx_EC_MASK |
> > +		       ESR_ELx_IL |
> > +		       ESR_ELx_FnV |
> > +		       ESR_ELx_EA |
> > +		       ESR_ELx_CM |
> > +		       ESR_ELx_WNR |
> > +		       ESR_ELx_FSC;
> > +	u64 ipa;
> > +
> >  	/*
> >  	 * Give APEI the opportunity to claim the abort before handling it
> >  	 * within KVM. apei_claim_sea() expects to be called with IRQs
> > enabled.
> > @@ -1909,7 +1949,33 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> >  	if (apei_claim_sea(NULL) == 0)
> >  		return 1;
> >
> > -	return kvm_inject_serror(vcpu);
> > +	if (host_owns_sea(vcpu, esr) ||
> > +	    !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &vcpu->kvm->arch.flags))
> > +		return kvm_inject_serror(vcpu);
> > +
> > +	/* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
> > +	if (kvm_has_ras(kvm))
> > +		esr_mask |= ESR_ELx_SET_MASK;
> > +
> > +	/*
> > +	 * Exit to userspace, and provide faulting guest virtual and physical
> > +	 * addresses in case userspace wants to emulate SEA to guest by
> > +	 * writing to FAR_ELx and HPFAR_ELx registers.
> > +	 */
> > +	memset(&run->arm_sea, 0, sizeof(run->arm_sea));
> > +	run->exit_reason = KVM_EXIT_ARM_SEA;
> > +	run->arm_sea.esr = esr & esr_mask;
> > +
> > +	if (!(esr & ESR_ELx_FnV))
> > +		run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
> > +
> > +	ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > +	if (ipa != INVALID_GPA) {
> > +		run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
> > +		run->arm_sea.gpa = ipa;
>
> Are we interested in the value of PFAR_EL2 (if FEAT_PFAR implemented)?

I don't think userspace (the VMM) or the guest would need, or could make
any use of, the host physical address. I believe host physical addresses
should in general be hidden from userspace processes.

Completely off-topic: if FEAT_PFAR is implemented, I would propose that
EL3 RAS implement something like the steps below so that host APEI can
claim the SEA:
1. Triage the SEA to determine whether it has to be handled in place or
   should be redirected down to EL2.
2. If the SEA should be redirected to EL2, craft a memory error CPER that
   contains a valid physical memory address.
3. When redirecting the SEA to EL2, also populate that CPER into the host
   APEI GHES.
This way, a SEA caused by a memory error can properly trigger the normal
memory_failure() routine provided by the host kernel, instead of being
handled by KVM as an exception without any memory error context.

> I believe kvm_vcpu_get_fault_ipa gets the HPFAR_EL2, which is valid for
> S2 translation and GPC faults, but unknown for other cases.

You are absolutely right that the HPFAR_EL2 register is unknown for a
SEA. However, thanks to Oliver [1], KVM now performs a FAR to HPFAR
address translation (__translate_far_to_hpfar) for certain SEA cases
(see __fault_safe_to_translate), and stores the translation status and
result in vcpu->arch.fault. These SEA cases are the ones returned to
userspace in this patchset.

[1] https://lore.kernel.org/all/[email protected].

>
> Jose
>
> > +	}
> > +
> > +	return 0;
> >  }
> >
> >  /**
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 6efa98a57ec11..acc7b3a346992 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -179,6 +179,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_LOONGARCH_IOCSR 38
> >  #define KVM_EXIT_MEMORY_FAULT 39
> >  #define KVM_EXIT_TDX 40
> > +#define KVM_EXIT_ARM_SEA 41
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -473,6 +474,14 @@ struct kvm_run {
> >  				} setup_event_notify;
> >  			};
> >  		} tdx;
> > +		/* KVM_EXIT_ARM_SEA */
> > +		struct {
> > +#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0)
> > +			__u64 flags;
> > +			__u64 esr;
> > +			__u64 gva;
> > +			__u64 gpa;
> > +		} arm_sea;
> >  		/* Fix the size of the union. */
> >  		char padding[256];
> >  	};
> > @@ -963,6 +972,7 @@ struct kvm_enable_cap {
> >  #define KVM_CAP_RISCV_MP_STATE_RESET 242
> >  #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
> >  #define KVM_CAP_GUEST_MEMFD_MMAP 244
> > +#define KVM_CAP_ARM_SEA_TO_USER 245
> >
> >  struct kvm_irq_routing_irqchip {
> >  	__u32 irqchip;
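
For illustration only (not part of the patch): a rough sketch of how a
VMM might decode the kvm_run.arm_sea payload once it sees
KVM_EXIT_ARM_SEA. The struct and flag below are local stand-ins that
mirror the uapi layout quoted above, so the snippet builds without the
patched <linux/kvm.h>. The ESR_ELx bit positions are the architectural
ones. The names sea_exit, sea_info, sea_decode and sea_pick_action, and
the policy itself, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for KVM_EXIT_ARM_SEA_FLAG_GPA_VALID from the quoted uapi hunk. */
#define SEA_FLAG_GPA_VALID	(1ULL << 0)

/* Stand-in for kvm_run.arm_sea; same field order as the quoted struct. */
struct sea_exit {
	uint64_t flags;
	uint64_t esr;	/* sanitized ESR_EL2 */
	uint64_t gva;
	uint64_t gpa;
};

/* Architectural ESR_ELx fields that the patch's esr_mask preserves. */
#define ESR_EC(esr)		(((esr) >> 26) & 0x3f)
#define ESR_EC_DABT_LOW		0x24		/* Data Abort from a lower EL */
#define ESR_FNV			(1ULL << 10)	/* FAR not valid */
#define ESR_WNR			(1ULL << 6)	/* write, not read (Data Abort) */

/* Hypothetical VMM-side summary of one SEA exit. */
struct sea_info {
	bool is_data_abort;
	bool is_write;
	bool gva_valid;
	bool gpa_valid;
	uint64_t gva;
	uint64_t gpa;
};

/* Hypothetical policy outcomes; a real VMM would have its own. */
enum sea_action {
	SEA_INJECT_ONLY,	/* re-inject the SEA into the vCPU */
	SEA_UNMAP_AND_INJECT,	/* also stop mapping the poisoned page */
};

static struct sea_info sea_decode(const struct sea_exit *e)
{
	struct sea_info info = { 0 };

	info.is_data_abort = ESR_EC(e->esr) == ESR_EC_DABT_LOW;
	/* WnR is only meaningful for Data Aborts. */
	info.is_write = info.is_data_abort && (e->esr & ESR_WNR) != 0;
	info.gva_valid = (e->esr & ESR_FNV) == 0;
	info.gpa_valid = (e->flags & SEA_FLAG_GPA_VALID) != 0;

	/* Only trust an address when its validity indication says so. */
	if (info.gva_valid)
		info.gva = e->gva;
	if (info.gpa_valid)
		info.gpa = e->gpa;

	return info;
}

static enum sea_action sea_pick_action(const struct sea_info *info)
{
	/* Without a valid GPA there is no page to unmap from stage-2. */
	return info->gpa_valid ? SEA_UNMAP_AND_INJECT : SEA_INJECT_ONLY;
}
```

In a real VMM the decode step would run in the KVM_RUN loop on
run->exit_reason == KVM_EXIT_ARM_SEA, and the injection itself would go
through the existing KVM_SET_VCPU_EVENTS API mentioned in the commit
message. Both are left out here to keep the sketch self-contained.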
