[+ Vladimir, who was also looking at this patch]

On Tue, 13 Jan 2026 15:26:40 +0000,
Jack Thomson <[email protected]> wrote:
> 
> From: Jack Thomson <[email protected]>
> 
> Add kvm_arch_vcpu_pre_fault_memory() for arm64. The implementation hands
> off the stage-2 faulting logic to either gmem_abort() or
> user_mem_abort().
> 
> Add an optional page_size output parameter to user_mem_abort() to
> return the VMA page size, which is needed when pre-faulting.
> 
> Update the documentation to clarify x86-specific behaviour.
> 
> Signed-off-by: Jack Thomson <[email protected]>
> ---
>  Documentation/virt/kvm/api.rst |  3 +-
>  arch/arm64/kvm/Kconfig         |  1 +
>  arch/arm64/kvm/arm.c           |  1 +
>  arch/arm64/kvm/mmu.c           | 79 ++++++++++++++++++++++++++++++++--
>  4 files changed, 79 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 01a3abef8abb..44cfd9e736bb 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6493,7 +6493,8 @@ Errors:
>  KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
>  for the current vCPU state.  KVM maps memory as if the vCPU generated a
>  stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
> -CoW.  However, KVM does not mark any newly created stage-2 PTE as Accessed.
> +CoW.  However, on x86, KVM does not mark any newly created stage-2 PTE as
> +Accessed.
>  
>  In the case of confidential VM types where there is an initial set up of
>  private guest memory before the guest is 'finalized'/measured, this ioctl
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 4f803fd1c99a..6872aaabe16c 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -25,6 +25,7 @@ menuconfig KVM
>       select HAVE_KVM_CPU_RELAX_INTERCEPT
>       select KVM_MMIO
>       select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> +     select KVM_GENERIC_PRE_FAULT_MEMORY
>       select VIRT_XFER_TO_GUEST_WORK
>       select KVM_VFIO
>       select HAVE_KVM_DIRTY_RING_ACQ_REL
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 4f80da0c0d1d..19bac68f737f 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -332,6 +332,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>       case KVM_CAP_COUNTER_OFFSET:
>       case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
>       case KVM_CAP_ARM_SEA_TO_USER:
> +     case KVM_CAP_PRE_FAULT_MEMORY:
>               r = 1;
>               break;
>       case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 48d7c372a4cd..499b131f794e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1642,8 +1642,8 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                         struct kvm_s2_trans *nested,
> -                       struct kvm_memory_slot *memslot, unsigned long hva,
> -                       bool fault_is_perm)
> +                       struct kvm_memory_slot *memslot, unsigned long *page_size,
> +                       unsigned long hva, bool fault_is_perm)
>  {
>       int ret = 0;
>       bool topup_memcache;
> @@ -1923,6 +1923,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>       kvm_release_faultin_page(kvm, page, !!ret, writable);
>       kvm_fault_unlock(kvm);
>  
> +     if (page_size)
> +             *page_size = vma_pagesize;
> +
>       /* Mark the page dirty only if the fault is handled successfully */
>       if (writable && !ret)
>               mark_page_dirty_in_slot(kvm, memslot, gfn);
> @@ -2196,8 +2199,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>               ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
>                                esr_fsc_is_permission_fault(esr));
>       else
> -             ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> -                                  esr_fsc_is_permission_fault(esr));
> +             ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL,
> +                                  hva, esr_fsc_is_permission_fault(esr));
>       if (ret == 0)
>               ret = 1;
>  out:
> @@ -2573,3 +2576,71 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>  
>       trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
>  }
> +
> +long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> +                                 struct kvm_pre_fault_memory *range)
> +{
> +     struct kvm_vcpu_fault_info *fault_info = &vcpu->arch.fault;
> +     struct kvm_s2_trans nested_trans, *nested = NULL;
> +     unsigned long page_size = PAGE_SIZE;
> +     struct kvm_memory_slot *memslot;
> +     phys_addr_t ipa = range->gpa;
> +     phys_addr_t end;
> +     hva_t hva;
> +     gfn_t gfn;
> +     int ret;
> +
> +     if (vcpu_is_protected(vcpu))
> +             return -EOPNOTSUPP;

This feels pretty odd. If you have advertised the capability, then
saying "not supported" at this stage is not on.
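
If it really cannot work for a protected guest, one way to keep the
advertisement and the ioctl consistent (a sketch only, and
kvm_vm_is_protected() may not be the predicate you actually want)
would be to make the capability conditional:

	case KVM_CAP_PRE_FAULT_MEMORY:
		/* sketch: only advertise what the ioctl can deliver */
		r = !kvm_vm_is_protected(kvm);
		break;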

> +
> +     /*
> +      * We may prefault on a shadow stage 2 page table if we are
> +      * running a nested guest.  In this case, we have to resolve the L2
> +      * IPA to the L1 IPA first, before knowing what kind of memory should
> +      * back the L1 IPA.
> +      *
> +      * If the shadow stage 2 page table walk faults, then we return
> +      * -EFAULT
> +      */
> +     if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu) &&
> +         vcpu->arch.hw_mmu->nested_stage2_enabled) {
> +             ret = kvm_walk_nested_s2(vcpu, ipa, &nested_trans);
> +             if (ret)
> +                     return -EFAULT;

And then what? Userspace is completely screwed here, with no way to
make any forward progress, because the L1 is in charge of that S2, and
L1 is not running. What's the outcome? Light a candle and pray?

Also, the IPA you are passing as a parameter means absolutely nothing
in the context of L2. Userspace doesn't have the faintest clue about
the memory map presented to L2, as that's L1 business. L1 can
absolutely present to L2 a memory map that doesn't have a single
address in common with its own.

So this really doesn't work at all.

> +
> +             ipa = kvm_s2_trans_output(&nested_trans);
> +             nested = &nested_trans;
> +     }
> +
> +     if (ipa >= kvm_phys_size(vcpu->arch.hw_mmu))
> +             return -ENOENT;
> +
> +     /* Generate a synthetic abort for the pre-fault address */
> +     fault_info->esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) |
> +             ESR_ELx_FSC_FAULT_L(KVM_PGTABLE_LAST_LEVEL);

Why level 3? You must present a fault that matches the level at which
the emulated fault would actually occur, because the rest of the
infrastructure relies on that (at least on the permission path, and
more to come).
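
For context, the permission path works out the granule it is dealing
with from the level bits of that ESR, roughly along these lines
(simplified, not the exact code):

	level = esr & ESR_ELx_FSC_LEVEL;
	granule = BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));

A hardcoded level 3 lies to it for anything bigger than a page.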

Taking a step back on all this, 90% of the problems are there because
you are trying to support prefaulting a guest that is already running.
If you limited this to actually *pre*-faulting the guest, it would be
the easiest thing ever, and wouldn't suffer from any of the above (you
can't be in a nested context if you haven't run).
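
Something like this on the VMM side, before the first KVM_RUN (a
sketch with a made-up memory layout, retrying on EINTR since the
leftover range is passed back to userspace):

	struct kvm_pre_fault_memory range = {
		.gpa  = GUEST_RAM_BASE,
		.size = GUEST_RAM_SIZE,
	};
	int ret;

	do {
		ret = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
	} while (ret == -1 && errno == EINTR);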

What prevents you from doing so? I'm perfectly happy to make this a
separate API if this contradicts other implementations. Or are you
relying on other side effects of the "already running" state?

Thanks,

        M.

-- 
Without deviation from the norm, progress is not possible.
