On Tue, May 12, 2026 at 06:30:00AM -0700, Sean Christopherson wrote:
> On Tue, May 12, 2026, Fuad Tabba wrote:
> > On Mon, 11 May 2026 at 21:25, Sean Christopherson <[email protected]> wrote:
> > > > I used kvm_vm_release() because it's the only public API that closes
> > > > vm->fd to trigger kernel-side destruction.  But the existing callers
> > > > follow it with vm_recreate_with_one_vcpu(), so the "release + later
> > > > kvm_vm_free()" path isn't exercised today.
> > > >
> > > > I see three ways to make this clean:
> > > >   a) This patch: kvm_vm_release() becomes idempotent for its three
> > > >      FDs, matching the kvm_stats_release() idiom it already invokes.
> > > >   b) Leave kvm_vm_release() as-is and add a dedicated helper, e.g.
> > > >      kvm_vm_destroy_kernel(), that closes vm->fd to trigger kernel
> > > >      destruction while leaving the kvm_vm struct intact for
> > > >      post-destruction inspection.  kvm_vm_free() learns to handle the
> > > >      half-released state.
> > > >   c) Something else entirely, e.g., the test should manage vm->fd
> > > >      directly and not rely on library helpers for this pattern.
> > >
> > >     d) Fully kill the VM; validate the semantics with an explicit mmap().
> > >
> > > The entire point of the test you are writing is to verify that a
> > > guest_memfd VMA doesn't somehow cause KVM to leak state.  So, make
> > > that obvious instead of abusing APIs that kinda sorta do what you
> > > want, but not really.
> > >
> > >         mem = kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
> > >                        MAP_SHARED, region->guest_memfd);
> > >
> > >         ...
> > >
> > >         kvm_vm_free(vm);
> > >
> > >         TEST_ASSERT(is_zero(mem, ...));
> > 
> > The test isn't about guest_memfd.  The pKVM support that just landed
> > via Will's series [1] 
> 
> Landed where?  Is pKVM actually going upstream with anonymous memory?

The basic memory protection code for pKVM using anonymous memory is now
upstream but we still have a tonne to do for CPU state, DMA, firmware
loading etc. It obviously doesn't preclude support for guest_memfd in
future and, as you know, Fuad has been actively involved in that
discussion over the years. It does, however, mean that we'd be offering
guest_memfd alongside anonymous memory, which matches what we'd have to
do in Android at this point anyway (given the millions of devices using
anonymous memory in the field). Of course, if that means we should have
an additional set of tests, then that's fine.

> I thought the inability to protect against page faults in the untrusted
> kernel was a non-starter?

We ended up solving that by forcefully reclaiming the target page from
the guest. So the flow looks something like:

1. The host kernel accesses a private (faulting) page via a kernel mapping.
2. An exception is taken to the hypervisor.
3. The hypervisor injects an exception back into the kernel, with
   sufficient syndrome information to triage it as an access to a private
   page.
4. The kernel searches its usual exception handlers (for things like
   load_unaligned_zeropad()).
5. If no handler is found, the kernel translates the faulting virtual
   address into a physical address using the 'AT' instruction and issues
   a hypercall to forcefully reclaim the page.
6. The hypervisor unmaps the page from the guest and replaces the guest
   pte with an immutable "poison" (faulting) entry. If a vCPU later
   faults on this entry, it exits to userspace with -EFAULT.
7. The hypervisor clears the unmapped page, maps it back into the host
   and returns from the hypercall.
8. The kernel treats the fault as "spurious" and retries the faulting
   instruction.

If the faulting access in (1) instead came from userspace, the kernel
injects a SEGV at (4). If the faulting access came from a uaccess
routine (e.g. copy_from_user()), then the fixup handler would run like
normal in (4).

Will
