Hi Ackerley,

On Thu, 26 Feb 2026 at 04:16, Ackerley Tng <[email protected]> wrote:
>
> Fuad Tabba <[email protected]> writes:
>
> > Hi Ackerley,
> >
> > Here are my thoughts, at least when it comes to pKVM.
> >
> >
> > On Tue, 24 Feb 2026 at 10:14, Ackerley Tng <[email protected]> wrote:
> >>
> >> Ackerley Tng <[email protected]> writes:
> >>
> >> > Ackerley Tng <[email protected]> writes:
> >> >
> >> >>
> >> >> [...snip...]
> >> >>
> >> > Before this lands, Sean wants, at the very minimum, an in-principle
> >> > agreement on guest_memfd behavior with respect to whether or not memory
> >> > should be preserved on conversion.
> >> >>
> >> >> [...snip...]
> >> >>
> >>
> >> Here's what I've come up with, following up from last guest_memfd
> >> biweekly.
> >>
> >> Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
> >> enum set_memory_attributes_content_policy:
> >>
> >>   enum set_memory_attributes_content_policy {
> >>     SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
> >>     SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
> >>     SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
> >>   }
> >>
> >> Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
> >> will make an arch call
> >>
> >>   kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)
> >>
> >> where every arch will get to return some error if the requested policy
> >> is not supported for the given range.
> >
> > This hook provides the validation mechanism pKVM requires.
> >
> >> ZERO is the simplest of the above, it means that after the conversion
> >> the memory will be zeroed for the next reader.
> >>
> >> + TDX and SNP today will support ZERO since the firmware handles
> >>   zeroing.
> >> + pKVM and SW_PROTECTED_VM will apply software zeroing.
> >> + Purpose: having this policy in the API allows userspace to be sure
> >>   that the memory is zeroed after the conversion - there is no need to
> >>   zero again in userspace (addresses concern that Sean pointed out)
> >>
> >> READABLE means that after the conversion, the memory is readable by
> >> userspace (if converting to shared) or readable by the guest (if
> >> converting to private).
> >>
> >> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> >> + SW_PROTECTED_VM will support this and do nothing extra on
> >>   conversion, since there is no encryption anyway and all content
> >>   remains readable.
> >> + pKVM will make use of the arch function above.
> >>
> >> Here's where I need input: (David's questions during the call about
> >> the full flow beginning with the guest prompted this).
> >>
> >> Since pKVM doesn't encrypt the memory contents, there must be some way
> >> that pKVM can say no when userspace requests to convert and retain
> >> READABLE contents? I think pKVM's arch function can be used to check
> >> if the guest previously made a conversion request. Fuad, to check that
> >> the guest made a conversion request, what's other parameters are
> >> needed other than gfn and nr_pages?
> >
> > The gfn and nr_pages parameters are enough I think.
> >
> > To clarify how pKVM would use this hook: all memory sharing and
> > unsharing must be initiated by the guest via a hypercall. When the
> > guest issues this hypercall, the pKVM hypervisor (EL2) exits to the
> > host kernel (EL1). The host kernel records the exit reason (share or
> > unshare) along with the specific memory address in the kvm_run
> > structure before exiting to userspace.
> >
> > We do not track this pending conversion state in the hypervisor.
> > If a compromised host kernel wants to lie and corrupt the state, it can
> > crash the system or the guest (which is an accepted DOS risk), but it
> > cannot compromise guest confidentiality because EL2 still strictly
> > enforces Stage-2 permissions. Our primary goal here is to prevent a
> > malicious or buggy userspace VMM from crashing the system.
> >
>
> Thinking through it again, there's actually no security (in terms of
> CoCo confidentiality) risk here, since the conversion ioctl doesn't
> actually tell the CoCo vendor/platform to encrypt/decrypt or flip
> permissions, it just unmaps the pages as requested.
>
> On TDX, if a rogue private to shared conversion request comes in, the
> private pages would get unmapped from the guest, and on the next guest
> access, the guest would access the page as private, so kvm's fault
> handler would think there's a shared/private mismatch and exit with
> KVM_EXIT_MEMORY_FAULT. Userspace now has a zeroed shared page, and the
> guest needs to re-accept the page to continue using it (if it knows what
> to do with a zeroed page). This would be userspace DOS-ing the guest,
> which userspace can do anyway.
>
> On pKVM, rephrasing what you said, even if there is a rogue private to
> shared conversion, EL2 still thinks of the page as private. After the
> conversion, the page can be faulted in by the host, but any access will
> be stopped by EL2.
>
> David, there's no missing piece in the flow!
>
> > When the VMM subsequently issues the KVM_SET_MEMORY_ATTRIBUTES2 ioctl
> > with the READABLE policy, we will use the
> > kvm_gmem_arch_content_policy_supported() hook in EL1 to validate the
> > ioctl. We will cross-reference the requested gfn and nr_pages against
> > the pending exit reason stored in kvm_run.
> >
> > If the VMM attempts an unsolicited conversion (i.e., there is no
> > matching exit request in kvm_run, or the addresses do not match), our
>
> Ah I see, so struct kvm_run is not considered "in the hypervisor" since
> it is modifiable by host userspace. Would you be using struct
> memory_fault in struct kvm_run?
>
> Which vcpu's kvm_run struct would you look up from
> kvm_gmem_arch_content_policy_supported()?
>
> For this to land, do you still want the gfn and nr_pages parameters?
>
> Can pKVM just always accept the request, whether the guest requested it
> or not? Thinking about it again,
> kvm_gmem_arch_content_policy_supported() probably shouldn't be used to
> guard solicited vs unsolicited requests anyway (unless you think the
> function's name should be changed?)

You spotted a flaw in my proposed validation mechanism. As you said,
KVM_SET_MEMORY_ATTRIBUTES2 is a VM-scoped ioctl, meaning we lack the
vCPU context required to safely inspect a specific kvm_run structure.
Attempting to track this state VM-wide in struct kvm_arch would require
introducing a lock on a hot memory-transition path, which we want to
avoid.

Your assessment of the security implications and the TDX fallback
mechanism is correct and applies to pKVM. Because EL2 maintains the
ultimate source of truth regarding Stage-2 permissions, a rogue
conversion by the VMM cannot breach guest confidentiality.

Given this, it makes sense for pKVM to adopt the TDX approach. We will
drop the requirement to synchronously validate the guest's intent
during the ioctl, and will always accept the request in EL1. If the VMM
is lying and the guest never actually authorized the share, the
resulting attribute mismatch will trigger a KVM_EXIT_MEMORY_FAULT upon
the next guest access, or a SIGBUS/SIGSEGV if the host userspace
attempts to map and read the memory. This will DOS/crash the VMM
without endangering the host kernel.

Therefore, kvm_gmem_arch_content_policy_supported() should not be used
for dynamic state-machine validation. We will use it strictly as a
static capability check (e.g., verifying that the requested policy is
architecturally possible). I think we should still keep gfn and
nr_pages in the hook to allow for potential range-based capability
checks in the future, but not use them to cross-reference pending
conversion requests.

Cheers,
/fuad

> > current plan is to reject the request and return an error. In the
> > future, rather than outright rejecting an unsolicited conversion, we
> > might evolve this to treat it as a host-initiated destructive reclaim,
> > forcing an unshare and zeroing the memory. For the time being,
> > explicit rejection is the simplest and safest path.
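
To illustrate the static capability check described in my reply above,
here is a rough, userspace-compilable sketch. It is not kernel code:
struct mock_kvm, enum mock_vm_type, the gfn_t typedef and the -EINVAL
fallback for unknown values are hypothetical stand-ins, and only the
enum values and the hook name come from this thread. It also assumes
that an omitted policy shows up as a zero-initialised field, which is
why ZERO is pinned to 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Policy values as proposed in this thread. */
enum set_memory_attributes_content_policy {
	SET_MEMORY_ATTRIBUTES_CONTENT_ZERO = 0,
	SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
	SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
};

/* Assumption: an omitted (zeroed) policy field must mean ZERO. */
static_assert(SET_MEMORY_ATTRIBUTES_CONTENT_ZERO == 0,
	      "omitted policy must default to ZERO");

/* Hypothetical stand-ins for the real VM type and struct kvm. */
enum mock_vm_type {
	MOCK_VM_SW_PROTECTED,
	MOCK_VM_PKVM,
	MOCK_VM_TDX,
	MOCK_VM_SNP,
};

struct mock_kvm {
	enum mock_vm_type type;
};

typedef uint64_t gfn_t;

/*
 * Static capability check only: no lookup of kvm_run and no tracking of
 * pending guest requests. gfn and nr_pages are kept in the signature for
 * possible range-based checks later, but are unused here.
 */
static int kvm_gmem_arch_content_policy_supported(
		const struct mock_kvm *kvm,
		enum set_memory_attributes_content_policy policy,
		gfn_t gfn, uint64_t nr_pages)
{
	(void)gfn;
	(void)nr_pages;

	switch (policy) {
	case SET_MEMORY_ATTRIBUTES_CONTENT_ZERO:
		/* Firmware zeroing (TDX, SNP) or software zeroing. */
		return 0;
	case SET_MEMORY_ATTRIBUTES_CONTENT_READABLE:
		/* TDX and SNP cannot retain readable contents today. */
		if (kvm->type == MOCK_VM_TDX || kvm->type == MOCK_VM_SNP)
			return -EOPNOTSUPP;
		return 0;
	case SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED:
		/* Not a valid option when the feature first lands. */
		return -EOPNOTSUPP;
	default:
		/* Assumption: unknown policy values are rejected. */
		return -EINVAL;
	}
}
```

Note that pKVM accepts READABLE unconditionally in this sketch: an
unsolicited conversion is not rejected here, and is instead caught
later by EL2's Stage-2 enforcement as discussed above.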
> >
>
> >> ENCRYPTED means that after the conversion, the memory contents are
> >> retained as-is, with no decryption.
> >>
> >> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> >> + pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
> >>   READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
> >>   should return -EOPNOTSUPP.
> >> + Michael, you mentioned during the call that SNP is planning to
> >>   introduce a policy that retains the ENCRYPTED version for a special
> >>   GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
> >>   assuming that SNP should only support this policy given some
> >>   conditions, so would the arch call as described above work?
> >> + If this policy is specified on conversion from shared to private,
> >>   always return -EOPNOTSUPP.
> >> + When this first lands, ENCRYPTED will not be a valid option, but I'm
> >>   listing it here so we have line of sight to having this support.
> >>
> >> READABLE and ENCRYPTED defines the state after conversion clearly
> >> (instead of DONT_CARE or similar).
> >>
> >> DESTROY could be another policy, which means that after the
> >> conversion, the memory is unreadable. This is the option to address
> >> what David brought up during the call, for cases where userspace knows
> >> it is going to free the memory already and doesn't care about the
> >> state as long as nobody gets to read it. This will not implemented
> >> when feature first lands, but is presented here just to show how this
> >> can be extended in future.
> >>
> >> Right now, I'm thinking that one of the above policies MUST be
> >> specified (not specifying a policy will result in -EINVAL).
> >>
> >> How does this sound?
> >
> > I don't think that returning -EINVAL is the right thing to do here. If
> > userspace omits the policy, the API should default to
> > SET_MEMORY_ATTRIBUTES_CONTENT_ZERO and proceed with the conversion.
> > I believe that, in Linux APIs in general, omitting an optional behavior
> > flag results in the safest, most standard default action.
> >
>
> Makes sense, I'll default to zeroing then. Thanks!
>
> > Also, returning -EINVAL when no policy is specified makes the policy
> > parameter strictly mandatory. This makes it difficult for userspace's
> > to seamlessly request clean-slate, destructive conversions. Software
> > zeroing ensures deterministic behavior across pKVM, TDX, and SNP,
> > isolating the KVM uAPI from micro-architectural data destruction
> > nuances.
> >
> > Cheers,
> > /fuad
