Fuad Tabba <[email protected]> writes:

> Hi Ackerley,
>
> Here are my thoughts, at least when it comes to pKVM.
>
>
> On Tue, 24 Feb 2026 at 10:14, Ackerley Tng <[email protected]> wrote:
>>
>> Ackerley Tng <[email protected]> writes:
>>
>> > Ackerley Tng <[email protected]> writes:
>> >
>> >>
>> >> [...snip...]
>> >>
>> > Before this lands, Sean wants, at the very minimum, an in-principle
>> > agreement on guest_memfd behavior with respect to whether or not memory
>> > should be preserved on conversion.
>> >>
>> >> [...snip...]
>> >>
>>
>> Here's what I've come up with, following up from last guest_memfd
>> biweekly.
>>
>> Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
>> enum set_memory_attributes_content_policy:
>>
>>     enum set_memory_attributes_content_policy {
>>         SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
>>         SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
>>         SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
>>     }
>>
>> Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
>> will make an arch call
>>
>>     kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)
>>
>> where every arch will get to return some error if the requested policy
>> is not supported for the given range.
>
> This hook provides the validation mechanism pKVM requires.
>
>> ZERO is the simplest of the above, it means that after the conversion
>> the memory will be zeroed for the next reader.
>>
>> + TDX and SNP today will support ZERO since the firmware handles
>>   zeroing.
>> + pKVM and SW_PROTECTED_VM will apply software zeroing.
>> + Purpose: having this policy in the API allows userspace to be sure
>>   that the memory is zeroed after the conversion - there is no need to
>>   zero again in userspace (addresses concern that Sean pointed out)
>>
>> READABLE means that after the conversion, the memory is readable by
>> userspace (if converting to shared) or readable by the guest (if
>> converting to private).
>>
>> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
>> + SW_PROTECTED_VM will support this and do nothing extra on
>>   conversion, since there is no encryption anyway and all content
>>   remains readable.
>> + pKVM will make use of the arch function above.
>>
>> Here's where I need input: (David's questions during the call about
>> the full flow beginning with the guest prompted this).
>>
>> Since pKVM doesn't encrypt the memory contents, there must be some way
>> that pKVM can say no when userspace requests to convert and retain
>> READABLE contents? I think pKVM's arch function can be used to check
>> if the guest previously made a conversion request. Fuad, to check that
>> the guest made a conversion request, what other parameters are
>> needed other than gfn and nr_pages?
>
> The gfn and nr_pages parameters are enough I think.
>
> To clarify how pKVM would use this hook: all memory sharing and
> unsharing must be initiated by the guest via a hypercall. When the
> guest issues this hypercall, the pKVM hypervisor (EL2) exits to the
> host kernel (EL1). The host kernel records the exit reason (share or
> unshare) along with the specific memory address in the kvm_run
> structure before exiting to userspace.
>
> We do not track this pending conversion state in the hypervisor. If a
> compromised host kernel wants to lie and corrupt the state, it can
> crash the system or the guest (which is an accepted DOS risk), but it
> cannot compromise guest confidentiality because EL2 still strictly
> enforces Stage-2 permissions. Our primary goal here is to prevent a
> malicious or buggy userspace VMM from crashing the system.
>

Thinking through it again, there's actually no security (in terms of
CoCo confidentiality) risk here, since the conversion ioctl doesn't
actually tell the CoCo vendor/platform to encrypt/decrypt or flip
permissions, it just unmaps the pages as requested.

On TDX, if a rogue private to shared conversion request comes in, the
private pages would get unmapped from the guest, and on the next guest
access, the guest would access the page as private, so kvm's fault
handler would think there's a shared/private mismatch and exit with
KVM_EXIT_MEMORY_FAULT. Userspace now has a zeroed shared page, and the
guest needs to re-accept the page to continue using it (if it knows
what to do with a zeroed page). This would be userspace DOS-ing the
guest, which userspace can do anyway.

On pKVM, rephrasing what you said, even if there is a rogue private to
shared conversion, EL2 still thinks of the page as private. After the
conversion, the page can be faulted in by the host, but any access will
be stopped by EL2.

David, there's no missing piece in the flow!

> When the VMM subsequently issues the KVM_SET_MEMORY_ATTRIBUTES2 ioctl
> with the READABLE policy, we will use the
> kvm_gmem_arch_content_policy_supported() hook in EL1 to validate the
> ioctl. We will cross-reference the requested gfn and nr_pages against
> the pending exit reason stored in kvm_run.
>
> If the VMM attempts an unsolicited conversion (i.e., there is no
> matching exit request in kvm_run, or the addresses do not match), our

Ah I see, so struct kvm_run is not considered "in the hypervisor" since
it is modifiable by host userspace. Would you be using struct
memory_fault in struct kvm_run? Which vcpu's kvm_run struct would you
look up from kvm_gmem_arch_content_policy_supported()?

For this to land, do you still want the gfn and nr_pages parameters?
Can pKVM just always accept the request, whether the guest requested it
or not?
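
To make the question concrete, here is a rough standalone sketch of the
cross-reference check described in the quoted text above. It is only an
illustration under stated assumptions: struct pending_conversion and
check_solicited() are made-up names, the record is assumed to have been
filled in from the guest's share/unshare exit, "match" is assumed to
mean an exact gfn/nr_pages match, and -EPERM is a placeholder since the
thread does not say which error would be returned.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical record of the guest's most recent share/unshare request.
 * Where this would really live (which vcpu's kvm_run, which field) is
 * exactly the open question in this thread.
 */
struct pending_conversion {
	bool valid;        /* a guest-initiated request is outstanding */
	uint64_t gfn;      /* first guest frame of the requested range */
	uint64_t nr_pages; /* length of the requested range, in pages */
};

/*
 * Return 0 if the ioctl's range matches the pending guest request,
 * or a negative errno for an unsolicited conversion.
 * Assumption: only an exact range match counts as solicited.
 */
static int check_solicited(const struct pending_conversion *p,
			   uint64_t gfn, uint64_t nr_pages)
{
	if (!p->valid)
		return -EPERM; /* placeholder error: no pending request */

	if (p->gfn != gfn || p->nr_pages != nr_pages)
		return -EPERM; /* placeholder error: range mismatch */

	return 0;
}
```

Whether a partial or split conversion of a requested range should also
be accepted is not addressed by this sketch.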

Thinking about it again, kvm_gmem_arch_content_policy_supported()
probably shouldn't be used to guard solicited vs unsolicited requests
anyway (unless you think the function's name should be changed?)

> current plan is to reject the request and return an error. In the
> future, rather than outright rejecting an unsolicited conversion, we
> might evolve this to treat it as a host-initiated destructive reclaim,
> forcing an unshare and zeroing the memory. For the time being,
> explicit rejection is the simplest and safest path.
>
>> ENCRYPTED means that after the conversion, the memory contents are
>> retained as-is, with no decryption.
>>
>> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
>> + pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
>>   READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
>>   should return -EOPNOTSUPP.
>> + Michael, you mentioned during the call that SNP is planning to
>>   introduce a policy that retains the ENCRYPTED version for a special
>>   GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
>>   assuming that SNP should only support this policy given some
>>   conditions, so would the arch call as described above work?
>> + If this policy is specified on conversion from shared to private,
>>   always return -EOPNOTSUPP.
>> + When this first lands, ENCRYPTED will not be a valid option, but I'm
>>   listing it here so we have line of sight to having this support.
>>
>> READABLE and ENCRYPTED define the state after conversion clearly
>> (instead of DONT_CARE or similar).
>>
>> DESTROY could be another policy, which means that after the
>> conversion, the memory is unreadable. This is the option to address
>> what David brought up during the call, for cases where userspace knows
>> it is going to free the memory already and doesn't care about the
>> state as long as nobody gets to read it.
>> This will not be implemented when the feature first lands, but is
>> presented here just to show how this can be extended in future.
>>
>> Right now, I'm thinking that one of the above policies MUST be
>> specified (not specifying a policy will result in -EINVAL).
>>
>> How does this sound?
>
> I don't think that returning -EINVAL is the right thing to do here. If
> userspace omits the policy, the API should default to
> SET_MEMORY_ATTRIBUTES_CONTENT_ZERO and proceed with the conversion. I
> believe that, in Linux APIs in general, omitting an optional behavior
> flag results in the safest, most standard default action.
>

Makes sense, I'll default to zeroing then. Thanks!

> Also, returning -EINVAL when no policy is specified makes the policy
> parameter strictly mandatory. This makes it difficult for userspace
> to seamlessly request clean-slate, destructive conversions. Software
> zeroing ensures deterministic behavior across pKVM, TDX, and SNP,
> isolating the KVM uAPI from micro-architectural data destruction
> nuances.
>
> Cheers,
> /fuad
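
For reference, the per-policy behavior discussed in this thread can be
summarized in a small standalone C sketch. This is not the proposed
kernel code: enum vm_kind and content_policy_supported() are
hypothetical names used only for illustration, the real hook is
proposed to take (kvm, policy, gfn, nr_pages), and pKVM's handling of
READABLE is still under discussion above. The enum is copied from the
proposal; since ZERO is its first enumerator it has the value 0, so a
policy field that userspace leaves zero-initialized would already
select zeroing, assuming the field is laid out that way.

```c
#include <errno.h>

/* Policy values as listed in the proposal; ZERO is implicitly 0. */
enum set_memory_attributes_content_policy {
	SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
	SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
	SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
};

/* Hypothetical stand-in for "which kind of VM is this". */
enum vm_kind {
	VM_TDX,
	VM_SNP,
	VM_PKVM,
	VM_SW_PROTECTED,
};

/*
 * Return 0 if the policy is supported for this kind of VM, or a
 * negative errno otherwise, following the bullets in the thread.
 */
static int content_policy_supported(enum vm_kind kind,
		enum set_memory_attributes_content_policy policy)
{
	switch (policy) {
	case SET_MEMORY_ATTRIBUTES_CONTENT_ZERO:
		/* TDX/SNP: firmware zeroes; pKVM/SW_PROTECTED_VM: software. */
		return 0;
	case SET_MEMORY_ATTRIBUTES_CONTENT_READABLE:
		if (kind == VM_TDX || kind == VM_SNP)
			return -EOPNOTSUPP;
		/* Assumption: pKVM accepts; any extra checks are undecided. */
		return 0;
	case SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED:
		/* Not a valid option when the feature first lands. */
		return -EOPNOTSUPP;
	}

	return -EINVAL; /* unknown policy value */
}
```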
