Sean Christopherson <[email protected]> writes:

> On Thu, Mar 12, 2026, Fuad Tabba wrote:
>> Hi Ackerley,
>>
>> Before getting into the UAPI semantics, thank you for all the heavy
>> lifting you've done here. Figuring out how to make it all work across
>> the different platforms is not easy :)
>>
>> <snip>
>>
>> > The policy definitions below provide more details:
>
> Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's
> perspective, these are not "policy", they are purely properties of the
> underlying memory.  Userspace will likely use the attributes to implement
> policy of some kind, but KVM straight up doesn't care.

Policy might have been the wrong word. I think this is a property of the
conversion process/request, not a property of the memory like how
shared/private is a property of the memory?

I'll have to find another word to describe this enum of

* KVM_SET_MEMORY_ATTRIBUTES2_ZERO
* KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE

>
>> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_ZERO`` (default)
>
> The default behavior absolutely cannot be something that's not supported on
> every conversion type.
>
>> >
>> >   On a private to shared conversion, the host will read zeros from the
>> >   converted memory on the next fault after successful return of the
>> >   KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>> >
>> >   This is not supported (-EOPNOTSUPP) for a shared to private
>> >   conversion. While some CoCo implementations do zero memory contents
>> >   such that the guest reads zeros after conversion, the guest is not
>> >   expected to trust host-provided zeroing, hence as a UAPI policy, KVM
>> >   does not make any such guarantees.
>>
>> The rationale for not supporting this in the UAPI isn't quite right
>> and I think that the prohibition should be removed. It's true that the
>> guest is not expected to trust host-provided zeroing. However, if the
>> VMM invokes this ioctl with the ZERO policy, the zeroing is performed
>> by the hypervisor, not by the (untrusted) host.
>
> What entity zeros the data doesn't matter as far as KVM's ABI is concerned.
> That's a motivating factor for providing ZERO, e.g. it allows userspace to
> elide additional zeroing when it _knows_ the memory holds zeros, but that's
> orthogonal to KVM's contract with userspace.
>
>> Although pKVM handles fresh, zeroed memory provisioning via donation
>> rather than attribute conversion, stating that the UAPI cannot make
>> guarantees due to trust boundaries is incorrect. The hypervisor is
>
> We should avoid using "hypervisor", because (a) it means different things to
> different people and (b) even when there's consensus on what "hypervisor"
> means, whether or not the hypervisor is trusted varies per implementation.
>
>> precisely the entity the guest trusts to enforce
>> this.
>>
>> The UAPI should define the semantics for a shared-to-private ZERO
>> conversion, even if current architectures return -EOPNOTSUPP because
>> they handle fresh memory provisioning via other mechanisms (like
>> pKVM's donation path).
>>
>> How about something like the following:
>>
>> On a shared to private conversion, the hypervisor will zero the memory
>
> Again, say _nothing_ about "the hypervisor".  _How_ or when anything happens
> is completely irrelevant, the only thing that matters here is _what_ happens.
>
>> contents before mapping it into the guest's private address space,
>> preventing the untrusted host from injecting arbitrary data into the
>> guest. If an architecture handles zeroed-provisioning via mechanisms
>> other than attribute conversion, it may return -EOPNOTSUPP.
>
> No.  I am 100% against bleeding vendor specific information into KVM's ABI for
> this.  What the vendor code does is irrelevant, the _only_ thing that matters
> here is KVM's contract with userspace.
>
> That doesn't mean pKVM guests can't rely on memory being zeroed, but that is a
> contract between pKVM and its guests, not between KVM and host userspace.
>

If pKVM's (kernel, or elsewhere) documentation says something like

  Shared to private (in addition to private to shared already specified
  in the userspace/KVM contract) conversions through guest_memfd
  specifying ZERO will have memory contents zeroed.

Would that then cover both perspectives? I see Fuad's point that pKVM
would like to provide guarantees in the shared to private direction too,
and I see Sean's point that the shared to private direction isn't a
userspace/KVM thing.

The awkward part is that we guarantee both directions for PRESERVE but
not for ZERO.

>> >   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
>> >   will support this policy and ensure zeroing for conversions in both
>> >   directions.
>> >
>> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_PRESERVE``
>> >
>> >   On private/shared conversions in both directions, memory contents
>> >   will be preserved and readable. As a concrete example, if the host
>> >   writes ``0xbeef`` to memory and converts the memory to shared, the
>> >   guest will also read ``0xbeef``, after any necessary hardware or
>> >   software provided decryption. After a reverse shared to private
>> >   conversion, the host will also read ``0xbeef``.
>>
>> I think that this example is backwards. If the host writes to memory,
>> that memory is already shared, isn't it? Converting it to shared is
>> redundant. More importantly, if memory undergoes a shared-to-private
>> conversion, the host must lose access entirely.
>
> Ya, it's messed up.
>

Omg, it is backwards!! Might have been copypasta...

>> Maybe a clearer example would reflect actual payload injection and
>> bounce buffer sharing:
>> - Shared-to-Private (Payload Injection): The host writes a payload
>> (e.g., 0xbeef) to shared memory and converts it to private. The guest
>> reads 0xbeef in its private address space. The host loses access.
>> - Private-to-Shared (Bounce Buffer): The guest writes 0xbeef to
>> private memory and converts it to shared. The host reads 0xbeef.
>>
>> >   pKVM (ARM) is the first user of this policy. Since pKVM does not
>> >   protect memory with encryption, a content policy to preserve memory
>> >   will not involve any decryption. The guest will be able to
>> >   read what the host wrote with full content preservation.
>>
>> This is correct, but to be precise, I think it should explicitly
>> mention Stage-2 page tables as the protection mechanism, maybe:
>
> pKVM shouldn't be mentioned in here at all.
>
> ---
> By default, KVM makes no guarantees about the in-memory values after memory is
> converted to/from shared/private.  Optionally, userspace may instruct KVM to
> ensure the contents of memory are zeroed or preserved, e.g. to enable in-place
> sharing of data, or as an optimization to avoid having to re-zero memory when
> the trusted entity guarantees the memory will be zeroed after conversion.
>

How about:

or as an optimization to avoid having to re-zero memory when userspace
could have relied on the trusted entity to guarantee the memory will be
zeroed as part of the entire conversion process.

> The behaviors supported by a given KVM instance can be queried via <cap>.  If

I started with some implementation and was questioning the value of a
CAP. It seems like there won't be anything dynamic about this?

The userspace code can check what platform it is running on, and then
decide ZERO or PRESERVE based on the platform:

If the VM is running on TDX, it would want to specify ZERO all the
time. If the VM were running on pKVM it would want to specify PRESERVE
if it wants to enable in-place sharing, and ZERO if it wants to zero the
memory.

If someday TDX supports PRESERVE, then there's room for discovery of
which algorithm to choose when running the guest. Perhaps that's when
the CAP should be introduced?
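
To make the "no CAP" option concrete, here is a rough userspace sketch of
picking the request purely from the platform. Everything named sketch_* or
SKETCH_* is made up for illustration: the flag values are not proposed UAPI
encodings, and how the VMM learns which platform it is on is assumed to
already exist.

```c
#include <stdint.h>

/* Hypothetical flag values; the real UAPI encoding is still under discussion. */
#define SKETCH_ATTR2_ZERO      (1ULL << 0)
#define SKETCH_ATTR2_PRESERVE  (1ULL << 1)

/* Hypothetical platform identifiers, as already known to the VMM. */
enum sketch_platform {
	SKETCH_PLATFORM_TDX,
	SKETCH_PLATFORM_PKVM,
};

/*
 * Choose the conversion request without querying a capability:
 * TDX always asks for ZERO, pKVM asks for PRESERVE only when the
 * VMM wants in-place sharing and for ZERO otherwise.
 */
static uint64_t sketch_pick_conversion_flag(enum sketch_platform platform,
					    int want_in_place_sharing)
{
	switch (platform) {
	case SKETCH_PLATFORM_TDX:
		return SKETCH_ATTR2_ZERO;
	case SKETCH_PLATFORM_PKVM:
		return want_in_place_sharing ? SKETCH_ATTR2_PRESERVE
					     : SKETCH_ATTR2_ZERO;
	}
	/* Unknown platform: request nothing, i.e. no guarantees. */
	return 0;
}
```

If TDX ever grows PRESERVE, the TDX case above is where a query would become
necessary, which is the point at which a CAP would start to pay for itself.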

> the requested behavior is unsupported, KVM will return -EOPNOTSUPP and
> reject the conversion request.  Note!  The "ZERO" request is only supported for
> private to shared conversion!
>

Do you mean ZERO is only guaranteed for private to shared? If we say
"ZERO is only guaranteed for private to shared", then pKVM could
additionally guarantee zeroing for shared to private. If we say it is
only supported for private to shared, then should I return -EOPNOTSUPP
and therefore not allow platforms to provide other guarantees?

I think we should stick to guarantees for this

* not specified (default) = no guarantees whatsoever
* ZERO = guaranteed zero for private to shared, no guarantees for
         shared to private. Platforms can add on more guarantees.
* PRESERVE = guaranteed preserved in both directions

-EOPNOTSUPP should probably be understood as "There is no way to
guarantee this" like how TDX would return -EOPNOTSUPP for PRESERVE
(now).
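
Spelled out as code, the "guarantees" reading could look like the sketch
below. It takes ZERO as guaranteed for private to shared (as in my question
above), lets a platform add the shared to private guarantee, and reserves
-EOPNOTSUPP for "there is no way to guarantee this". All names are
hypothetical and this is a model of the proposal, not what KVM does today.

```c
#include <errno.h>
#include <stdbool.h>

enum sketch_request { SKETCH_REQ_NONE, SKETCH_REQ_ZERO, SKETCH_REQ_PRESERVE };
enum sketch_direction { SKETCH_PRIVATE_TO_SHARED, SKETCH_SHARED_TO_PRIVATE };

/* Result of a request that was accepted. */
enum { SKETCH_NO_GUARANTEE = 0, SKETCH_GUARANTEED = 1 };

/* Hypothetical per-platform properties. */
struct sketch_platform_caps {
	bool can_preserve;              /* contents can survive conversion */
	bool zeroes_shared_to_private;  /* extra, platform-provided guarantee */
};

/*
 * Return SKETCH_GUARANTEED or SKETCH_NO_GUARANTEE if the request is
 * accepted, or -EOPNOTSUPP if the requested guarantee cannot be given.
 */
static int sketch_conversion_guarantee(const struct sketch_platform_caps *caps,
				       enum sketch_request req,
				       enum sketch_direction dir)
{
	switch (req) {
	case SKETCH_REQ_NONE:
		/* Not specified (default): no guarantees whatsoever. */
		return SKETCH_NO_GUARANTEE;
	case SKETCH_REQ_ZERO:
		if (dir == SKETCH_PRIVATE_TO_SHARED)
			return SKETCH_GUARANTEED;
		/* Shared to private: only if the platform adds it. */
		return caps->zeroes_shared_to_private ? SKETCH_GUARANTEED
						      : SKETCH_NO_GUARANTEE;
	case SKETCH_REQ_PRESERVE:
		/* Guaranteed in both directions, or not supported at all. */
		return caps->can_preserve ? SKETCH_GUARANTEED : -EOPNOTSUPP;
	}
	return -EINVAL;
}
```

Under the alternative "only supported" reading, the shared to private ZERO
branch would return -EOPNOTSUPP instead of SKETCH_NO_GUARANTEE, and that
one line is exactly the difference I'm asking about.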

> ``KVM_SET_MEMORY_ATTRIBUTES2_ZERO``
>
>   On conversion, KVM guarantees all entities that have "allowed" access to the
>   memory will read zeros.  E.g. on private to shared conversion, both trusted
>   and untrusted code will read zeros.
>
>   Zeroing is currently only supported for private-to-shared conversions, as
>   KVM in general is untrusted and thus cannot guarantee the guest (or any
>   trusted entity) will read zeros after conversion.  Note, some CoCo
>   implementations do zero memory contents such that the guest reads zeros
>   after conversion, and the guest may choose to rely on that behavior.  But
>   that's a contract between the trusted CoCo entity and the guest, not
>   between KVM and the guest.
>
> ``KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE``
>
>   On conversion, KVM guarantees memory contents will be preserved with respect
>   to the last written unencrypted value.  As a concrete example, if the host
>   writes ``0xbeef`` to shared memory and converts the memory to private, the
>   guest will also read ``0xbeef``, even if the in-memory data is encrypted as
>   part of the conversion.  And vice versa, if the guest writes ``0xbeef`` to
>   private memory and then converts the memory to shared, the host (and guest)
>   will read ``0xbeef`` (if the memory is accessible).

Thank you for this summary :)

I see you dropped any documentation to do with testing. I meant to
document it (at least something about the unspecified case) so it can be
relied on in selftests, with the understanding (already specified
elsewhere in Documentation/virt/kvm/api.rst) that nothing about
KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be
changed anytime. What do you think?
