On 16/5/25 02:04, Xu Yilun wrote:
On Wed, May 14, 2025 at 01:33:39PM -0300, Jason Gunthorpe wrote:
On Wed, May 14, 2025 at 03:02:53PM +0800, Xu Yilun wrote:
We have an awkward fit for what CCA people are doing to the various
Linux APIs. Looking somewhat maximally across all the arches a "bind"
for a CC vPCI device creation operation does:

  - Setup the CPU page tables for the VM to have access to the MMIO

This is a guest side thing, is it? Anything the host needs to opt in to?

CPU hypervisor page tables.

  - Revoke hypervisor access to the MMIO

VFIO could choose never to mmap MMIO, so in this case nothing to do?

Yes, if you do it that way.
  - Setup the vIOMMU to understand the vPCI device
  - Take over control of some of the IOVA translation, at least for T=1,
    and route to the vIOMMU
  - Register the vPCI with any attestation functions the VM might use
  - Do some DOE stuff to manage/validate TDISP/etc

Intel TDX Connect has an extra requirement for "unbind":

- Revoke KVM page table (S-EPT) for the MMIO only after TDISP
   CONFIG_UNLOCK

Maybe you could express this as the S-EPT always has the MMIO mapped
into it as long as the vPCI function is installed to the VM?

Yeah.

Is KVM responsible for the S-EPT?

Yes.


Another thing is, seems your term "bind" includes all steps for
shared -> private conversion.

Well, I was talking about vPCI creation. I understand that during the
vPCI lifecycle the VM will do "bind" "unbind" which are more or less
switching the device into a T=1 mode. Though I understood on some

I want to introduce some terms about CC vPCI.

1. "Bind", guest requests host do host side CC setup & put device in
CONFIG_LOCKED state, waiting for attestation. Any further change which
has a security concern breaks "bind", e.g. reset, touch MMIO, physical MSE,
BAR addr...

2. "Attest", after "bind", guest verifies device evidence (cert,
measurement...).

3. "Accept", after successful attestation, guest does guest side CC setup &
switches the device into T=1 mode (TDISP RUN state)

(implementation note)
AMD SEV moves TDI to RUN at "Attest" as the guest can still avoid encrypted MMIO
access and the PSP keeps IOMMU blocked until the guest enables it.

4. "Unbind", guest requests host put device in CONFIG_UNLOCKED state +
remove all CC setup.
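
The four terms above can be read as a small state machine. The following is
only a rough sketch of that reading, not code from any patch: the class and
method names (Tdi, bind, attest, accept, disturb, unbind) are made up for
illustration, and it follows the numbered terms rather than the AMD SEV variant
where the TDI moves to RUN at "Attest".

```python
from enum import Enum, auto


class TdiState(Enum):
    CONFIG_UNLOCKED = auto()
    CONFIG_LOCKED = auto()
    RUN = auto()
    ERROR = auto()


class Tdi:
    """Toy model of one TDI driven by bind/attest/accept/unbind (hypothetical)."""

    def __init__(self):
        self.state = TdiState.CONFIG_UNLOCKED
        self.attested = False

    def bind(self):
        # Host side CC setup, then the device configuration is locked.
        if self.state is not TdiState.CONFIG_UNLOCKED:
            raise RuntimeError("bind requires CONFIG_UNLOCKED")
        self.state = TdiState.CONFIG_LOCKED

    def attest(self, evidence_ok):
        # Guest verifies certificates/measurements while the TDI is locked.
        if self.state is not TdiState.CONFIG_LOCKED:
            raise RuntimeError("attest requires CONFIG_LOCKED")
        self.attested = bool(evidence_ok)
        return self.attested

    def accept(self):
        # Guest side CC setup; the device enters T=1 mode (RUN).
        if self.state is not TdiState.CONFIG_LOCKED or not self.attested:
            raise RuntimeError("accept requires a locked, attested TDI")
        self.state = TdiState.RUN

    def disturb(self):
        # Security-relevant change (reset, physical MSE, BAR addr...):
        # a locked or running TDI drops to ERROR and "bind" is broken.
        if self.state in (TdiState.CONFIG_LOCKED, TdiState.RUN):
            self.state = TdiState.ERROR
            self.attested = False

    def unbind(self):
        # Host removes all CC setup and unlocks the device.
        self.state = TdiState.CONFIG_UNLOCKED
        self.attested = False
```

In this model accept() refuses to run before a successful attest(), and any
disturbance after bind() forces a new bind/attest/accept round.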

arches this was mostly invisible to the hypervisor?

Attest & Accept can be invisible to the hypervisor, or the host just helps pass
data blobs between guest, firmware & device.

No, they cannot.

Bind cannot be host agnostic, host should be aware not to touch device
after Bind.

Bind actually connects a TDI to a guest, the guest could not possibly do that 
alone as it does not know/have access to the physical PCI function#0 to do the 
DOE/SecSPDM messaging, and neither does the PSP.

The non-touching clause (or, more precisely, "selectively touching") is about
"Attest" and "Accept", when the TDI is in the CONFIG_LOCKED or RUN state. It
goes as far as us wanting to block config space and MSI-X BAR access once the
TDI is CONFIG_LOCKED/RUN, to prevent the TDI from going to the ERROR state.



But in my mind, "bind" only includes
putting device in TDISP LOCK state & corresponding host setups required
by firmware. I.e. "bind" means the host locks down the CC setup, waiting for
guest attestation.

So we will need to have some other API for this that modifies the vPCI
object.

IIUC, in Alexey's patch ioctl(iommufd, IOMMU_VDEVICE_TSM_BIND) does the
"Bind" thing in host.


I am still not sure what "vPCI" means exactly, a passed through PCI device? Or 
a piece of vIOMMU handling such device?


It might be reasonable to have VFIO reach into iommufd to do that on
an already existing iommufd VDEVICE object. A little weird, but we
could probably make that work.

Mm, are you proposing a uAPI in VFIO, and a kAPI from VFIO -> IOMMUFD like:

  ioctl(vfio_fd, VFIO_DEVICE_ATTACH_VDEV, vdev_id)
  -> iommufd_device_attach_vdev()
     -> tsm_tdi_bind()


But you have some weird ordering issues here if the S-EPT has to have
the VFIO MMIO then you have to have a close() destruction order that

Yeah, by holding kvm reference.

sees VFIO remove the S-EPT and release the KVM, then have iommufd
destroy the VDEVICE object.

Regarding VM destroy, TDX Connect has more enforcement, VM could only be
destroyed after all assigned CC vPCI devices are destroyed.

Can be done by making IOMMUFD/vdevice hold the kvm pointer to ensure
tsm_tdi_unbind() is not called before the guest disappeared from the firmware. 
I seem to be just lucky with the current order of things being destroyed, hmm.


Nowadays, VFIO already holds KVM reference, so we need

close(vfio_fd)
-> iommufd_device_detach_vdev()
    -> tsm_tdi_unbind()
       -> tdi stop
       -> callback to VFIO, dmabuf_move_notify(revoke)
          -> KVM unmap MMIO
       -> tdi metadata remove
-> kvm_put_kvm()
    -> kvm_destroy_vm()
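
A toy model of the ordering in the chain above, as a sketch only: it assumes
the VFIO device takes its own KVM reference, so the VM cannot be destroyed
until close(vfio_fd) has stopped the TDI and unmapped the MMIO. The Kvm and
VfioDevice classes and teardown_order() are hypothetical names; the log
strings mirror the steps in the chain.

```python
class Kvm:
    """Refcounted stand-in for struct kvm (hypothetical model)."""

    def __init__(self, log):
        self.log = log
        self.refs = 1            # reference owned by the VM fd
        self.mmio_mapped = False

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            if self.mmio_mapped:
                raise RuntimeError("VM destroyed with MMIO still mapped")
            self.log.append("kvm_destroy_vm")


class VfioDevice:
    """Stand-in for a VFIO device that holds a KVM reference."""

    def __init__(self, kvm, log):
        self.kvm = kvm.get()
        self.log = log
        self.bound = False

    def bind(self):
        self.kvm.mmio_mapped = True
        self.bound = True

    def close(self):
        if self.bound:
            self.log.append("tdi stop")
            self.kvm.mmio_mapped = False   # dmabuf revoke -> KVM unmap
            self.log.append("KVM unmap MMIO")
            self.log.append("tdi metadata remove")
            self.bound = False
        self.kvm.put()                     # last reference may destroy the VM


def teardown_order():
    log = []
    kvm = Kvm(log)
    dev = VfioDevice(kvm, log)
    dev.bind()
    kvm.put()      # VM fd closed first; VFIO's reference keeps the VM alive
    dev.close()    # unbind steps run, then the final put destroys the VM
    return log
```

Even when the VM fd goes away first, the log ends with kvm_destroy_vm, after
the TDI has been stopped and the MMIO unmapped.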



It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
stays in VFIO.

I'm not sure if Alexey's patch [1] illustrates your idea. It calls
tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
VFIO doesn't know about this.

VFIO knows about this enough as we asked it to share MMIO via dmabuf's fd and 
not via mmap(), otherwise it is the same MMIO, exactly where it was, BARs do 
not change.


I have to interpret this as VFIO first handing over device CC features
and MMIO resources to IOMMUFD, so VFIO never cares about them.

[1] https://lore.kernel.org/all/20250218111017.491719-15-...@amd.com/

There is also the PCI layer involved here and maybe PCI should be
participating in managing some of this. Like it makes a bit of sense
that PCI would block the FLR on platforms that require this?

FLR to a bound device is absolutely fine, it just breaks the CC state.
Sometimes it is exactly what the host needs to stop CC immediately.
The problem is in VFIO's pre-FLR handling so we need to patch VFIO, not
PCI core.

What is the problem here exactly?
FLR by the host, which is equivalent to any other PCI error? The guest may or may not
be able to handle it, afaik it does not handle any errors now, QEMU just stops 
the guest.
Or FLR by the guest? Then it knows it needs to do the dance with attest/accept, 
again.

Thanks,


Thanks,
Yilun


Jason

--
Alexey
