On Mon, May 12, 2025 at 11:06:17AM -0300, Jason Gunthorpe wrote:
> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
> 
> > > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> > > > just about managing the translation control of the device.
> > > 
> > > I have a little difficulty to understand. Is TSM bind PCI stuff? To me
> > > it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> > > TDISP LOCKED state, so that device behaves differently from before. Then
> > > why put it in IOMMUFD?
> > 
> > 
> > "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> > IOMMU on the host CPU. The device does not know about the VM, it
> > just enables/disables encryption by a request from the CPU (those
> > start/stop interface commands). And IOMMUFD won't be doing DOE, the
> > platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
> > 
> > We probably should notify VFIO about the state transition but I do
> > not know VFIO would want to do in response.
> 
> We have an awkward fit for what CCA people are doing to the various
> Linux APIs. Looking somewhat maximally across all the arches a "bind"
> for a CC vPCI device creation operation does:
> 
>  - Setup the CPU page tables for the VM to have access to the MMIO

This is guest side thing, is it? Anything host need to opt-in?

>  - Revoke hypervisor access to the MMIO

VFIO could choose never to mmap MMIO, so in this case nothing to do?

>  - Setup the vIOMMU to understand the vPCI device
>  - Take over control of some of the IOVA translation, at least for T=1,
>    and route to the the vIOMMU
>  - Register the vPCI with any attestation functions the VM might use
>  - Do some DOE stuff to manage/validate TDSIP/etc

Intel TDX Connect has a extra requirement for "unbind":

- Revoke KVM page table (S-EPT) for the MMIO only after TDISP
  CONFIG_UNLOCK

Another thing is, seems your term "bind" includes all steps for
shared -> private conversion. But in my mind, "bind" only includes
putting device in TDISP LOCK state & corresponding host setups required
by firmware. I.e "bind" means host lockes down the CC setup, waiting for
guest attestation.

While "unbind" means breaking CC setup, no matter the vPCI device is
already accepted as CC device, or only locked and waiting for attestation.

> 
> So we have interactions of things controlled by PCI, KVM, VFIO, and
> iommufd all mushed together.
> 
> iommufd is the only area that already has a handle to all the required
> objects:
>  - The physical PCI function
>  - The CC vIOMMU object
>  - The KVM FD
>  - The CC vPCI object
> 
> Which is why I have been thinking it is the right place to manage
> this.

Yeah, I see the merit.

> 
> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> stays in VFIO.

I'm not sure if Alexey's patch [1] illustates your idea. It calls
tsm_tdi_bind() which directly does device stuff, and impacts MMIO.
VFIO doesn't know about this.

I have to interpret this as VFIO firstly hand over device CC features
and MMIO resources to IOMMUFD, so VFIO never cares about them.

[1] https://lore.kernel.org/all/20250218111017.491719-15-...@amd.com/

> 
> > > > So your issue is you need to shoot down the dmabuf during vPCI device
> > > > destruction?
> > > 
> > > I assume "vPCI device" refers to assigned device in both shared mode &
> > > prvate mode. So no, I need to shoot down the dmabuf during TSM unbind,
> > > a.k.a. when assigned device is converting from private to shared.
> > > Then recover the dmabuf after TSM unbind. The device could still work
> > > in VM in shared mode.
> 
> What are you trying to protect with this? Is there some intelism where
> you can't have references to encrypted MMIO pages?
> 
> > > What I really want is, one SW component to manage MMIO dmabuf, secure
> > > iommu & TSM bind/unbind. So easier coordinate these 3 operations cause
> > > these ops are interconnected according to secure firmware's requirement.
> >
> > This SW component is QEMU. It knows about FLRs and other config
> > space things, it can destroy all these IOMMUFD objects and talk to
> > VFIO too, I've tried, so far it is looking easier to manage. Thanks,
> 
> Yes, qemu should be sequencing this. The kernel only needs to enforce
> any rules required to keep the system from crashing.

To keep from crashing, The kernel still needs to enforce some firmware
specific rules. That doesn't reduce the interactions between kernel
components. E.g. for TDX, if VFIO doesn't control "bind" but controls
MMIO, it should refuse FLR or MSE when device is bound. That means VFIO
should at least know from IOMMUFD whether device is bound.

Further more, these rules are platform firmware specific, "QEMU executes
kernel checks" means more SW components should be aware of these rules.
That multiples the effort.

And QEMU can be killed, means if kernel wants to reclaim all the
resources, it still have to deal with the sequencing. And I don't think
it is a good idea that kernel just stales large amount of resources.

Thanks,
Yilun
> 
> Jason

Reply via email to