On Mon, May 12, 2025 at 11:06:17AM -0300, Jason Gunthorpe wrote:
> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
>
> > > > I'm surprised by this.. iommufd shouldn't be doing PCI stuff, it is
> > > > just about managing the translation control of the device.
> > >
> > > I have a little difficulty to understand. Is TSM bind PCI stuff? To me
> > > it is. Host sends PCI TDISP messages via PCI DOE to put the device in
> > > TDISP LOCKED state, so that device behaves differently from before. Then
> > > why put it in IOMMUFD?
> >
> >
> > "TSM bind" sets up the CPU side of it, it binds a VM to a piece of
> > IOMMU on the host CPU. The device does not know about the VM, it
> > just enables/disables encryption by a request from the CPU (those
> > start/stop interface commands). And IOMMUFD won't be doing DOE, the
> > platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
> >
> > We probably should notify VFIO about the state transition but I do
> > not know VFIO would want to do in response.
>
> We have an awkward fit for what CCA people are doing to the various
> Linux APIs. Looking somewhat maximally across all the arches a "bind"
> for a CC vPCI device creation operation does:
>
> - Setup the CPU page tables for the VM to have access to the MMIO

This is a guest-side thing, isn't it? Does the host need to opt in to
anything?

> - Revoke hypervisor access to the MMIO

VFIO could choose never to mmap the MMIO, so in that case there is
nothing to do?

> - Setup the vIOMMU to understand the vPCI device
> - Take over control of some of the IOVA translation, at least for T=1,
>   and route to the the vIOMMU
> - Register the vPCI with any attestation functions the VM might use
> - Do some DOE stuff to manage/validate TDSIP/etc

Intel TDX Connect has an extra requirement for "unbind":

- Revoke KVM page table (S-EPT) for the MMIO only after TDISP CONFIG_UNLOCK

Another thing: it seems your term "bind" includes all the steps for
shared -> private conversion. But in my mind, "bind" only includes
putting the device in the TDISP LOCKED state and the corresponding host
setup required by firmware. I.e. "bind" means the host locks down the CC
setup and waits for guest attestation, while "unbind" means breaking the
CC setup, no matter whether the vPCI device is already accepted as a CC
device or is only locked and waiting for attestation.

>
> So we have interactions of things controlled by PCI, KVM, VFIO, and
> iommufd all mushed together.
>
> iommufd is the only area that already has a handle to all the required
> objects:
> - The physical PCI function
> - The CC vIOMMU object
> - The KVM FD
> - The CC vPCI object
>
> Which is why I have been thinking it is the right place to manage
> this.

Yeah, I see the merit.

>
> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> stays in VFIO.

I'm not sure if Alexey's patch [1] illustrates your idea. It calls
tsm_tdi_bind(), which directly does device stuff and impacts MMIO.
VFIO doesn't know about this.

I have to interpret this as VFIO first handing over the device's CC
features and MMIO resources to IOMMUFD, so that VFIO never cares about
them.

[1] https://lore.kernel.org/all/20250218111017.491719-15-...@amd.com/

>
> > > > So your issue is you need to shoot down the dmabuf during vPCI device
> > > > destruction?
> > >
> > > I assume "vPCI device" refers to assigned device in both shared mode &
> > > prvate mode. So no, I need to shoot down the dmabuf during TSM unbind,
> > > a.k.a. when assigned device is converting from private to shared.
> > > Then recover the dmabuf after TSM unbind. The device could still work
> > > in VM in shared mode.
>
> What are you trying to protect with this? Is there some intelism where
> you can't have references to encrypted MMIO pages?
>
> > > What I really want is, one SW component to manage MMIO dmabuf, secure
> > > iommu & TSM bind/unbind. So easier coordinate these 3 operations cause
> > > these ops are interconnected according to secure firmware's requirement.
> >
> > This SW component is QEMU. It knows about FLRs and other config
> > space things, it can destroy all these IOMMUFD objects and talk to
> > VFIO too, I've tried, so far it is looking easier to manage. Thanks,
>
> Yes, qemu should be sequencing this. The kernel only needs to enforce
> any rules required to keep the system from crashing.

To keep the system from crashing, the kernel still needs to enforce some
firmware-specific rules. That doesn't reduce the interactions between
kernel components. E.g. for TDX, if VFIO doesn't control "bind" but
controls MMIO, it should refuse FLR or MSE changes while the device is
bound. That means VFIO should at least know from IOMMUFD whether the
device is bound.

Furthermore, these rules are platform-firmware specific. "QEMU
executes, kernel checks" means more SW components have to be aware of
these rules. That multiplies the effort.

And QEMU can be killed, which means that if the kernel wants to reclaim
all the resources, it still has to deal with the sequencing. And I don't
think it is a good idea for the kernel to just leave a large amount of
resources stale.

Thanks,
Yilun

>
> Jason
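
A rough userspace C model of the two rules discussed above, for
illustration only: VFIO refusing FLR/MSE while the device is bound, and
the private MMIO mapping being revoked only after unlock on unbind.
Everything here is an assumption made for the sketch. struct demo_vdev
and all demo_*() helpers are hypothetical names, not existing kernel,
VFIO, or IOMMUFD interfaces, and "refuse MSE" is read as refusing to
clear Memory Space Enable.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an assigned device; not a real kernel structure. */
struct demo_vdev {
	bool bound;		/* TSM bind done: device locked (or later) */
	bool private_mmio;	/* private MMIO mapping (e.g. S-EPT) present */
};

/* Hypothetical query: VFIO asks the owner of "bind" for the state. */
static bool demo_iommufd_is_bound(const struct demo_vdev *vdev)
{
	return vdev->bound;
}

/* Bind: lock down the CC setup. Binding an already bound device fails. */
static int demo_tsm_bind(struct demo_vdev *vdev)
{
	if (vdev->bound)
		return -EBUSY;
	vdev->bound = true;
	return 0;
}

/* Private MMIO may only be mapped while the device is bound. */
static int demo_map_private_mmio(struct demo_vdev *vdev)
{
	if (!vdev->bound)
		return -EINVAL;
	vdev->private_mmio = true;
	return 0;
}

/* Models "revoke S-EPT only after CONFIG_UNLOCK": refused while bound. */
static int demo_revoke_private_mmio(struct demo_vdev *vdev)
{
	if (vdev->bound)
		return -EBUSY;
	vdev->private_mmio = false;
	return 0;
}

/* Unbind: unlock first, then revoke the private mapping, in that order. */
static int demo_tsm_unbind(struct demo_vdev *vdev)
{
	if (!vdev->bound)
		return -EINVAL;
	vdev->bound = false;	/* stands in for TDISP CONFIG_UNLOCK */
	return demo_revoke_private_mmio(vdev);
}

/* VFIO side: FLR is refused while the device is bound. */
static int demo_vfio_flr(const struct demo_vdev *vdev)
{
	return demo_iommufd_is_bound(vdev) ? -EBUSY : 0;
}

/* VFIO side: clearing MSE is refused while the device is bound. */
static int demo_vfio_clear_mse(const struct demo_vdev *vdev)
{
	return demo_iommufd_is_bound(vdev) ? -EBUSY : 0;
}
```

The only cross-component dependency in this model is
demo_iommufd_is_bound(): whichever component owns "bind", the one that
owns FLR and MMIO has to be able to ask for that state.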