On Thu, 15 May 2025 16:44:47 +0000
Zhi Wang <z...@nvidia.com> wrote:

> On 15.5.2025 13.29, Alexey Kardashevskiy wrote:
> >
> >
> > On 13/5/25 20:03, Zhi Wang wrote:
> >> On Mon, 12 May 2025 11:06:17 -0300
> >> Jason Gunthorpe <j...@nvidia.com> wrote:
> >>
> >>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
> >>> wrote:
> >>>
> >>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
> >>>>>> it is just about managing the translation control of the
> >>>>>> device.
> >>>>>
> >>>>> I have a little difficulty to understand. Is TSM bind PCI stuff?
> >>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put
> >>>>> the device in TDISP LOCKED state, so that device behaves
> >>>>> differently from before. Then why put it in IOMMUFD?
> >>>>
> >>>>
> >>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece
> >>>> of IOMMU on the host CPU. The device does not know about the VM,
> >>>> it just enables/disables encryption by a request from the CPU
> >>>> (those start/stop interface commands). And IOMMUFD won't be
> >>>> doing DOE, the platform driver (such as AMD CCP) will. Nothing
> >>>> to do for VFIO here.
> >>>>
> >>>> We probably should notify VFIO about the state transition but I
> >>>> do not know what VFIO would want to do in response.
> >>>
> >>> We have an awkward fit for what CCA people are doing to the
> >>> various Linux APIs.
> >>> Looking somewhat maximally across all the
> >>> arches a "bind" for a CC vPCI device creation operation does:
> >>>
> >>>  - Setup the CPU page tables for the VM to have access to the
> >>>    MMIO
> >>>  - Revoke hypervisor access to the MMIO
> >>>  - Setup the vIOMMU to understand the vPCI device
> >>>  - Take over control of some of the IOVA translation, at least
> >>>    for T=1, and route to the vIOMMU
> >>>  - Register the vPCI with any attestation functions the VM might
> >>>    use
> >>>  - Do some DOE stuff to manage/validate TDISP/etc
> >>>
> >>> So we have interactions of things controlled by PCI, KVM, VFIO,
> >>> and iommufd all mushed together.
> >>>
> >>> iommufd is the only area that already has a handle to all the
> >>> required objects:
> >>>  - The physical PCI function
> >>>  - The CC vIOMMU object
> >>>  - The KVM FD
> >>>  - The CC vPCI object
> >>>
> >>> Which is why I have been thinking it is the right place to manage
> >>> this.
> >>>
> >>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> >>> stays in VFIO.
> >>>
> >>>>>> So your issue is you need to shoot down the dmabuf during vPCI
> >>>>>> device destruction?
> >>>>>
> >>>>> I assume "vPCI device" refers to assigned device in both shared
> >>>>> mode & private mode. So no, I need to shoot down the dmabuf
> >>>>> during TSM unbind, a.k.a. when assigned device is converting
> >>>>> from private to shared. Then recover the dmabuf after TSM
> >>>>> unbind. The device could still work in VM in shared mode.
> >>>
> >>> What are you trying to protect with this? Is there some intelism
> >>> where you can't have references to encrypted MMIO pages?
> >>>
> >>
> >> I think it is a matter of design choice. The encrypted MMIO page is
> >> related to the TDI context and secure second level translation
> >> table (S-EPT), and S-EPT is related to the confidential VM's
> >> context.
> >>
> >> AMD and ARM have another level of HW control which, together
> >> with a TSM-owned meta table, can simply mask out the access to
> >> those encrypted MMIO pages. Thus, the life cycle of the encrypted
> >> mappings in the second level translation table can be de-coupled
> >> from the TDI unbind. They can be reaped harmlessly later by the
> >> hypervisor in another path.
> >>
> >> While on Intel platform, it doesn't have that additional level of
> >> HW control by design. Thus, the cleanup of encrypted MMIO page
> >> mapping in the S-EPT has to be coupled tightly with TDI context
> >> destruction in the TDI unbind process.
> >>
> >> If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
> >> cross-module notification to KVM to do cleanup in the S-EPT.
> >
> > QEMU should know about this unbind and can tell KVM about it too.
> > No cross module notification needed, it is not a hot path.
> >
>
> Yes. QEMU knows almost everything important, it can do the required
> flow and kernel can enforce the requirements. There shouldn't be a
> problem at runtime.
>
> But if QEMU crashes, what is left are only fd closing paths
> and objects that fds represent in the kernel. The modules those fds
> belong to need to solve the dependencies of tearing down objects
> without the help of QEMU.
>
> There will be private MMIO dmabuf fds, VFIO fds, IOMMU device fd, KVM
> fds at that time. Who should trigger the TDI unbind at this time?
>
> I think it should be triggered in the vdevice teardown path in IOMMUfd
> fd closing path, as it is where the bind is initiated.
>
> iommufd vdevice tear down (iommu fd closing path)
>     ----> tsm_tdi_unbind
>     ----> intel_tsm_tdi_unbind
>     ...
>     ----> private MMIO un-mapping in KVM
>     ----> cleanup private MMIO mapping in S-EPT and others
>     ----> signal MMIO dmabuf can be safely removed.
>           ^TVM teardown path (dmabuf uninstall path) checks this
>            state and waits before it can decrease the dmabuf fd
>            refcount
>     ...
>     ----> KVM TVM fd put
>     ----> continue iommufd vdevice teardown.
>
> Also, I think we need:
>
> iommufd vdevice TSM bind
>     ---> tsm_tdi_bind
>     ----> intel_tsm_tdi_bind
>     ...
>     ----> KVM TVM fd get

Indent problem: I mean the KVM TVM fd get is in tsm_tdi_bind(). I saw
your code already has it there.

>     ...
>
> Z.
>
> >
> >> So shooting down the DMABUF object (encrypted MMIO page) means
> >> shooting down the S-EPT mapping and recovering the DMABUF object
> >> means re-constructing the non-encrypted MMIO mapping in the EPT
> >> after the TDI is unbound.
> >
> > This is definitely QEMU's job to re-mmap MMIO to the userspace (as
> > it does for non-trusted devices today) so later on nested page
> > fault could fill the nested PTE. Thanks,
> >
> >
> >>
> >> Z.
> >>
> >>>>> What I really want is, one SW component to manage MMIO dmabuf,
> >>>>> secure iommu & TSM bind/unbind. So easier coordinate these 3
> >>>>> operations cause these ops are interconnected according to
> >>>>> secure firmware's requirement.
> >>>>
> >>>> This SW component is QEMU. It knows about FLRs and other config
> >>>> space things, it can destroy all these IOMMUFD objects and talk
> >>>> to VFIO too, I've tried, so far it is looking easier to manage.
> >>>> Thanks,
> >>>
> >>> Yes, qemu should be sequencing this. The kernel only needs to
> >>> enforce any rules required to keep the system from crashing.
> >>>
> >>> Jason
> >>>
> >>
> >
>
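
For readers following the ordering argument, the teardown dependency
discussed in this thread can be modelled as a small userspace sketch.
This is not kernel code and not the proposed implementation: the
classes Ref and VDevice, their methods, and the log strings are all
hypothetical names made up for illustration. The sketch also defers
the dmabuf release instead of waiting, which is a simplification. It
only shows the rule that bind takes a KVM reference, and that the
dmabuf reference may be dropped only after unbind has removed the
private MMIO mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Ref:
    """Hypothetical refcount standing in for a kernel object reference."""
    name: str
    count: int = 1  # the reference held by the owning fd

    def get(self) -> None:
        self.count += 1

    def put(self) -> None:
        if self.count <= 0:
            raise RuntimeError(f"refcount underflow on {self.name}")
        self.count -= 1


@dataclass
class VDevice:
    """Hypothetical model of an iommufd vdevice with a bound TDI."""
    kvm: Ref
    dmabuf: Ref
    log: list = field(default_factory=list)
    bound: bool = False
    dmabuf_removable: bool = True

    def tsm_bind(self) -> None:
        # Bind pins the TVM so it cannot go away while the TDI is bound.
        self.kvm.get()
        self.bound = True
        self.dmabuf_removable = False  # private MMIO is now mapped
        self.log.append("bind: kvm get")

    def teardown(self) -> None:
        # Runs from the fd closing path, with or without userspace help.
        if self.bound:
            self.log.append("unbind: unmap private MMIO")
            self.dmabuf_removable = True
            self.log.append("unbind: dmabuf removable")
            self.kvm.put()
            self.bound = False
            self.log.append("unbind: kvm put")
        self.log.append("vdevice destroyed")

    def release_dmabuf(self) -> bool:
        # The dmabuf uninstall path must not drop its reference while
        # the private mapping still exists (modelled here as a deferral).
        if not self.dmabuf_removable:
            self.log.append("dmabuf release deferred")
            return False
        self.dmabuf.put()
        self.log.append("dmabuf released")
        return True


def simulate_crash_teardown() -> VDevice:
    """Close the fds in an awkward order, as after a userspace crash."""
    dev = VDevice(kvm=Ref("kvm"), dmabuf=Ref("dmabuf"))
    dev.tsm_bind()
    dev.release_dmabuf()  # too early, must be deferred
    dev.teardown()        # the unbind happens here
    dev.release_dmabuf()  # now allowed
    dev.kvm.put()         # the KVM fd's own reference goes last
    return dev
```

Calling simulate_crash_teardown() and reading dev.log shows the
deferred dmabuf release followed by the unbind steps, with both
refcounts back at zero at the end.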