On Thu, 15 May 2025 16:44:47 +0000
Zhi Wang <z...@nvidia.com> wrote:

> On 15.5.2025 13.29, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 13/5/25 20:03, Zhi Wang wrote:
> >> On Mon, 12 May 2025 11:06:17 -0300
> >> Jason Gunthorpe <j...@nvidia.com> wrote:
> >>
> >>> On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
> >>> wrote:
> >>>
> >>>>>> I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
> >>>>>> it is just about managing the translation control of the
> >>>>>> device.
> >>>>>
> >>>>> I have some difficulty understanding this. Is TSM bind PCI stuff?
> >>>>> To me it is. Host sends PCI TDISP messages via PCI DOE to put
> >>>>> the device in TDISP LOCKED state, so that device behaves
> >>>>> differently from before. Then why put it in IOMMUFD?
> >>>>
> >>>>
> >>>> "TSM bind" sets up the CPU side of it, it binds a VM to a piece
> >>>> of IOMMU on the host CPU. The device does not know about the VM,
> >>>> it just enables/disables encryption by a request from the CPU
> >>>> (those start/stop interface commands). And IOMMUFD won't be
> >>>> doing DOE, the platform driver (such as AMD CCP) will. Nothing
> >>>> to do for VFIO here.
> >>>>
> >>>> We probably should notify VFIO about the state transition but I
> >>>> do not know what VFIO would want to do in response.
> >>>
> >>> We have an awkward fit for what CCA people are doing to the
> >>> various Linux APIs. Looking somewhat maximally across all the
> >>> arches a "bind" for a CC vPCI device creation operation does:
> >>>
> >>>   - Setup the CPU page tables for the VM to have access to the
> >>> MMIO
> >>>   - Revoke hypervisor access to the MMIO
> >>>   - Setup the vIOMMU to understand the vPCI device
> >>>   - Take over control of some of the IOVA translation, at least
> >>> for T=1, and route to the vIOMMU
> >>>   - Register the vPCI with any attestation functions the VM might
> >>> use
> >>>   - Do some DOE stuff to manage/validate TDISP/etc
> >>>
> >>> So we have interactions of things controlled by PCI, KVM, VFIO,
> >>> and iommufd all mushed together.
> >>>
> >>> iommufd is the only area that already has a handle to all the
> >>> required objects:
> >>>   - The physical PCI function
> >>>   - The CC vIOMMU object
> >>>   - The KVM FD
> >>>   - The CC vPCI object
> >>>
> >>> Which is why I have been thinking it is the right place to manage
> >>> this.
> >>>
> >>> It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> >>> stays in VFIO.
> >>>
> >>>>>> So your issue is you need to shoot down the dmabuf during vPCI
> >>>>>> device destruction?
> >>>>>
> >>>>> I assume "vPCI device" refers to assigned device in both shared
> >>>>> mode & private mode. So no, I need to shoot down the dmabuf
> >>>>> during TSM unbind, a.k.a. when assigned device is converting
> >>>>> from private to shared. Then recover the dmabuf after TSM
> >>>>> unbind. The device could still work in VM in shared mode.
> >>>
> >>> What are you trying to protect with this? Is there some intelism
> >>> where you can't have references to encrypted MMIO pages?
> >>>
> >>
> >> I think it is a matter of design choice. The encrypted MMIO page is
> >> related to the TDI context and the secure second-level translation
> >> table (S-EPT), and the S-EPT is related to the confidential VM's
> >> context.
> >>
> >> AMD and ARM have another level of HW control which, together
> >> with a TSM-owned meta table, can simply mask out the access to
> >> those encrypted MMIO pages. Thus, the life cycle of the encrypted
> >> mappings in the second-level translation table can be decoupled
> >> from the TDI unbind. They can be reaped harmlessly later by the
> >> hypervisor in another path.
> >>
> >> The Intel platform, by design, doesn't have that additional level
> >> of HW control. Thus, the cleanup of the encrypted MMIO page
> >> mapping in the S-EPT has to be coupled tightly with TDI context
> >> destruction in the TDI unbind process.
> >>
> >> If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
> >> cross-module notification to KVM to do cleanup in the S-EPT.
> > 
> > QEMU should know about this unbind and can tell KVM about it too.
> > No cross module notification needed, it is not a hot path.
> > 
> 
> Yes. QEMU knows almost everything important, it can do the required
> flow and the kernel can enforce the requirements. There shouldn't be
> a problem at runtime.
> 
> But if QEMU crashes, all that is left are the fd closing paths
> and the objects that the fds represent in the kernel. The modules
> those fds belong to need to resolve the dependencies of tearing
> down objects without the help of QEMU.
> 
> There will be private MMIO dmabuf fds, VFIO fds, IOMMU device fd, KVM
> fds at that time. Who should trigger the TDI unbind at this time?
> 
> I think it should be triggered in the vdevice teardown path, in the
> iommufd fd closing path, as that is where the bind is initiated.
> 
> iommufd vdevice tear down (iommu fd closing path)
>      ----> tsm_tdi_unbind
>          ----> intel_tsm_tdi_unbind
>              ...
>              ----> private MMIO unmapping in KVM
>                  ----> cleanup private MMIO mapping in S-EPT and
>                        others
>                  ----> signal MMIO dmabuf can be safely removed.
>                         ^TVM teardown path (dmabuf uninstall path)
>                         checks this state and waits before it can
>                         decrease the dmabuf fd refcount
>              ...
>          ----> KVM TVM fd put
>      ----> continue iommufd vdevice teardown.
> 
> Also, I think we need:
> 
> iommufd vdevice TSM bind
>      ---> tsm_tdi_bind
>          ----> intel_tsm_tdi_bind
>              ...
>              ----> KVM TVM fd get

Indentation problem: I mean the KVM TVM fd get is in tsm_tdi_bind(),
not under intel_tsm_tdi_bind(). I see your code already has it there.
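
To make the intended pairing concrete, here is a rough userspace model
of the get/put ordering described above. It is only a sketch: every
struct and function name below (struct tvm, struct vdevice, model_*) is
made up for illustration and is not the actual kernel code. It assumes
the TVM reference is taken in the generic bind step and dropped only
after the vendor unbind has cleaned up the private MMIO mapping and
signalled that the dmabuf can be removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real kernel structures. */
struct tvm {
	int refcount;              /* models the KVM TVM fd reference count */
	bool private_mmio_mapped;  /* models the private MMIO mapping in S-EPT */
};

struct vdevice {
	struct tvm *tvm;           /* non-NULL while the TDI is bound */
	bool dmabuf_removable;     /* models "dmabuf can be safely removed" */
};

static void tvm_get(struct tvm *t)
{
	t->refcount++;
}

static void tvm_put(struct tvm *t)
{
	assert(t->refcount > 0);
	t->refcount--;
}

/* Vendor-level steps: only mapping state, no reference handling. */
static void model_intel_tdi_bind(struct vdevice *vd)
{
	vd->tvm->private_mmio_mapped = true;
	vd->dmabuf_removable = false;
}

static void model_intel_tdi_unbind(struct vdevice *vd)
{
	vd->tvm->private_mmio_mapped = false;  /* cleanup in the S-EPT */
	vd->dmabuf_removable = true;           /* signal the dmabuf path */
}

/* Generic bind: the TVM reference is taken here, before vendor code. */
static void model_tsm_tdi_bind(struct vdevice *vd, struct tvm *t)
{
	tvm_get(t);
	vd->tvm = t;
	model_intel_tdi_bind(vd);
}

/* Generic unbind: vendor cleanup first, then drop the TVM reference. */
static void model_tsm_tdi_unbind(struct vdevice *vd)
{
	struct tvm *t = vd->tvm;

	if (!t)
		return;  /* not bound, nothing to do */
	model_intel_tdi_unbind(vd);
	vd->tvm = NULL;
	tvm_put(t);
}

/* fd closing path: teardown triggers the unbind without QEMU's help. */
static void model_vdevice_teardown(struct vdevice *vd)
{
	model_tsm_tdi_unbind(vd);
	/* ... continue vdevice teardown ... */
}
```

In this model, a second teardown after the unbind is a no-op, which is
the property the fd closing path would need if QEMU has already done
the unbind explicitly before crashing.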

>              ...
> 
> Z.
> 
> > 
> >> So shooting down the DMABUF object (encrypted MMIO page) means
> >> shooting down the S-EPT mapping, and recovering the DMABUF object
> >> means reconstructing the non-encrypted MMIO mapping in the EPT
> >> after the TDI is unbound.
> > 
> > This is definitely QEMU's job to re-mmap MMIO to the userspace (as
> > it does for non-trusted devices today) so later on nested page
> > fault could fill the nested PTE. Thanks,
> > 
> > 
> >>
> >> Z.
> >>
> >>>>> What I really want is one SW component to manage MMIO dmabuf,
> >>>>> secure iommu & TSM bind/unbind. That makes it easier to
> >>>>> coordinate these 3 operations, because they are interconnected
> >>>>> according to the secure firmware's requirements.
> >>>>
> >>>> This SW component is QEMU. It knows about FLRs and other config
> >>>> space things, it can destroy all these IOMMUFD objects and talk
> >>>> to VFIO too, I've tried, so far it is looking easier to manage.
> >>>> Thanks,
> >>>
> >>> Yes, qemu should be sequencing this. The kernel only needs to
> >>> enforce any rules required to keep the system from crashing.
> >>>
> >>> Jason
> >>>
> >>
> > 
> 
