On 13/5/25 20:03, Zhi Wang wrote:
On Mon, 12 May 2025 11:06:17 -0300
Jason Gunthorpe <j...@nvidia.com> wrote:
On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy wrote:
I'm surprised by this.. iommufd shouldn't be doing PCI stuff,
it is just about managing the translation control of the device.
I have a little difficulty understanding. Is TSM bind PCI stuff?
To me it is. Host sends PCI TDISP messages via PCI DOE to put the
device in TDISP LOCKED state, so that device behaves differently
from before. Then why put it in IOMMUFD?
"TSM bind" sets up the CPU side of it, it binds a VM to a piece of
IOMMU on the host CPU. The device does not know about the VM, it
just enables/disables encryption by a request from the CPU (those
start/stop interface commands). And IOMMUFD won't be doing DOE, the
platform driver (such as AMD CCP) will. Nothing to do for VFIO here.
We probably should notify VFIO about the state transition but I do
not know what VFIO would want to do in response.
We have an awkward fit for what CCA people are doing to the various
Linux APIs. Looking somewhat maximally across all the arches a "bind"
for a CC vPCI device creation operation does:
- Setup the CPU page tables for the VM to have access to the MMIO
- Revoke hypervisor access to the MMIO
- Setup the vIOMMU to understand the vPCI device
- Take over control of some of the IOVA translation, at least for
T=1, and route to the vIOMMU
- Register the vPCI with any attestation functions the VM might use
- Do some DOE stuff to manage/validate TDISP/etc
So we have interactions of things controlled by PCI, KVM, VFIO, and
iommufd all mushed together.
iommufd is the only area that already has a handle to all the required
objects:
- The physical PCI function
- The CC vIOMMU object
- The KVM FD
- The CC vPCI object
Which is why I have been thinking it is the right place to manage
this.
It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
stays in VFIO.
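
The two lists above can be summarized in a small sketch. This is only an illustration of the argument, not kernel code: `BindContext`, `BIND_STEPS`, `bind()` and the sample values are all hypothetical names, and the step order simply follows the list in the mail rather than any required sequence.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; the strings mirror the list in the thread.
BIND_STEPS = (
    "set up CPU page tables so the VM can access the MMIO",
    "revoke hypervisor access to the MMIO",
    "set up the vIOMMU to understand the vPCI device",
    "take over IOVA translation for T=1 and route it to the vIOMMU",
    "register the vPCI device with the VM's attestation functions",
    "run the DOE exchange that manages/validates TDISP",
)


@dataclass
class BindContext:
    """The four objects the thread says iommufd already has handles to."""
    pci_function: str
    cc_viommu: str
    kvm_fd: int
    cc_vpci: str
    done: list = field(default_factory=list)


def bind(ctx: BindContext) -> list:
    """Record each step in list order.

    A real implementation would call into PCI, KVM, VFIO and iommufd
    here; this sketch does not model any of those calls.
    """
    for step in BIND_STEPS:
        ctx.done.append(step)
    return ctx.done


# Sample values are placeholders, not taken from the thread.
ctx = BindContext("0000:01:00.0", "viommu0", 3, "vpci0")
steps = bind(ctx)
```

The point of the sketch is that a single context object holding all four handles can drive every step, which is the reason given above for placing the operation in iommufd.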
So your issue is you need to shoot down the dmabuf during vPCI
device destruction?
I assume "vPCI device" refers to an assigned device in both shared
mode & private mode. So no, I need to shoot down the dmabuf during
TSM unbind, a.k.a. when the assigned device is converting from
private to shared. Then recover the dmabuf after TSM unbind. The
device could still work in VM in shared mode.
What are you trying to protect with this? Is there some intelism where
you can't have references to encrypted MMIO pages?
I think it is a matter of design choice. The encrypted MMIO page is
related to the TDI context and the secure second-level translation table
(S-EPT), and the S-EPT is related to the confidential VM's context.
AMD and ARM have another level of HW control which, together
with a TSM-owned meta table, can simply mask out access to those
encrypted MMIO pages. Thus, the life cycle of the encrypted mappings in
the second-level translation table can be decoupled from the TDI
unbind. They can be reaped harmlessly later by the hypervisor in another
path.
The Intel platform, by contrast, doesn't have that additional level of
HW control by design. Thus, the cleanup of the encrypted MMIO page mapping
in the S-EPT has to be tightly coupled with TDI context destruction in
the TDI unbind process.
If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
cross-module notification to KVM to do cleanup in the S-EPT.
QEMU should know about this unbind and can tell KVM about it too. No cross
module notification needed, it is not a hot path.
So shooting down the DMABUF object (encrypted MMIO page) means shooting
down the S-EPT mapping and recovering the DMABUF object means
reconstructing the non-encrypted MMIO mapping in the EPT after the TDI is
unbound.
It is definitely QEMU's job to re-mmap the MMIO to userspace (as it does for
non-trusted devices today) so that a later nested page fault can fill the
nested PTE. Thanks,
Z.
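
The platform difference discussed above can be modelled roughly as follows. This is a toy state model under stated assumptions, not how any TSM or KVM code is written: `Tdi`, `tsm_unbind()`, `reap_later()` and the `coupled_cleanup` flag are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tdi:
    coupled_cleanup: bool          # True models the Intel case described above
    bound: bool = True
    secure_mapping: bool = True    # encrypted MMIO mapping in the secure table
    access_masked: bool = False    # extra HW control (the AMD/ARM case)


def tsm_unbind(tdi: Tdi) -> None:
    if tdi.coupled_cleanup:
        # No extra HW control: the mapping must go away as part of unbind.
        tdi.secure_mapping = False
    else:
        # Mask access now and leave the stale mapping to be reaped later.
        tdi.access_masked = True
    tdi.bound = False


def reap_later(tdi: Tdi) -> None:
    """Deferred cleanup path, only meaningful in the decoupled case."""
    if not tdi.bound:
        tdi.secure_mapping = False


intel_like = Tdi(coupled_cleanup=True)
tsm_unbind(intel_like)

masked = Tdi(coupled_cleanup=False)
tsm_unbind(masked)
stale_before_reap = masked.secure_mapping   # mapping survives the unbind
reap_later(masked)
```

In the coupled case the mapping teardown happens inside the unbind itself, which is why whoever triggers the unbind also has to get the S-EPT cleaned up.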
What I really want is one SW component to manage the MMIO dmabuf,
secure iommu & TSM bind/unbind, so it is easier to coordinate these 3
operations, because they are interconnected according to the secure
firmware's requirements.
This SW component is QEMU. It knows about FLRs and other config
space things, it can destroy all these IOMMUFD objects and talk to
VFIO too. I've tried it, and so far it is looking easier to manage. Thanks,
Yes, qemu should be sequencing this. The kernel only needs to enforce
any rules required to keep the system from crashing.
Jason
--
Alexey
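
The division of labour the thread converges on (the VMM sequences the private-to-shared transition, the kernel only rejects unsafe orderings) might look like this in outline. Everything here is an assumption made for illustration: `KernelModel`, its methods and the specific ordering rule it enforces are hypothetical and do not correspond to real iommufd, VFIO or KVM interfaces.

```python
class KernelModel:
    """Toy stand-in for the kernel side; the enforced rule is an assumption."""

    def __init__(self):
        self.bound = True
        self.dmabuf_mapped = True   # private (encrypted) MMIO dmabuf in use
        self.shared_mmap = False

    def revoke_dmabuf(self):
        self.dmabuf_mapped = False

    def tsm_unbind(self):
        # Refuse rather than crash if userspace got the order wrong.
        if self.dmabuf_mapped:
            raise RuntimeError("dmabuf still mapped; refusing unbind")
        self.bound = False

    def mmap_shared(self):
        if self.bound:
            raise RuntimeError("still bound; cannot map MMIO as shared")
        self.shared_mmap = True


def vmm_private_to_shared(k: KernelModel) -> None:
    """Order a VMM such as QEMU might use (hypothetical)."""
    k.revoke_dmabuf()   # shoot down the encrypted MMIO dmabuf
    k.tsm_unbind()      # TDI leaves the bound state
    k.mmap_shared()     # re-mmap MMIO so later faults fill the nested PTE


good = KernelModel()
vmm_private_to_shared(good)

bad = KernelModel()
try:
    bad.tsm_unbind()    # wrong order: dmabuf not revoked yet
    rejected = False
except RuntimeError:
    rejected = True
```

No cross-module notification appears in the sketch: the ordering lives entirely in the userspace function, and the kernel model only validates state on each call.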