Hi,

Per Jason Wang's suggestion, the iommufd nesting series [1] is split into
an "Enable stage-1 translation for emulated device" series and an
"Enable stage-1 translation for passthrough device" series.

This series is the 2nd part, focusing on passthrough devices. We don't
shadow the guest page table for a passthrough device; instead we pass the
stage-1 page table to the host side to construct a nested domain. There
was some effort to enable this feature in the old days, see [2] for
details.

The key design is to utilize the dual-stage IOMMU translation (also known
as IOMMU nested translation) capability in the host IOMMU. As the diagram
below shows, the guest I/O page table pointer in GPA (guest physical
address) is passed to the host and used to perform the stage-1 address
translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------.
        |             |   | Stage2 for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------'
        '-------------'
For historical reasons, there are different namings in different VT-d
spec revisions, where:
 - Stage1 = First stage = First level = flts
 - Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>

There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance to vIOMMU at the vfio device realize stage.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind a device to IOMMUFD backed domains, either nested
  domain or not.

See the diagram below:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.

Based on Yi's suggestion, this design shares ioas/hwpt whenever possible
and creates a new one on demand. It also supports multiple iommufd objects
and ERRATA_772415.

E.g., under one guest's scope, a stage-2 page table can be shared by
different devices if there is no conflict and the devices link to the same
iommufd object, i.e. devices under the same host IOMMU can share the same
stage-2 page table. If there is a conflict, i.e. one device is in non
cache coherency mode, unlike the others, it requires a separate stage-2
page table in non-CC mode.

The SPR platform has ERRATA_772415, which requires no readonly mappings
in the stage-2 page table. This series supports creating a
VTDIOASContainer with no readonly mappings. In the rare case that some
IOMMUs on a multi-IOMMU host have ERRATA_772415 and others do not, this
design still works.

See the example diagram below for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.

This series is also prerequisite work for vSVA, i.e. sharing the guest
application address space with passthrough devices.

To enable stage-1 translation, only "x-scalable-mode=on,x-flts=on" needs
to be added, i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...

A passthrough device should use the iommufd backend to work with stage-1
translation, i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU will fail and report
it as unsupported.

Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

Fault reporting isn't supported in this series; we presume the guest
kernel always constructs a correct S1 page table for the passthrough
device. For emulated devices, the emulation code already provides S1
fault injection.

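
As a reading aid for the stage-2 sharing rules described above (one
container per iommufd object and readonly policy, one stage-2 hwpt per
cache coherency mode), here is a minimal standalone C sketch. It is not
code from this series: the type names Container and S2Hwpt and the
functions container_get(), s2_hwpt_get() and device_attach_s2() are
hypothetical stand-ins, and no real iommufd call is made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a stage-2 hwpt; not the series' VTDS2Hwpt. */
typedef struct S2Hwpt {
    bool cc;                /* true: cache coherent, false: non-CC */
    int users;              /* number of devices attached */
    struct S2Hwpt *next;
} S2Hwpt;

/* Hypothetical stand-in for an IOAS container; not VTDIOASContainer. */
typedef struct Container {
    int iommufd_id;         /* which iommufd object backs this container */
    bool rw_only;           /* true: no readonly mappings (errata case) */
    S2Hwpt *hwpts;
    struct Container *next;
} Container;

/* Reuse a container with the same iommufd and readonly policy, else create. */
static Container *container_get(Container **head, int iommufd_id, bool rw_only)
{
    Container *c;

    for (c = *head; c; c = c->next) {
        if (c->iommufd_id == iommufd_id && c->rw_only == rw_only) {
            return c;
        }
    }
    c = calloc(1, sizeof(*c));
    if (!c) {
        return NULL;
    }
    c->iommufd_id = iommufd_id;
    c->rw_only = rw_only;
    c->next = *head;
    *head = c;
    return c;
}

/* Reuse a stage-2 hwpt with the same coherency mode, else create one. */
static S2Hwpt *s2_hwpt_get(Container *c, bool cc)
{
    S2Hwpt *h;

    for (h = c->hwpts; h; h = h->next) {
        if (h->cc == cc) {
            h->users++;
            return h;
        }
    }
    h = calloc(1, sizeof(*h));
    if (!h) {
        return NULL;
    }
    h->cc = cc;
    h->users = 1;
    h->next = c->hwpts;
    c->hwpts = h;
    return h;
}

/* Pick the stage-2 hwpt for a device; errata means "no readonly mappings". */
static S2Hwpt *device_attach_s2(Container **head, int iommufd_id,
                                bool errata, bool cc)
{
    Container *c;

    assert(head != NULL);
    c = container_get(head, iommufd_id, errata);
    return c ? s2_hwpt_get(c, cc) : NULL;
}
```

With this sketch, two CC devices on iommufd0 without the errata end up on
the same S2Hwpt, while a non-CC device, an errata-affected device, or a
device on iommufd1 each get a separate one, which mirrors the diagram.
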
PATCH1-6:   Add HWPT-based nesting infrastructure support
PATCH7-8:   Some cleanup work
PATCH9:     cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH10-20: Implement stage-1 page table for passthrough device
PATCH21:    Enable stage-1 translation for passthrough device

QEMU code can be found at [3].

TODO:
- RAM discard
- dirty tracking on stage-2 page table
- Fault report to guest when HW Stage-1 faults

[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv3

Thanks
Zhenzhong

Changelog:
rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace

Yi Liu (3):
  intel_iommu: Replay pasid binds after context cache invalidation
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed

Zhenzhong Duan (18):
  backends/iommufd: Add a helper to invalidate user-managed HWPT
  vfio/iommufd: Add properties and handlers to
    TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  vfio/iommufd: Initialize iommufd specific members in
    HostIOMMUDeviceIOMMUFD
  vfio/iommufd: Implement [at|de]tach_hwpt handlers
  vfio/iommufd: Save vendor specific device info
  iommufd: Implement query of host VTD IOMMU's capability
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Optimize context entry cache utilization
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  intel_iommu: Handle PASID entry removing and updating
  intel_iommu: Handle PASID entry adding
  intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  intel_iommu: Bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround
  intel_iommu: Bypass replay in stage-1 page table mode
  intel_iommu: Enable host device when x-flts=on in scalable mode

 hw/i386/intel_iommu_internal.h     |   56 +
 include/hw/i386/intel_iommu.h      |   33 +-
 include/system/host_iommu_device.h |   32 +
 include/system/iommufd.h           |   54 +
 backends/iommufd.c                 |   94 +-
 hw/i386/intel_iommu.c              | 1670 ++++++++++++++++++++++++----
 hw/vfio/iommufd.c                  |   40 +
 backends/trace-events              |    1 +
 hw/i386/trace-events               |   13 +
 9 files changed, 1791 insertions(+), 202 deletions(-)

-- 
2.34.1