Hi,

For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table. Instead, we pass the first stage page table to the host
side to construct a nested HWPT. There was an earlier effort to enable
this feature, see [1] for details.

The key design is to utilize the dual-stage IOMMU translation (also known
as IOMMU nested translation) capability of the host IOMMU. As the diagram
below shows, the guest I/O page table pointer in GPA (guest physical
address) is passed to the host and is used to perform the first stage
address translation. Along with it, modifications to present mappings in
the guest I/O page table should be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .-----------------------------.
        |   pIOMMU    |  | First stage for GIOVA->GPA  |
        |             |  '-----------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------------.
        |             |   | Second stage for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------------'
        '-------------'
    <Intel VT-d Nested translation>

This series reuses the VFIO device's default HWPT as the nesting parent
instead of creating a new one. This avoids duplicating code for a new
memory listener, and all existing features of the VFIO listener can be
shared, e.g., ram discard, dirty tracking, etc. Two limitations are:
1) a VFIO device under a PCI bridge together with an emulated device is
   not supported, because the emulated device wants the IOMMU AS while
   the VFIO device sticks to the system AS;
2) kexec or reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off"
   is not supported on platforms with ERRATA_772415_SPR17, because the
   VFIO device's default HWPT is created with the NEST_PARENT flag and
   the kernel inhibits RO mappings when switching to shadow mode.

This series is also prerequisite work for vSVA, i.e. sharing a guest
application address space with passthrough devices.

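For readers new to nested translation, the two-stage walk described above
can be pictured with a toy model. This is only a sketch under stated
assumptions: every name (ToyStage, toy_nested_translate, ...) is
hypothetical and not part of QEMU or the kernel, a flat array stands in
for a real page table, and the second-stage translation that real
hardware applies to the first-stage table pointers themselves is not
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

/* One page-granular mapping; hypothetical stand-in for a page table entry. */
typedef struct ToyMapping {
    uint64_t in_pfn;   /* input page frame number */
    uint64_t out_pfn;  /* output page frame number */
} ToyMapping;

/* One translation stage, modelled as a flat list of mappings. */
typedef struct ToyStage {
    const ToyMapping *maps;
    size_t nr_maps;
} ToyStage;

/* Translate through a single stage; returning false models a fault. */
static bool toy_stage_translate(const ToyStage *stage, uint64_t in,
                                uint64_t *out)
{
    uint64_t pfn = in >> TOY_PAGE_SHIFT;

    assert(stage != NULL && out != NULL);
    for (size_t i = 0; i < stage->nr_maps; i++) {
        if (stage->maps[i].in_pfn == pfn) {
            /* Keep the page offset, swap the frame number. */
            *out = (stage->maps[i].out_pfn << TOY_PAGE_SHIFT) |
                   (in & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}

/*
 * Nested translation: the first stage (owned by the guest) maps
 * GIOVA->GPA, the second stage (owned by the host) maps GPA->HPA.
 */
static bool toy_nested_translate(const ToyStage *first,
                                 const ToyStage *second,
                                 uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    if (!toy_stage_translate(first, giova, &gpa)) {
        return false;   /* first stage fault */
    }
    return toy_stage_translate(second, gpa, hpa);
}
```

In this model the guest only ever edits the first stage list, which is
why no shadowing is needed; the host keeps ownership of the second stage.
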
There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance to vIOMMU at the vfio device realize stage.
* vIOMMU registers PCIIOMMUOps get_viommu_flags to the PCI subsystem.
  VFIO calls it to get the flags exposed by vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt to
  bind/unbind a device to IOMMUFD backed domains, either nested or not.

See the diagram below:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(get_viommu_flags)       |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.

Below is an example to enable first stage translation for a passthrough
device:

    -M q35,...
    -device intel-iommu,x-scalable-mode=on,x-flts=on...
    -object iommufd,id=iommufd0
    -device vfio-pci,iommufd=iommufd0,...

Test done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

PATCH01-09: Some preparation work
PATCH10-11: Compatibility check between vIOMMU and Host IOMMU
PATCH12-17: Implement first stage page table for passthrough device
PATCH18-20: Workaround for ERRATA_772415_SPR17
PATCH21:    Enable first stage translation for passthrough device
PATCH22:    Add doc

QEMU code can be found at [2].

Fault event injection to the guest isn't supported in this series; we
presume the guest kernel always constructs a correct first stage page
table for a passthrough device. For emulated devices, the emulation code
already provides first stage fault injection.

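The callback handshake listed above can be sketched roughly as follows.
This is a simplified illustration, not the series' code: all types and
names (ToyIOMMUOps, ToyVIOMMU, TOY_VIOMMU_FLAG_WANT_NESTING, ...) are
hypothetical, the real PCIIOMMUOps callbacks take different arguments,
and the order in which VFIO queries the flags and registers the device is
an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: the vIOMMU asks for a nesting-parent HWPT. */
#define TOY_VIOMMU_FLAG_WANT_NESTING (UINT64_C(1) << 0)

/* Hypothetical stand-in for a host IOMMU device handle. */
typedef struct ToyHostIOMMUDevice {
    int devid;
} ToyHostIOMMUDevice;

/* Hypothetical callback table registered by the vIOMMU. */
typedef struct ToyIOMMUOps {
    bool (*set_iommu_device)(void *opaque, ToyHostIOMMUDevice *hiod);
    void (*unset_iommu_device)(void *opaque);
    uint64_t (*get_viommu_flags)(void *opaque);
} ToyIOMMUOps;

/* Hypothetical vIOMMU state: one option bit and one registered device. */
typedef struct ToyVIOMMU {
    bool flts;                  /* models x-flts=on */
    ToyHostIOMMUDevice *hiod;   /* registered host device, if any */
} ToyVIOMMU;

static bool toy_set_iommu_device(void *opaque, ToyHostIOMMUDevice *hiod)
{
    ToyVIOMMU *s = opaque;

    if (s->hiod) {
        return false;           /* already registered */
    }
    s->hiod = hiod;
    return true;
}

static void toy_unset_iommu_device(void *opaque)
{
    ToyVIOMMU *s = opaque;

    s->hiod = NULL;
}

static uint64_t toy_get_viommu_flags(void *opaque)
{
    ToyVIOMMU *s = opaque;

    return s->flts ? TOY_VIOMMU_FLAG_WANT_NESTING : 0;
}

static const ToyIOMMUOps toy_vtd_ops = {
    .set_iommu_device = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
    .get_viommu_flags = toy_get_viommu_flags,
};

/*
 * VFIO side: query the flags to decide whether the default HWPT should
 * be a nesting parent, then register the host device with the vIOMMU.
 */
static bool toy_vfio_realize(const ToyIOMMUOps *ops, void *opaque,
                             ToyHostIOMMUDevice *hiod, bool *nest_parent)
{
    uint64_t flags = ops->get_viommu_flags ?
                     ops->get_viommu_flags(opaque) : 0;

    *nest_parent = (flags & TOY_VIOMMU_FLAG_WANT_NESTING) != 0;
    return ops->set_iommu_device(opaque, hiod);
}
```

The attach_hwpt/detach_hwpt direction (vIOMMU calling back into the
IOMMUFD link) is left out to keep the sketch short.
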
TODO:
- Fault event injection to guest when HW first stage page table faults

[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v6

Thanks
Zhenzhong

Changelog:
v6:
- delete RPS capability related supporting code (Eric, Yi)
- use terminology 'first/second stage' to replace 'first/second level'
  (Eric, Yi)
- use get_viommu_flags() instead of get_viommu_caps() (Nicolin)
- drop non-RID_PASID related code and simplify pasid invalidation
  handling (Eric, Yi)
- drop the patch that handles pasid replay on context invalidation (Eric)
- move vendor specific cap check from VFIO core to backend/iommufd.c
  (Nicolin)

v5:
- refine commit log of patch2 (Cedric, Nicolin)
- introduce helper vfio_pci_from_vfio_device() (Cedric)
- introduce helper vfio_device_viommu_get_nested() (Cedric)
- pass 'bool bypass_ro' argument to vfio_listener_valid_section() instead
  of 'VFIOContainerBase *' (Cedric)
- fix a potential build error reported by Jim Shu

v4:
- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
- clarify that get_viommu_cap() returns pure emulated caps and explain
  the reason in commit log (Eric)
- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked
  (Eric)
- refine doc comment and commit log in patch10-11 (Eric)

v3:
- define enum type for VIOMMU_CAP_* (Eric)
- drop inline flag in the patch which uses the helper (Eric)
- use extract64 in newly introduced MACRO (Eric)
- polish comments and fix typos (Eric)
- split workaround patch for ERRATA_772415_SPR17 into two patches (Eric)
- optimize bind/unbind error path processing

v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nesting parent
  HWPT (Liuyi)
- reuse VFIO's default HWPT as parent HWPT of nested translation
  (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify
  design (Liuyi)
- bypass RO mapping in VFIO's default HWPT if ERRATA_772415_SPR17
  (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)

v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master

rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure
  (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
  rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
  iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
  and iommu pasid, this is important for dropping VTDPASIDAddressSpace

Yi Liu (2):
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
    changed

Zhenzhong Duan (20):
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Delete RPS capability related supporting code
  intel_iommu: Update terminology to match VTD spec
  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  hw/pci: Introduce pci_device_get_viommu_flags()
  intel_iommu: Implement get_viommu_flags() callback
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  vfio/iommufd: Force creating nesting parent HWPT
  intel_iommu: Stick to system MR for IOMMUFD backed host device when
    x-fls=on
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  intel_iommu: Handle PASID cache invalidation
  intel_iommu: Reset pasid cache when system level reset
  intel_iommu: Add some macros and
    inline functions
  intel_iommu: Bind/unbind guest page table to host
  iommufd: Introduce a helper function to extract vendor capabilities
  vfio: Add a new element bypass_ro in VFIOContainerBase
  Workaround for ERRATA_772415_SPR17
  intel_iommu: Enable host device when x-flts=on in scalable mode
  docs/devel: Add IOMMUFD nesting documentation

 MAINTAINERS                           |   1 +
 docs/devel/vfio-iommufd.rst           |  24 +
 hw/i386/intel_iommu_internal.h        | 100 ++-
 include/hw/i386/intel_iommu.h         |  11 +-
 include/hw/iommu.h                    |  24 +
 include/hw/pci/pci.h                  |  29 +
 include/hw/vfio/vfio-container-base.h |   1 +
 include/hw/vfio/vfio-device.h         |   2 +
 include/system/host_iommu_device.h    |  16 +
 backends/iommufd.c                    |  13 +
 hw/i386/intel_iommu.c                 | 848 ++++++++++++++++++++------
 hw/pci/pci.c                          |  23 +-
 hw/vfio/device.c                      |  12 +
 hw/vfio/iommufd.c                     |  19 +-
 hw/vfio/listener.c                    |  21 +-
 tests/qtest/intel-iommu-test.c        |   4 +-
 hw/i386/trace-events                  |   7 +
 17 files changed, 927 insertions(+), 228 deletions(-)
 create mode 100644 include/hw/iommu.h

-- 
2.47.1