Kind ping: are there any more comments?

Thanks
Zhenzhong

>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.d...@intel.com>
>Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>Hi,
>
>For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
>guest page table; instead we pass the stage-1 page table to the host side
>to construct a nested domain. There was some effort to enable this feature
>in the old days, see [1] for details.
>
>The key design is to utilize the dual-stage IOMMU translation (also known
>as IOMMU nested translation) capability of the host IOMMU. As the diagram
>below shows, the guest I/O page table pointer in GPA (guest physical
>address) is passed to the host and used to perform the stage-1 address
>translation. In addition, modifications to present mappings in the guest
>I/O page table should be followed by an IOTLB invalidation.
>
>        .-------------.  .---------------------------.
>        |   vIOMMU    |  | Guest I/O page table      |
>        |             |  '---------------------------'
>        .----------------/
>        | PASID Entry |--- PASID cache flush --+
>        '-------------'                        |
>        |             |                        V
>        |             |           I/O page table pointer in GPA
>        '-------------'
>    Guest
>    ------| Shadow |---------------------------|--------
>          v        v                           v
>    Host
>        .-------------.  .------------------------.
>        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>        |             |  '------------------------'
>        .----------------/  |
>        | PASID Entry |     V (Nested xlate)
>        '----------------\.--------------------------------------.
>        |             |   | Stage2 for GPA->HPA, unmanaged domain|
>        |             |   '--------------------------------------'
>        '-------------'
>For historical reasons, different revisions of the VT-d spec use different
>names, where:
> - Stage1 = First stage = First level = flts
> - Stage2 = Second stage = Second level = slts
><Intel VT-d Nested translation>
>
>This series reuses the VFIO device's default hwpt as the nested parent
>instead of creating a new one.
>This avoids duplicating code for a new memory listener, and all existing
>features of the VFIO listener can be shared, e.g. ram discard, dirty
>tracking, etc. There are two limitations: 1) a VFIO device under a PCI
>bridge together with an emulated device is not supported, because the
>emulated device wants the IOMMU AS while the VFIO device sticks to the
>system AS; 2) kexec or reboot from "intel_iommu=on,sm_on" to
>"intel_iommu=on,sm_off" is not supported, because the VFIO device's default
>hwpt is created with the NEST_PARENT flag and the kernel inhibits RO
>mappings when switching to shadow mode.
>
>This series is also prerequisite work for vSVA, i.e. sharing the guest
>application address space with passthrough devices.
>
>There are some interactions between VFIO and vIOMMU:
>* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
>  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>  instance to vIOMMU at the vfio device realize stage.
>* vIOMMU registers PCIIOMMUOps get_viommu_cap to the PCI subsystem.
>  VFIO calls it to get the capabilities exposed by vIOMMU.
>* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>  to bind/unbind a device to IOMMUFD backed domains, either nested
>  domains or not.
>
>See the diagram below:
>
>        VFIO Device                                 Intel IOMMU
>    .-----------------.                         .-------------------.
>    |                 |                         |                   |
>    |       .---------|PCIIOMMUOps              |.-------------.    |
>    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>    |       | Device  |------------------------>|| Device list |    |
>    |       .---------|(get_viommu_cap)         |.-------------.    |
>    |                 |                         |       |           |
>    |                 |                         |       V           |
>    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>    |       | link    |<------------------------|  |   Device    |  |
>    |       .---------|            (detach_hwpt)|  .-------------.  |
>    |                 |                         |       |           |
>    |                 |                         |       ...         |
>    .-----------------.                         .-------------------.
>
>Below is an example of enabling stage-1 translation for a passthrough
>device:
>
>    -M q35,...
>    -device intel-iommu,x-scalable-mode=on,x-flts=on...
>    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>Tests done:
>- VFIO devices hotplug/unplug
>- different VFIO devices linked to different iommufds
>- vhost net device ping test
>
>PATCH1-6:   Some preparing work
>PATCH7-8:   Compatibility check between vIOMMU and Host IOMMU
>PATCH9-17:  Implement stage-1 page table for passthrough device
>PATCH18-19: Workaround for ERRATA_772415_SPR17
>PATCH20:    Enable stage-1 translation for passthrough device
>
>QEMU code can be found at [2].
>
>Fault reporting isn't supported in this series; we presume the guest
>kernel always constructs a correct stage-1 page table for the passthrough
>device. For emulated devices, the emulation code already provides stage-1
>fault injection.
>
>TODO:
>- Fault report to guest when HW stage1 faults
>
>[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
>[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>
>Thanks
>Zhenzhong
>
>Changelog:
>v4:
>- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald,
>  Shameer)
>- clarify get_viommu_cap() return pure emulated caps and explain reason in
>  commit log (Eric)
>- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked
>  (Eric)
>- refine doc comment and commit log in patch10-11 (Eric)
>
>v3:
>- define enum type for VIOMMU_CAP_* (Eric)
>- drop inline flag in the patch which uses the helper (Eric)
>- use extract64 in new introduced MACRO (Eric)
>- polish comments and fix typo error (Eric)
>- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>- optimize bind/unbind error path processing
>
>v2:
>- introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>  hwpt (Liuyi)
>- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin,
>  Liuyi)
>- abandon support of VFIO device under pcie-to-pci bridge to simplify
>  design (Liuyi)
>- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
>- drop vtd_dev_to_context_entry
>  optimization (Liuyi)
>
>v1:
>- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>- rebase to master
>
>rfcv3:
>- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter
>  (Shameer)
>- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>- simplify return value check of get_cap() (Eric)
>- drop realize_late (Cedric, Eric)
>- split patch13:intel_iommu: Add PASID cache management infrastructure
>  (Eric)
>- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>- refine comments (Eric, Donald)
>
>rfcv2:
>- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
>  rebase
>- add two cleanup patches(patch9-10)
>- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>  iommufd/devid/ioas_id
>- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
>  and iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
>
>Yi Liu (3):
>  intel_iommu: Replay pasid bindings after context cache invalidation
>  intel_iommu: Propagate PASID-based iotlb invalidation to host
>  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>    changed
>
>Zhenzhong Duan (17):
>  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>    vtd_ce_get_pasid_entry
>  hw/pci: Introduce pci_device_get_viommu_cap()
>  intel_iommu: Implement get_viommu_cap() callback
>  vfio/iommufd: Force creating nested parent domain
>  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>  intel_iommu: Check for compatibility with IOMMUFD backed device when
>    x-flts=on
>  intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
>  intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>  intel_iommu: Handle PASID entry removal and update
>  intel_iommu: Handle PASID entry addition
>  intel_iommu: Introduce a new pasid cache invalidation type
>    FORCE_RESET
>  intel_iommu: Stick to system MR for IOMMUFD backed host device when
>    x-fls=on
>  intel_iommu: Bind/unbind guest page table to host
>  vfio: Add a new element bypass_ro in VFIOContainerBase
>  Workaround for ERRATA_772415_SPR17
>  intel_iommu: Enable host device when x-flts=on in scalable mode
>
> MAINTAINERS                           |   1 +
> hw/i386/intel_iommu_internal.h        |  68 +-
> include/hw/i386/intel_iommu.h         |   9 +-
> include/hw/iommu.h                    |  17 +
> include/hw/pci/pci.h                  |  27 +
> include/hw/vfio/vfio-container-base.h |   1 +
> hw/i386/intel_iommu.c                 | 941 +++++++++++++++++++++++++-
> hw/pci/pci.c                          |  23 +-
> hw/vfio/iommufd.c                     |  22 +-
> hw/vfio/listener.c                    |  13 +-
> hw/i386/trace-events                  |   8 +
> 11 files changed, 1088 insertions(+), 42 deletions(-)
> create mode 100644 include/hw/iommu.h
>
>
>base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>--
>2.47.1
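
For readers new to the thread, the nested translation described in the cover letter can be pictured with a small toy model. This is only an illustrative sketch, not QEMU or kernel code: the class name `NestedTranslator`, the dict-based page tables, and the 4 KiB page size are all assumptions made for the example. It shows stage-1 (GIOVA->GPA, owned by the guest) feeding stage-2 (GPA->HPA, owned by the host), and why a change to a present stage-1 mapping needs an IOTLB invalidation before it takes effect.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages, for illustration only
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class NestedTranslator:
    """Hypothetical toy model of dual-stage (nested) IOMMU translation."""

    def __init__(self, stage1, stage2):
        self.stage1 = stage1         # guest I/O page table: GIOVA page -> GPA page
        self.stage2 = stage2         # host-owned table:     GPA page   -> HPA page
        self.iotlb = {}              # cached result:        GIOVA page -> HPA page

    def translate(self, giova):
        page, offset = giova >> PAGE_SHIFT, giova & PAGE_MASK
        if page not in self.iotlb:
            gpa_page = self.stage1[page]               # stage-1 walk
            self.iotlb[page] = self.stage2[gpa_page]   # stage-2 walk (nested)
        return (self.iotlb[page] << PAGE_SHIFT) | offset

    def invalidate(self, giova):
        # Drop the cached translation so the next access re-walks both stages.
        self.iotlb.pop(giova >> PAGE_SHIFT, None)
```

In this model, remapping a GIOVA page in `stage1` after it has been translated once keeps returning the old HPA until `invalidate()` is called for that address. A missing mapping simply raises `KeyError`; fault reporting is not modelled here.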