On 8/21/25 10:50 AM, Yi Liu wrote:
> On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>> Kindly ping, any more comments?
>
> Do you have enough comments for a new version? I plan to have a look at
> either this version or a new version next week. :)
same for me ;-)
Eric
>
> Regards,
> Yi Liu
>
>> Thanks
>> Zhenzhong
>>
>>> -----Original Message-----
>>> From: Duan, Zhenzhong <zhenzhong.d...@intel.com>
>>> Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>>> passthrough device
>>>
>>> Hi,
>>>
>>> For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
>>> guest page table; instead, we pass the stage-1 page table to the host
>>> side to construct a nested domain. There was an earlier effort to enable
>>> this feature; see [1] for details.
>>>
>>> The key design is to utilize the dual-stage IOMMU translation (also known
>>> as IOMMU nested translation) capability in the host IOMMU. As the diagram
>>> below shows, the guest I/O page table pointer in GPA (guest physical
>>> address) is passed to the host and used to perform the stage-1 address
>>> translation. Along with it, modifications to present mappings in the
>>> guest I/O page table should be followed by an IOTLB invalidation.
>>>
>>>         .-------------.  .---------------------------.
>>>         |   vIOMMU    |  | Guest I/O page table      |
>>>         |             |  '---------------------------'
>>>         .----------------/
>>>         | PASID Entry |--- PASID cache flush --+
>>>         '-------------'                        |
>>>         |             |                        V
>>>         |             |           I/O page table pointer in GPA
>>>         '-------------'
>>>     Guest
>>>     ------| Shadow |---------------------------|--------
>>>           v        v                           v
>>>     Host
>>>         .-------------.  .------------------------.
>>>         |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>>>         |             |  '------------------------'
>>>         .----------------/  |
>>>         | PASID Entry |     V (Nested xlate)
>>>         '----------------\.--------------------------------------.
>>>         |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>>         |             |   '--------------------------------------'
>>>         '-------------'
>>> For historical reasons, different revisions of the VT-d spec use
>>> different names, where:
>>> - Stage1 = First stage = First level = flts
>>> - Stage2 = Second stage = Second level = slts
>>> <Intel VT-d Nested translation>
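
For readers following the diagram, here is a minimal sketch of what "nested
translation" means: a GIOVA is first translated by the guest-owned stage-1
table to a GPA, and that GPA is then translated by the host-owned stage-2
table to an HPA. Everything in it (the ToyTable type, the toy_* functions,
the single-level 16-entry table) is a hypothetical simplification and not
the VT-d page-table format or any QEMU code. In particular, real hardware
also translates the stage-1 table pointer and the table pages themselves
through stage 2, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGES      16   /* toy address space: 16 pages of 4 KiB */

/* Hypothetical single-level table: page index -> output page frame. */
typedef struct ToyTable {
    uint64_t pfn[TOY_PAGES];
    bool present[TOY_PAGES];
} ToyTable;

/* Translate one address through one stage; false means "fault". */
bool toy_walk(const ToyTable *t, uint64_t in, uint64_t *out)
{
    uint64_t idx = in >> TOY_PAGE_SHIFT;
    uint64_t offset = in & ((1ULL << TOY_PAGE_SHIFT) - 1);

    if (idx >= TOY_PAGES || !t->present[idx]) {
        return false;
    }
    *out = (t->pfn[idx] << TOY_PAGE_SHIFT) | offset;
    return true;
}

/* Nested translation: stage 1 (GIOVA->GPA), then stage 2 (GPA->HPA). */
bool toy_nested_translate(const ToyTable *stage1, const ToyTable *stage2,
                          uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    return toy_walk(stage1, giova, &gpa) && toy_walk(stage2, gpa, hpa);
}
```

A caller would fill stage1 with the guest's mappings and stage2 with the
host's, then call toy_nested_translate() per access. A failure in either
stage fails the whole translation, which is why missing stage-1 fault
reporting matters for the TODO item in the cover letter.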
>>>
>>> This series reuses the VFIO device's default hwpt as the nested parent
>>> instead of creating a new one. This avoids duplicating code for a new
>>> memory listener, and all existing features of the VFIO listener can be
>>> shared, e.g., ram discard, dirty tracking, etc. Two limitations are:
>>> 1) a VFIO device under a PCI bridge with an emulated device is not
>>> supported, because the emulated device wants the IOMMU AS while the VFIO
>>> device sticks to the system AS; 2) kexec or reboot from
>>> "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported,
>>> because the VFIO device's default hwpt is created with the NEST_PARENT
>>> flag and the kernel inhibits RO mappings when switching to shadow mode.
>>>
>>> This series is also prerequisite work for vSVA, i.e., sharing a guest
>>> application address space with passthrough devices.
>>>
>>> There are some interactions between VFIO and vIOMMU:
>>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>> subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>> instance to vIOMMU at vfio device realize stage.
>>> * vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>>> VFIO calls it to get vIOMMU exposed capabilities.
>>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>> to bind/unbind device to IOMMUFD backed domains, either nested
>>> domain or not.
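
The first two interactions in the list above can be sketched as a small
callback table. This is only an illustration under stated assumptions: the
Toy* types, the toy_* functions and the capability bit value are all
hypothetical, and the real PCIIOMMUOps callbacks in QEMU take different
arguments (bus, Error **, and so on). Only the callback names
set_iommu_device, unset_iommu_device and get_viommu_cap and the HW_NESTED
capability name come from the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DEVFN 256

/* Hypothetical stand-in for HostIOMMUDevice. */
typedef struct ToyHostIOMMUDevice { int id; } ToyHostIOMMUDevice;

/* Capability name mirrors VIOMMU_CAP_HW_NESTED; the bit value is assumed. */
enum { TOY_VIOMMU_CAP_HW_NESTED = 1u << 0 };

/* Hypothetical, simplified analogue of PCIIOMMUOps. */
typedef struct ToyIOMMUOps {
    bool (*set_iommu_device)(void *opaque, int devfn, ToyHostIOMMUDevice *hiod);
    void (*unset_iommu_device)(void *opaque, int devfn);
    uint64_t (*get_viommu_cap)(void *opaque);
} ToyIOMMUOps;

/* Toy vIOMMU state: the x-flts setting plus a per-devfn host device list. */
typedef struct ToyVIOMMU {
    bool flts;
    ToyHostIOMMUDevice *hiods[TOY_MAX_DEVFN];
} ToyVIOMMU;

bool toy_set_iommu_device(void *opaque, int devfn, ToyHostIOMMUDevice *hiod)
{
    ToyVIOMMU *s = opaque;

    /* Reject out-of-range slots and double registration. */
    if (devfn < 0 || devfn >= TOY_MAX_DEVFN || s->hiods[devfn]) {
        return false;
    }
    s->hiods[devfn] = hiod;
    return true;
}

void toy_unset_iommu_device(void *opaque, int devfn)
{
    ToyVIOMMU *s = opaque;

    if (devfn >= 0 && devfn < TOY_MAX_DEVFN) {
        s->hiods[devfn] = NULL;
    }
}

uint64_t toy_get_viommu_cap(void *opaque)
{
    ToyVIOMMU *s = opaque;

    /* Report only what the emulated vIOMMU exposes, not host capabilities. */
    return s->flts ? TOY_VIOMMU_CAP_HW_NESTED : 0;
}

const ToyIOMMUOps toy_viommu_ops = {
    .set_iommu_device = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
    .get_viommu_cap = toy_get_viommu_cap,
};

/* VFIO side at realize: query caps, then register the host device. */
bool toy_vfio_realize(const ToyIOMMUOps *ops, void *viommu, int devfn,
                      ToyHostIOMMUDevice *hiod, bool *want_nest_parent)
{
    uint64_t caps = ops->get_viommu_cap(viommu);

    *want_nest_parent = (caps & TOY_VIOMMU_CAP_HW_NESTED) != 0;
    return ops->set_iommu_device(viommu, devfn, hiod);
}
```

The want_nest_parent output stands in for the decision, described in the v2
changelog, to create the default hwpt as a nested parent when the vIOMMU
reports the capability. The third interaction ([at|de]tach_hwpt) goes in the
opposite direction, from vIOMMU to the host device, and is not modeled here.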
>>>
>>> See below diagram:
>>>
>>>       VFIO Device                                 Intel IOMMU
>>>   .-----------------.                         .-------------------.
>>>   |                 |                         |                   |
>>>   |       .---------|PCIIOMMUOps              |.-------------.    |
>>>   |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>>>   |       | Device  |------------------------>|| Device list |    |
>>>   |       .---------|(get_viommu_cap)         |.-------------.    |
>>>   |                 |                         |       |           |
>>>   |                 |                         |       V           |
>>>   |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>>   |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>>   |       | link    |<------------------------|  |   Device    |  |
>>>   |       .---------|            (detach_hwpt)|  .-------------.  |
>>>   |                 |                         |       |           |
>>>   |                 |                         |       ...         |
>>>   .-----------------.                         .-------------------.
>>>
>>> Below is an example of enabling stage-1 translation for a passthrough
>>> device:
>>>
>>> -M q35,...
>>> -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>> -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>>
>>> Test done:
>>> - VFIO devices hotplug/unplug
>>> - different VFIO devices linked to different iommufds
>>> - vhost net device ping test
>>>
>>> PATCH1-6: Some preparing work
>>> PATCH7-8: Compatibility check between vIOMMU and Host IOMMU
>>> PATCH9-17: Implement stage-1 page table for passthrough device
>>> PATCH18-19: Workaround for ERRATA_772415_SPR17
>>> PATCH20: Enable stage-1 translation for passthrough device
>>>
>>> QEMU code can be found at [2]
>>>
>>> Fault reporting isn't supported in this series; we presume the guest
>>> kernel always constructs a correct stage-1 page table for a passthrough
>>> device. For emulated devices, the emulation code already provides
>>> stage-1 fault injection.
>>>
>>> TODO:
>>> - Fault report to guest when HW stage1 faults
>>>
>>> [1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
>>> [2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>>>
>>> Thanks
>>> Zhenzhong
>>>
>>> Changelog:
>>> v4:
>>> - s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
>>> - clarify get_viommu_cap() return pure emulated caps and explain reason in
>>>   commit log (Eric)
>>> - retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
>>> - refine doc comment and commit log in patch10-11 (Eric)
>>>
>>> v3:
>>> - define enum type for VIOMMU_CAP_* (Eric)
>>> - drop inline flag in the patch which uses the helper (Eric)
>>> - use extract64 in new introduced MACRO (Eric)
>>> - polish comments and fix typo error (Eric)
>>> - split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>>> - optimize bind/unbind error path processing
>>>
>>> v2:
>>> - introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>>>   hwpt (Liuyi)
>>> - reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin,
>>>   Liuyi)
>>> - abandon support of VFIO device under pcie-to-pci bridge to simplify design
>>>   (Liuyi)
>>> - bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
>>> - drop vtd_dev_to_context_entry optimization (Liuyi)
>>>
>>> v1:
>>> - simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>>> - rebase to master
>>>
>>> rfcv3:
>>> - s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
>>> - hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>>> - simplify return value check of get_cap() (Eric)
>>> - drop realize_late (Cedric, Eric)
>>> - split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
>>> - s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>>> - s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>>> - refine comments (Eric, Donald)
>>>
>>> rfcv2:
>>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
>>> - add two cleanup patches(patch9-10)
>>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>>>   iommufd/devid/ioas_id
>>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>>>   iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>>
>>>
>>> Yi Liu (3):
>>>   intel_iommu: Replay pasid bindings after context cache invalidation
>>>   intel_iommu: Propagate PASID-based iotlb invalidation to host
>>>   intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>>>     changed
>>>
>>> Zhenzhong Duan (17):
>>>   intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>>>     vtd_ce_get_pasid_entry
>>>   hw/pci: Introduce pci_device_get_viommu_cap()
>>>   intel_iommu: Implement get_viommu_cap() callback
>>>   vfio/iommufd: Force creating nested parent domain
>>>   hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>>>   intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>>>   intel_iommu: Check for compatibility with IOMMUFD backed device when
>>>     x-flts=on
>>>   intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
>>>   intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>>>   intel_iommu: Handle PASID entry removal and update
>>>   intel_iommu: Handle PASID entry addition
>>>   intel_iommu: Introduce a new pasid cache invalidation type
>>>     FORCE_RESET
>>>   intel_iommu: Stick to system MR for IOMMUFD backed host device when
>>>     x-fls=on
>>>   intel_iommu: Bind/unbind guest page table to host
>>>   vfio: Add a new element bypass_ro in VFIOContainerBase
>>>   Workaround for ERRATA_772415_SPR17
>>>   intel_iommu: Enable host device when x-flts=on in scalable mode
>>>
>>>  MAINTAINERS                           |   1 +
>>>  hw/i386/intel_iommu_internal.h        |  68 +-
>>>  include/hw/i386/intel_iommu.h         |   9 +-
>>>  include/hw/iommu.h                    |  17 +
>>>  include/hw/pci/pci.h                  |  27 +
>>>  include/hw/vfio/vfio-container-base.h |   1 +
>>>  hw/i386/intel_iommu.c                 | 941 +++++++++++++++++++++++++-
>>>  hw/pci/pci.c                          |  23 +-
>>>  hw/vfio/iommufd.c                     |  22 +-
>>>  hw/vfio/listener.c                    |  13 +-
>>>  hw/i386/trace-events                  |   8 +
>>>  11 files changed, 1088 insertions(+), 42 deletions(-)
>>>  create mode 100644 include/hw/iommu.h
>>>
>>>
>>> base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>>> --
>>> 2.47.1
>>
>