On 8/21/25 10:50 AM, Yi Liu wrote:
> On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>> Kindly ping, any more comments?
>
> Do you have enough comments for a new version? I plan to have a look at
> either this version or a new version next week. :)
same for me ;-)

Eric
>
> Regards,
> Yi Liu
>
>> Thanks
>> Zhenzhong
>>
>>> -----Original Message-----
>>> From: Duan, Zhenzhong <zhenzhong.d...@intel.com>
>>> Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>>> passthrough device
>>>
>>> Hi,
>>>
>>> For a passthrough device with intel_iommu.x-flts=on, we don't shadow
>>> the guest page table but pass the stage-1 page table to the host side
>>> to construct a nested domain. There was an earlier effort to enable
>>> this feature, see [1] for details.
>>>
>>> The key design is to utilize the dual-stage IOMMU translation (also
>>> known as IOMMU nested translation) capability in the host IOMMU. As the
>>> diagram below shows, the guest I/O page table pointer in GPA (guest
>>> physical address) is passed to the host and used to perform the stage-1
>>> address translation. Along with it, modifications to present mappings
>>> in the guest I/O page table should be followed by an IOTLB invalidation.
>>>
>>>         .-------------.  .---------------------------.
>>>         |   vIOMMU    |  | Guest I/O page table      |
>>>         |             |  '---------------------------'
>>>         .----------------/
>>>         | PASID Entry |--- PASID cache flush --+
>>>         '-------------'                        |
>>>         |             |                        V
>>>         |             |           I/O page table pointer in GPA
>>>         '-------------'
>>>     Guest
>>>     ------| Shadow |---------------------------|--------
>>>           v        v                           v
>>>     Host
>>>         .-------------.  .------------------------.
>>>         |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>>>         |             |  '------------------------'
>>>         .----------------/  |
>>>         | PASID Entry |     V (Nested xlate)
>>>         '----------------\.--------------------------------------.
>>>         |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>>         |             |   '--------------------------------------'
>>>         '-------------'
>>> For historical reasons, different VT-d spec revisions use different
>>> names, where:
>>> - Stage1 = First stage = First level = flts
>>> - Stage2 = Second stage = Second level = slts
>>> <Intel VT-d Nested translation>
>>>
>>> This series reuses the VFIO device's default hwpt as the nested parent
>>> instead of creating a new one. This avoids duplicating code for a new
>>> memory listener, and all existing features of the VFIO listener can be
>>> shared, e.g., ram discard, dirty tracking, etc. Two limitations are:
>>> 1) a VFIO device under a PCI bridge together with an emulated device is
>>> not supported, because the emulated device wants the IOMMU AS while the
>>> VFIO device sticks to the system AS; 2) kexec or reboot from
>>> "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported,
>>> because the VFIO device's default hwpt is created with the NEST_PARENT
>>> flag and the kernel inhibits RO mappings when switching to shadow mode.
>>>
>>> This series is also prerequisite work for vSVA, i.e., sharing a guest
>>> application address space with passthrough devices.
>>>
>>> There are some interactions between VFIO and vIOMMU:
>>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>>   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>>   instance to vIOMMU at vfio device realize stage.
>>> * vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>>>   VFIO calls it to get vIOMMU exposed capabilities.
>>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>>   to bind/unbind device to IOMMUFD backed domains, either nested
>>>   domain or not.
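
The registration handshake in the list above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from the series: `IommuOps`, `HostDev`, `VIommu`, `vfio_realize` and the `viommu_*` functions are hypothetical stand-ins, the real PCIIOMMUOps callbacks take different arguments, the value of `VIOMMU_CAP_HW_NESTED` is assumed, and the order of the two calls is assumed as well. The attach_hwpt/detach_hwpt direction is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Name taken from the v4 changelog; the bit value is an assumption. */
#define VIOMMU_CAP_HW_NESTED (1ULL << 0)
#define MAX_DEVFN 256

/* Hypothetical stand-in for HostIOMMUDevice. */
typedef struct HostDev {
    int devid;
} HostDev;

/* Hypothetical, simplified stand-in for PCIIOMMUOps. */
typedef struct IommuOps {
    bool (*set_iommu_device)(void *opaque, int devfn, HostDev *hiod);
    void (*unset_iommu_device)(void *opaque, int devfn);
    uint64_t (*get_viommu_cap)(void *opaque);
} IommuOps;

/* Toy vIOMMU state: the x-flts knob plus a per-devfn host device table. */
typedef struct VIommu {
    bool flts;
    HostDev *hiods[MAX_DEVFN];
} VIommu;

static bool viommu_set_dev(void *opaque, int devfn, HostDev *hiod)
{
    VIommu *s = (VIommu *)opaque;

    if (devfn < 0 || devfn >= MAX_DEVFN || !hiod || s->hiods[devfn]) {
        return false; /* bad slot, or a device is already registered */
    }
    s->hiods[devfn] = hiod;
    return true;
}

static void viommu_unset_dev(void *opaque, int devfn)
{
    VIommu *s = (VIommu *)opaque;

    if (devfn >= 0 && devfn < MAX_DEVFN) {
        s->hiods[devfn] = NULL;
    }
}

static uint64_t viommu_get_cap(void *opaque)
{
    VIommu *s = (VIommu *)opaque;

    /* Report only what the emulated vIOMMU exposes. */
    return s->flts ? VIOMMU_CAP_HW_NESTED : 0;
}

static const IommuOps viommu_ops = {
    .set_iommu_device = viommu_set_dev,
    .unset_iommu_device = viommu_unset_dev,
    .get_viommu_cap = viommu_get_cap,
};

/* VFIO side at realize time: query caps, then register the host device. */
static bool vfio_realize(const IommuOps *ops, void *viommu, int devfn,
                         HostDev *hiod, bool *want_nest_parent)
{
    assert(ops && want_nest_parent);
    *want_nest_parent =
        (ops->get_viommu_cap(viommu) & VIOMMU_CAP_HW_NESTED) != 0;
    return ops->set_iommu_device(viommu, devfn, hiod);
}
```

In this sketch a caller would pass `&viommu_ops` and a `VIommu` with `flts` set, then use `want_nest_parent` to decide whether the default hwpt needs the nested-parent flag. Registering a second device on the same devfn fails until `unset_iommu_device` is called.
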
>>>
>>> See below diagram:
>>>
>>>         VFIO Device                                 Intel IOMMU
>>>     .-----------------.                         .-------------------.
>>>     |                 |                         |                   |
>>>     |       .---------|PCIIOMMUOps              |.-------------.    |
>>>     |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>>>     |       | Device  |------------------------>|| Device list |    |
>>>     |       .---------|(get_viommu_cap)         |.-------------.    |
>>>     |                 |                         |       |           |
>>>     |                 |                         |       V           |
>>>     |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>>     |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>>     |       | link    |<------------------------|  |   Device    |  |
>>>     |       .---------|            (detach_hwpt)|  .-------------.  |
>>>     |                 |                         |       |           |
>>>     |                 |                         |       ...         |
>>>     .-----------------.                         .-------------------.
>>>
>>> Below is an example to enable stage-1 translation for passthrough
>>> device:
>>>
>>>     -M q35,...
>>>     -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>>     -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>>
>>> Test done:
>>> - VFIO devices hotplug/unplug
>>> - different VFIO devices linked to different iommufds
>>> - vhost net device ping test
>>>
>>> PATCH1-6:  Some preparing work
>>> PATCH7-8:  Compatibility check between vIOMMU and Host IOMMU
>>> PATCH9-17: Implement stage-1 page table for passthrough device
>>> PATCH18-19:Workaround for ERRATA_772415_SPR17
>>> PATCH20:   Enable stage-1 translation for passthrough device
>>>
>>> Qemu code can be found at [2]
>>>
>>> Fault reporting isn't supported in this series; we presume the guest
>>> kernel always constructs a correct stage-1 page table for the
>>> passthrough device. For emulated devices, the emulation code already
>>> provides stage-1 fault injection.
>>>
>>> TODO:
>>> - Fault report to guest when HW stage1 faults
>>>
>>> [1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
>>> [2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>>>
>>> Thanks
>>> Zhenzhong
>>>
>>> Changelog:
>>> v4:
>>> - s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald,
>>>   Shameer)
>>> - clarify get_viommu_cap() return pure emulated caps and explain reason
>>>   in commit log (Eric)
>>> - retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked
>>>   (Eric)
>>> - refine doc comment and commit log in patch10-11 (Eric)
>>>
>>> v3:
>>> - define enum type for VIOMMU_CAP_* (Eric)
>>> - drop inline flag in the patch which uses the helper (Eric)
>>> - use extract64 in new introduced MACRO (Eric)
>>> - polish comments and fix typo error (Eric)
>>> - split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>>> - optimize bind/unbind error path processing
>>>
>>> v2:
>>> - introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>>>   hwpt (Liuyi)
>>> - reuse VFIO's default hwpt as parent hwpt of nested translation
>>>   (Nicolin, Liuyi)
>>> - abandon support of VFIO device under pcie-to-pci bridge to simplify
>>>   design (Liuyi)
>>> - bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17
>>>   (Liuyi)
>>> - drop vtd_dev_to_context_entry optimization (Liuyi)
>>>
>>> v1:
>>> - simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>>> - rebase to master
>>>
>>> rfcv3:
>>> - s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter
>>>   (Shameer)
>>> - hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>>> - simplify return value check of get_cap() (Eric)
>>> - drop realize_late (Cedric, Eric)
>>> - split patch13:intel_iommu: Add PASID cache management infrastructure
>>>   (Eric)
>>> - s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>>> - s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>>> - refine comments (Eric, Donald)
>>>
>>> rfcv2:
>>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
>>>   rebase
>>> - add two cleanup patches(patch9-10)
>>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>>>   iommufd/devid/ioas_id
>>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
>>>   and iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>>
>>>
>>> Yi Liu (3):
>>>   intel_iommu: Replay pasid bindings after context cache invalidation
>>>   intel_iommu: Propagate PASID-based iotlb invalidation to host
>>>   intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>>>     changed
>>>
>>> Zhenzhong Duan (17):
>>>   intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>>>     vtd_ce_get_pasid_entry
>>>   hw/pci: Introduce pci_device_get_viommu_cap()
>>>   intel_iommu: Implement get_viommu_cap() callback
>>>   vfio/iommufd: Force creating nested parent domain
>>>   hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>>>   intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>>>   intel_iommu: Check for compatibility with IOMMUFD backed device when
>>>     x-flts=on
>>>   intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
>>>   intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>>>   intel_iommu: Handle PASID entry removal and update
>>>   intel_iommu: Handle PASID entry addition
>>>   intel_iommu: Introduce a new pasid cache invalidation type
>>>     FORCE_RESET
>>>   intel_iommu: Stick to system MR for IOMMUFD backed host device when
>>>     x-fls=on
>>>   intel_iommu: Bind/unbind guest page table to host
>>>   vfio: Add a new element bypass_ro in VFIOContainerBase
>>>   Workaround for ERRATA_772415_SPR17
>>>   intel_iommu: Enable host device when x-flts=on in scalable mode
>>>
>>> MAINTAINERS                           |   1 +
>>> hw/i386/intel_iommu_internal.h        |  68 +-
>>> include/hw/i386/intel_iommu.h         |   9 +-
>>> include/hw/iommu.h                    |  17 +
>>> include/hw/pci/pci.h                  |  27 +
>>> include/hw/vfio/vfio-container-base.h |   1 +
>>> hw/i386/intel_iommu.c                 | 941 +++++++++++++++++++++++++-
>>> hw/pci/pci.c                          |  23 +-
>>> hw/vfio/iommufd.c                     |  22 +-
>>> hw/vfio/listener.c                    |  13 +-
>>> hw/i386/trace-events                  |   8 +
>>> 11 files changed, 1088 insertions(+), 42 deletions(-)
>>> create mode 100644 include/hw/iommu.h
>>>
>>>
>>> base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>>> -- 
>>> 2.47.1
>>
>

