Hi All,

Gentle ping; comments are welcome😊
The following patches still lack an R-b: #2, #9-#11, #13-#14, #19-#21.

Hi Eric, Yi, could you let me know whether your previous R-b on #9-#11 and
#13-#14 still stands?
I dropped them because the code moved into intel_iommu_accel.c.

Thanks
Zhenzhong

>-----Original Message-----
>From: Duan, Zhenzhong <[email protected]>
>Subject: [PATCH v8 00/23] intel_iommu: Enable first stage translation for
>passthrough device
>
>Hi,
>
>For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
>guest page table; instead we pass the first stage page table to the host to
>construct a nested HWPT. There was an earlier effort to enable this feature,
>see [1] for details.
>
>The key design is to utilize the dual-stage IOMMU translation (also known as
>IOMMU nested translation) capability of the host IOMMU. As the diagram below
>shows, the guest I/O page table pointer in GPA (guest physical address) is
>passed to the host and used to perform the first stage address translation.
>In addition, modifications to present mappings in the guest I/O page table
>must be followed by an IOTLB invalidation.
>
>        .-------------.  .---------------------------.
>        |   vIOMMU    |  | Guest I/O page table      |
>        |             |  '---------------------------'
>        .----------------/
>        | PASID Entry |--- PASID cache flush --+
>        '-------------'                        |
>        |             |                        V
>        |             |           I/O page table pointer in GPA
>        '-------------'
>    Guest
>    ------| Shadow |---------------------------|--------
>          v        v                           v
>    Host
>        .-------------.  .-----------------------------.
>        |   pIOMMU    |  | First stage for GIOVA->GPA  |
>        |             |  '-----------------------------'
>        .----------------/  |
>        | PASID Entry |     V (Nested xlate)
>        '----------------\.--------------------------------------------.
>        |             |   | Second stage for GPA->HPA, unmanaged domain|
>        |             |   '--------------------------------------------'
>        '-------------'
><Intel VT-d Nested translation>
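
The nested translation in the diagram above can be sketched in a few lines of
C. This is only a toy model under stated assumptions: ToyStage, toy_stage_walk()
and toy_nested_translate() are hypothetical names, and each stage is a flat
4-entry table rather than a real VT-d page table. It only shows that the first
stage output (GPA) is the second stage input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_NPAGES     4
#define TOY_UNMAPPED   UINT64_MAX

/* Hypothetical single-level table: input page number -> output page base. */
typedef struct ToyStage {
    uint64_t page_base[TOY_NPAGES];
} ToyStage;

/* Walk one stage; returns false when the input page is not mapped. */
static bool toy_stage_walk(const ToyStage *stage, uint64_t in, uint64_t *out)
{
    uint64_t pfn = in >> TOY_PAGE_SHIFT;

    if (pfn >= TOY_NPAGES || stage->page_base[pfn] == TOY_UNMAPPED) {
        return false;
    }
    *out = stage->page_base[pfn] | (in & TOY_PAGE_MASK);
    return true;
}

/* Nested translation: first stage (GIOVA->GPA), then second stage (GPA->HPA). */
static bool toy_nested_translate(const ToyStage *first_stage,
                                 const ToyStage *second_stage,
                                 uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    return toy_stage_walk(first_stage, giova, &gpa) &&
           toy_stage_walk(second_stage, gpa, hpa);
}
```

In this model the guest owns first_stage and the host owns second_stage; a
translation fails if either stage has no mapping for its input page.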
>
>This series reuses the VFIO device's default HWPT as the nesting parent
>instead of creating a new one. This avoids duplicating code for a new memory
>listener, and all existing features of the VFIO listener can be shared, e.g.,
>ram discard, dirty tracking, etc. There are two limitations: 1) a VFIO device
>under a PCI bridge together with an emulated device is not supported, because
>the emulated device wants the IOMMU AS while the VFIO device sticks to the
>system AS; 2) kexec or reboot from "intel_iommu=on,sm_on" to
>"intel_iommu=on,sm_off" is not supported on platforms with
>ERRATA_772415_SPR17, because the VFIO device's default HWPT is created with
>the NEST_PARENT flag and the kernel inhibits RO mappings when switching to
>shadow mode.
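
As a rough illustration of the read-only handling tied to ERRATA_772415_SPR17
(the v5 changelog mentions passing a 'bool bypass_ro' argument to
vfio_listener_valid_section()), the decision could look like the sketch below.
ToySection and toy_section_should_map() are hypothetical stand-ins, not the
actual listener code, and the real check covers more cases than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced view of a memory section. */
typedef struct ToySection {
    bool is_ram;
    bool readonly;
} ToySection;

/*
 * Decide whether a section should be mapped into the default HWPT.
 * With bypass_ro set (assumed to reflect the erratum quirk), read-only
 * sections are skipped instead of being mapped.
 */
static bool toy_section_should_map(const ToySection *section, bool bypass_ro)
{
    if (!section->is_ram) {
        return false;
    }
    if (section->readonly && bypass_ro) {
        return false;
    }
    return true;
}
```

Without the quirk, read-only RAM sections are mapped as usual; only the
bypass_ro case drops them.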
>
>This series is also a prerequisite for vSVA, i.e., sharing guest
>application address space with passthrough devices.
>
>There are some interactions between VFIO and vIOMMU:
>* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
>  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>  instance with the vIOMMU at the VFIO device realize stage.
>* vIOMMU registers PCIIOMMUOps get_viommu_flags with the PCI subsystem.
>  VFIO calls it to get the flags exposed by the vIOMMU.
>* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>  to bind/unbind the device to IOMMUFD backed domains, either nested
>  domains or not.
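
The three interactions listed above can be modelled roughly as follows. All
Toy* names, the flag bit and the simplified callback signatures are assumptions
made for illustration; they are not the real PCIIOMMUOps or
HostIOMMUDeviceIOMMUFD definitions, which take additional arguments (bus,
Error **, and so on).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DEVFN 256
/* Hypothetical flag bit; not the real QEMU definition. */
#define TOY_VIOMMU_FLAG_WANT_NESTING_PARENT (UINT64_C(1) << 0)

typedef struct ToyHostIOMMUDevice {
    int hwpt_id;  /* currently attached HWPT, -1 when detached */
} ToyHostIOMMUDevice;

/* Callbacks the vIOMMU exposes to the device side (simplified signatures). */
typedef struct ToyIOMMUOps {
    bool (*set_iommu_device)(void *opaque, int devfn, ToyHostIOMMUDevice *hiod);
    void (*unset_iommu_device)(void *opaque, int devfn);
    uint64_t (*get_viommu_flags)(void *opaque);
} ToyIOMMUOps;

typedef struct ToyVIOMMU {
    bool flts;
    ToyHostIOMMUDevice *hiods[TOY_MAX_DEVFN];
} ToyVIOMMU;

static bool toy_set_iommu_device(void *opaque, int devfn,
                                 ToyHostIOMMUDevice *hiod)
{
    ToyVIOMMU *s = opaque;

    if (devfn < 0 || devfn >= TOY_MAX_DEVFN || s->hiods[devfn]) {
        return false;  /* out of range or already registered */
    }
    s->hiods[devfn] = hiod;
    return true;
}

static void toy_unset_iommu_device(void *opaque, int devfn)
{
    ToyVIOMMU *s = opaque;

    if (devfn >= 0 && devfn < TOY_MAX_DEVFN) {
        s->hiods[devfn] = NULL;
    }
}

static uint64_t toy_get_viommu_flags(void *opaque)
{
    ToyVIOMMU *s = opaque;

    /* Assumption: the nesting parent request follows the x-flts setting. */
    return s->flts ? TOY_VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}

static const ToyIOMMUOps toy_ops = {
    .set_iommu_device = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
    .get_viommu_flags = toy_get_viommu_flags,
};

/* vIOMMU -> device direction: bind a registered device to a HWPT. */
static bool toy_viommu_attach_hwpt(ToyVIOMMU *s, int devfn, int hwpt_id)
{
    if (devfn < 0 || devfn >= TOY_MAX_DEVFN || !s->hiods[devfn]) {
        return false;
    }
    s->hiods[devfn]->hwpt_id = hwpt_id;
    return true;
}
```

The point of the split is direction: the device side only ever calls through
toy_ops, while attaching a HWPT is initiated by the vIOMMU on a device it
already knows about.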
>
>See the diagram below:
>
>        VFIO Device                                 Intel IOMMU
>    .-----------------.                         .-------------------.
>    |                 |                         |                   |
>    |       .---------|PCIIOMMUOps              |.-------------.    |
>    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>    |       | Device  |------------------------>|| Device list |    |
>    |       .---------|(get_viommu_flags)       |.-------------.    |
>    |                 |                         |       |           |
>    |                 |                         |       V           |
>    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>    |       | link    |<------------------------|  |   Device    |  |
>    |       .---------|            (detach_hwpt)|  .-------------.  |
>    |                 |                         |       |           |
>    |                 |                         |       ...         |
>    .-----------------.                         .-------------------.
>
>Below is an example that enables first stage translation for a passthrough
>device:
>
>    -M q35,...
>    -device intel-iommu,x-scalable-mode=on,x-flts=on...
>    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>Test done:
>- VFIO devices hotplug/unplug
>- different VFIO devices linked to different iommufds
>- vhost net device ping test
>- migration with QAT passthrough
>
>PATCH01-08: Preparatory work
>PATCH09-10: Compatibility check between vIOMMU and Host IOMMU
>PATCH11-16: Implement first stage translation for passthrough device
>PATCH17-18: Add migration support and optimization
>PATCH19-21: Workaround for ERRATA_772415_SPR17
>PATCH22:    Enable first stage translation for passthrough device
>PATCH23:    Add doc
>
>QEMU code can be found at [2]; it is based on
>vfio-next + migration_relax_series[3].
>
>Fault event injection to the guest isn't supported in this series; we presume
>the guest kernel always constructs a correct first stage page table for the
>passthrough device. For emulated devices, the emulation code already provides
>first stage fault injection.
>
>TODO:
>- Fault event injection to guest when HW first stage page table faults
>
>[1]
>https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1
>[email protected]/
>[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v8
>[3]
>https://lore.kernel.org/qemu-devel/20251106042027.856594-1-zhenzhong.d
>[email protected]/
>
>Thanks
>Zhenzhong
>
>Changelog:
>v8:
>- add hw/i386/intel_iommu_accel.[hc] to hold accel code (Eric)
>- return bool for all vtd accel related functions (Cedric, Eric)
>- introduce a new PCIIOMMUOps::get_host_iommu_quirks() (Eric, Nicolin)
>- minor polish to comments and code (Cedric, Eric)
>- drop some R-b as they have changes needing review again
>
>v7:
>- s/host_iommu_extract_vendor_caps/host_iommu_extract_quirks (Nicolin)
>- s/RID_PASID/PASID_0 (Eric)
>- drop rid2pasid check in vtd_do_iommu_translate (Eric)
>- refine DID check in vtd_pasid_cache_sync_locked (Liuyi)
>- refine commit log (Nicolin, Eric, Liuyi)
>- Fix doc build (Cedric)
>- add migration support
>
>v6:
>- delete RPS capability related supporting code (Eric, Yi)
>- use terminology 'first/second stage' to replace 'first/second level'
>(Eric, Yi)
>- use get_viommu_flags() instead of get_viommu_caps() (Nicolin)
>- drop non-RID_PASID related code and simplify pasid invalidation handling
>(Eric, Yi)
>- drop the patch that handles pasid replay on context invalidation (Eric)
>- move vendor specific cap check from VFIO core to backend/iommufd.c
>(Nicolin)
>
>v5:
>- refine commit log of patch2 (Cedric, Nicolin)
>- introduce helper vfio_pci_from_vfio_device() (Cedric)
>- introduce helper vfio_device_viommu_get_nested() (Cedric)
>- pass 'bool bypass_ro' argument to vfio_listener_valid_section() instead of
>'VFIOContainerBase *' (Cedric)
>- fix a potential build error reported by Jim Shu
>
>v4:
>- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin,
>Donald, Shameer)
>- clarify that get_viommu_cap() returns purely emulated caps and explain the
>reason in the commit log (Eric)
>- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
>- refine doc comment and commit log in patch10-11 (Eric)
>
>v3:
>- define enum type for VIOMMU_CAP_* (Eric)
>- drop inline flag in the patch which uses the helper (Eric)
>- use extract64 in the newly introduced macros (Eric)
>- polish comments and fix typos (Eric)
>- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>- optimize bind/unbind error path processing
>
>v2:
>- introduce get_viommu_cap() to get STAGE1 flag to create nesting parent
>HWPT (Liuyi)
>- reuse VFIO's default HWPT as parent HWPT of nested translation (Nicolin,
>Liuyi)
>- abandon support of VFIO device under pcie-to-pci bridge to simplify design
>(Liuyi)
>- bypass RO mapping in VFIO's default HWPT if ERRATA_772415_SPR17 (Liuyi)
>- drop vtd_dev_to_context_entry optimization (Liuyi)
>
>v1:
>- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>- rebase to master
>
>
>Yi Liu (3):
>  intel_iommu_accel: Propagate PASID-based iotlb invalidation to host
>  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>    changed
>  intel_iommu: Replay pasid bindings after context cache invalidation
>
>Zhenzhong Duan (20):
>  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>    vtd_ce_get_pasid_entry
>  intel_iommu: Delete RPS capability related supporting code
>  intel_iommu: Update terminology to match VTD spec
>  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>  hw/pci: Introduce pci_device_get_viommu_flags()
>  intel_iommu: Implement get_viommu_flags() callback
>  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>  vfio/iommufd: Force creating nesting parent HWPT
>  intel_iommu_accel: Check for compatibility with IOMMUFD backed device
>    when x-flts=on
>  intel_iommu_accel: Fail passthrough device under PCI bridge if
>    x-flts=on
>  intel_iommu_accel: Stick to system MR for IOMMUFD backed host device
>    when x-flts=on
>  intel_iommu: Add some macros and inline functions
>  intel_iommu_accel: Bind/unbind guest page table to host
>  vfio/listener: Bypass readonly region for dirty tracking
>  intel_iommu: Add migration support with x-flts=on
>  hw/pci: Introduce pci_device_get_host_iommu_quirks()
>  intel_iommu_accel: Implement get_host_iommu_quirks() callback
>  Workaround for ERRATA_772415_SPR17
>  intel_iommu: Enable host device when x-flts=on in scalable mode
>  docs/devel: Add IOMMUFD nesting documentation
>
> MAINTAINERS                      |   2 +
> docs/devel/vfio-iommufd.rst      |  25 ++
> hw/i386/intel_iommu_accel.h      |  55 ++++
> hw/i386/intel_iommu_internal.h   | 155 ++++++---
> include/hw/i386/intel_iommu.h    |   5 +-
> include/hw/iommu.h               |  30 ++
> include/hw/pci/pci.h             |  55 ++++
> include/hw/vfio/vfio-container.h |   1 +
> include/hw/vfio/vfio-device.h    |   5 +
> hw/i386/intel_iommu.c            | 530 ++++++++++++++++++-------------
> hw/i386/intel_iommu_accel.c      | 272 ++++++++++++++++
> hw/pci/pci.c                     |  35 +-
> hw/vfio/device.c                 |  26 ++
> hw/vfio/iommufd.c                |  18 +-
> hw/vfio/listener.c               |  48 ++-
> tests/qtest/intel-iommu-test.c   |   4 +-
> hw/i386/Kconfig                  |   5 +
> hw/i386/meson.build              |   1 +
> hw/i386/trace-events             |   4 +
> hw/vfio/trace-events             |   1 +
> 20 files changed, 979 insertions(+), 298 deletions(-)
> create mode 100644 hw/i386/intel_iommu_accel.h
> create mode 100644 include/hw/iommu.h
> create mode 100644 hw/i386/intel_iommu_accel.c
>
>--
>2.47.1
