Kindly ping, any more comments?

Thanks
Zhenzhong

>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.d...@intel.com>
>Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>Hi,
>
>For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
>guest page table but pass the stage-1 page table to the host side to
>construct a nested domain. There was some effort to enable this feature
>in the past, see [1] for details.
>
>The key design is to utilize the dual-stage IOMMU translation (also known as
>IOMMU nested translation) capability in the host IOMMU. As the diagram below
>shows, the guest I/O page table pointer in GPA (guest physical address) is
>passed to the host and used to perform the stage-1 address translation. Along
>with it, modifications to present mappings in the guest I/O page table should
>be followed by an IOTLB invalidation.
>
>        .-------------.  .---------------------------.
>        |   vIOMMU    |  | Guest I/O page table      |
>        |             |  '---------------------------'
>        .----------------/
>        | PASID Entry |--- PASID cache flush --+
>        '-------------'                        |
>        |             |                        V
>        |             |           I/O page table pointer in GPA
>        '-------------'
>    Guest
>    ------| Shadow |---------------------------|--------
>          v        v                           v
>    Host
>        .-------------.  .------------------------.
>        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>        |             |  '------------------------'
>        .----------------/  |
>        | PASID Entry |     V (Nested xlate)
>        '----------------\.--------------------------------------.
>        |             |   | Stage2 for GPA->HPA, unmanaged domain|
>        |             |   '--------------------------------------'
>        '-------------'
>For historical reasons, there are different namings in different VT-d spec
>revisions, where:
> - Stage1 = First stage = First level = flts
> - Stage2 = Second stage = Second level = slts
><Intel VT-d Nested translation>
>
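
As a rough illustration of the nested translation in the diagram above (not
code from this series), the two stages can be modelled as two lookups applied
in sequence. Everything here is a hypothetical toy: ToyTable, toy_translate()
and toy_nested_translate() are made-up names, the tables are single-level, and
the sketch ignores that real hardware also walks the stage-1 table itself
through stage-2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_NPAGES     4

/* Hypothetical single-level table: index is the input page frame number. */
typedef struct ToyTable {
    uint64_t pfn[TOY_NPAGES];     /* output page frame number */
    int      present[TOY_NPAGES]; /* non-zero if the entry is valid */
} ToyTable;

/* Translate through one stage; returns 0 on success, -1 on a fault. */
static int toy_translate(const ToyTable *t, uint64_t in, uint64_t *out)
{
    uint64_t idx = in >> TOY_PAGE_SHIFT;

    assert(t != NULL && out != NULL);
    if (idx >= TOY_NPAGES || !t->present[idx]) {
        return -1;
    }
    *out = (t->pfn[idx] << TOY_PAGE_SHIFT) | (in & TOY_PAGE_MASK);
    return 0;
}

/*
 * Nested translation: stage-1 (guest owned, GIOVA->GPA) followed by
 * stage-2 (host owned, GPA->HPA). A fault in either stage fails the walk.
 */
static int toy_nested_translate(const ToyTable *stage1, const ToyTable *stage2,
                                uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    if (toy_translate(stage1, giova, &gpa) < 0) {
        return -1;
    }
    return toy_translate(stage2, gpa, hpa);
}
```

With a stage-1 table mapping page 0 to guest page 2 and a stage-2 table mapping
guest page 2 to host page 0x80, toy_nested_translate() turns GIOVA 0xabc into
HPA 0x80abc, while an unmapped GIOVA fails in stage-1.
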
>This series reuses the VFIO device's default hwpt as the nested parent instead
>of creating a new one. This avoids duplicating code for a new memory listener,
>and all existing features of the VFIO listener can be shared, e.g., ram
>discard, dirty tracking, etc. Two limitations are: 1) a VFIO device under a
>PCI bridge together with an emulated device is not supported, because the
>emulated device wants the IOMMU AS while the VFIO device sticks to the system
>AS; 2) kexec or reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off"
>is not supported, because the VFIO device's default hwpt is created with the
>NEST_PARENT flag and the kernel inhibits RO mappings when switching to shadow
>mode.
>
>This series is also prerequisite work for vSVA, i.e. sharing the guest
>application address space with passthrough devices.
>
>There are some interactions between VFIO and vIOMMU:
>* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>  subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>  instance to vIOMMU at vfio device realize stage.
>* vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>  VFIO calls it to get vIOMMU exposed capabilities.
>* vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>  to bind/unbind device to IOMMUFD backed domains, either nested
>  domain or not.
>
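
To make the first two interactions above concrete, here is a minimal sketch of
the callback flow, assuming simplified stand-in types. None of this is the
actual QEMU code: ToyIOMMUOps, ToyVIOMMU, ToyVFIODevice, the toy_* functions
and TOY_VIOMMU_CAP_HW_NESTED are hypothetical names that only mirror the roles
of PCIIOMMUOps, the vIOMMU, the VFIO device and VIOMMU_CAP_HW_NESTED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_VIOMMU_CAP_HW_NESTED (1ULL << 0) /* hypothetical cap bit */
#define TOY_MAX_DEVFN 256

typedef struct ToyHostIOMMUDevice { int devid; } ToyHostIOMMUDevice;

/* Stand-in for the callbacks the vIOMMU registers with the PCI subsystem. */
typedef struct ToyIOMMUOps {
    bool (*set_iommu_device)(void *opaque, int devfn, ToyHostIOMMUDevice *hiod);
    void (*unset_iommu_device)(void *opaque, int devfn);
    uint64_t (*get_viommu_cap)(void *opaque);
} ToyIOMMUOps;

/* Stand-in vIOMMU state: the x-flts knob plus a host IOMMU device list. */
typedef struct ToyVIOMMU {
    bool flts;
    ToyHostIOMMUDevice *devs[TOY_MAX_DEVFN];
} ToyVIOMMU;

static bool toy_set_iommu_device(void *opaque, int devfn,
                                 ToyHostIOMMUDevice *hiod)
{
    ToyVIOMMU *s = opaque;

    if (devfn < 0 || devfn >= TOY_MAX_DEVFN || s->devs[devfn]) {
        return false; /* out of range or already registered */
    }
    s->devs[devfn] = hiod;
    return true;
}

static void toy_unset_iommu_device(void *opaque, int devfn)
{
    ToyVIOMMU *s = opaque;

    if (devfn >= 0 && devfn < TOY_MAX_DEVFN) {
        s->devs[devfn] = NULL;
    }
}

/* Report purely emulated capabilities, independent of the host. */
static uint64_t toy_get_viommu_cap(void *opaque)
{
    ToyVIOMMU *s = opaque;

    return s->flts ? TOY_VIOMMU_CAP_HW_NESTED : 0;
}

static const ToyIOMMUOps toy_vtd_ops = {
    .set_iommu_device   = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
    .get_viommu_cap     = toy_get_viommu_cap,
};

/* Stand-in for the VFIO side at device realize time. */
typedef struct ToyVFIODevice {
    ToyHostIOMMUDevice hiod;
    int devfn;
    bool nest_parent; /* whether the default hwpt should be a nested parent */
} ToyVFIODevice;

static bool toy_vfio_realize(ToyVFIODevice *vdev, const ToyIOMMUOps *ops,
                             void *opaque)
{
    /* Query the vIOMMU caps first, then register the host IOMMU device. */
    vdev->nest_parent =
        (ops->get_viommu_cap(opaque) & TOY_VIOMMU_CAP_HW_NESTED) != 0;
    return ops->set_iommu_device(opaque, vdev->devfn, &vdev->hiod);
}
```

The ordering in toy_vfio_realize() is the point of the sketch: the capability
query decides how the default hwpt would be created before the device is
handed to the vIOMMU.
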
>See below diagram:
>
>        VFIO Device                                 Intel IOMMU
>    .-----------------.                         .-------------------.
>    |                 |                         |                   |
>    |       .---------|PCIIOMMUOps              |.-------------.    |
>    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>    |       | Device  |------------------------>|| Device list |    |
>    |       .---------|(get_viommu_cap)         |.-------------.    |
>    |                 |                         |       |           |
>    |                 |                         |       V           |
>    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>    |       | link    |<------------------------|  |   Device    |  |
>    |       .---------|            (detach_hwpt)|  .-------------.  |
>    |                 |                         |       |           |
>    |                 |                         |       ...         |
>    .-----------------.                         .-------------------.
>
>Below is an example to enable stage-1 translation for passthrough device:
>
>    -M q35,...
>    -device intel-iommu,x-scalable-mode=on,x-flts=on...
>    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>Test done:
>- VFIO devices hotplug/unplug
>- different VFIO devices linked to different iommufds
>- vhost net device ping test
>
>PATCH1-6:  Some preparing work
>PATCH7-8:  Compatibility check between vIOMMU and Host IOMMU
>PATCH9-17: Implement stage-1 page table for passthrough device
>PATCH18-19: Workaround for ERRATA_772415_SPR17
>PATCH20:   Enable stage-1 translation for passthrough device
>
>Qemu code can be found at [2]
>
>Fault reporting isn't supported in this series; we presume the guest kernel
>always constructs a correct stage-1 page table for the passthrough device. For
>emulated devices, the emulation code already provides stage-1 fault injection.
>
>TODO:
>- Fault report to guest when HW stage1 faults
>
>[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
>[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>
>Thanks
>Zhenzhong
>
>Changelog:
>v4:
>- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin,
>Donald, Shameer)
>- clarify get_viommu_cap() return pure emulated caps and explain reason in
>commit log (Eric)
>- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
>- refine doc comment and commit log in patch10-11 (Eric)
>
>v3:
>- define enum type for VIOMMU_CAP_* (Eric)
>- drop inline flag in the patch which uses the helper (Eric)
>- use extract64 in new introduced MACRO (Eric)
>- polish comments and fix typo error (Eric)
>- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>- optimize bind/unbind error path processing
>
>v2:
>- introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>hwpt (Liuyi)
>- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin,
>Liuyi)
>- abandon support of VFIO device under pcie-to-pci bridge to simplify design
>(Liuyi)
>- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
>- drop vtd_dev_to_context_entry optimization (Liuyi)
>
>v1:
>- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>- rebase to master
>
>rfcv3:
>- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter
>(Shameer)
>- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>- simplify return value check of get_cap() (Eric)
>- drop realize_late (Cedric, Eric)
>- split patch13:intel_iommu: Add PASID cache management infrastructure
>(Eric)
>- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>- refine comments (Eric, Donald)
>
>rfcv2:
>- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
>rebase
>- add two cleanup patches(patch9-10)
>- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>iommufd/devid/ioas_id
>- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
>and
>  iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
>
>Yi Liu (3):
>  intel_iommu: Replay pasid bindings after context cache invalidation
>  intel_iommu: Propagate PASID-based iotlb invalidation to host
>  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>    changed
>
>Zhenzhong Duan (17):
>  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>    vtd_ce_get_pasid_entry
>  hw/pci: Introduce pci_device_get_viommu_cap()
>  intel_iommu: Implement get_viommu_cap() callback
>  vfio/iommufd: Force creating nested parent domain
>  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>  intel_iommu: Check for compatibility with IOMMUFD backed device when
>    x-flts=on
>  intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
>  intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>  intel_iommu: Handle PASID entry removal and update
>  intel_iommu: Handle PASID entry addition
>  intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
>  intel_iommu: Stick to system MR for IOMMUFD backed host device when
>    x-fls=on
>  intel_iommu: Bind/unbind guest page table to host
>  vfio: Add a new element bypass_ro in VFIOContainerBase
>  Workaround for ERRATA_772415_SPR17
>  intel_iommu: Enable host device when x-flts=on in scalable mode
>
> MAINTAINERS                           |   1 +
> hw/i386/intel_iommu_internal.h        |  68 +-
> include/hw/i386/intel_iommu.h         |   9 +-
> include/hw/iommu.h                    |  17 +
> include/hw/pci/pci.h                  |  27 +
> include/hw/vfio/vfio-container-base.h |   1 +
> hw/i386/intel_iommu.c                 | 941 +++++++++++++++++++++++++-
> hw/pci/pci.c                          |  23 +-
> hw/vfio/iommufd.c                     |  22 +-
> hw/vfio/listener.c                    |  13 +-
> hw/i386/trace-events                  |   8 +
> 11 files changed, 1088 insertions(+), 42 deletions(-)
> create mode 100644 include/hw/iommu.h
>
>
>base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>--
>2.47.1

