>-----Original Message-----
>From: Cédric Le Goater <c...@redhat.com>
>Subject: Re: [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>On 5/21/25 13:14, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is the 2nd part, focusing on passthrough devices. We don't
>> shadow the guest page table for a passthrough device; instead we pass the
>> stage-1 page table to the host side to construct a nested domain. There was
>> some effort to enable this feature in the old days, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the diagram below shows, the guest I/O page table pointer in GPA
>> (guest physical address) is passed to the host and used to perform
>> the stage-1 address translation. In addition, modifications to
>> present mappings in the guest I/O page table should be followed
>> by an IOTLB invalidation.
>>
>>          .-------------.  .---------------------------.
>>          |   vIOMMU    |  | Guest I/O page table      |
>>          |             |  '---------------------------'
>>          .----------------/
>>          | PASID Entry |--- PASID cache flush --+
>>          '-------------'                        |
>>          |             |                        V
>>          |             |           I/O page table pointer in GPA
>>          '-------------'
>>      Guest
>>      ------| Shadow |---------------------------|--------
>>            v        v                           v
>>      Host
>>          .-------------.  .------------------------.
>>          |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>>          |             |  '------------------------'
>>          .----------------/  |
>>          | PASID Entry |     V (Nested xlate)
>>          '----------------\.--------------------------------------.
>>          |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>          |             |   '--------------------------------------'
>>          '-------------'
>> For historical reasons, different revisions of the VT-d spec use different
>> names, where:
>>   - Stage1 = First stage = First level = flts
>>   - Stage2 = Second stage = Second level = slts
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>    instance to vIOMMU at vfio device realize stage.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>    to bind/unbind device to IOMMUFD backed domains, either nested
>>    domain or not.
>>
>> See below diagram:
>>
>>          VFIO Device                                 Intel IOMMU
>>      .-----------------.                         .-------------------.
>>      |                 |                         |                   |
>>      |       .---------|PCIIOMMUOps              |.-------------.    |
>>      |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>      |       | Device  |------------------------>|| Device list |    |
>>      |       .---------|(unset_iommu_device)     |.-------------.    |
>>      |                 |                         |       |           |
>>      |                 |                         |       V           |
>>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>      |       | link    |<------------------------|  |   Device    |  |
>>      |       .---------|            (detach_hwpt)|  .-------------.  |
>>      |                 |                         |       |           |
>>      |                 |                         |       ...         |
>>      .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, this design is optimal in that it shares ioas/hwpt
>> whenever possible and creates new ones on demand. It also supports multiple
>> iommufd objects and ERRATA_772415.
>>
>> E.g., under one guest's scope, a stage-2 page table can be shared by different
>> devices if there is no conflict and the devices link to the same iommufd object,
>> i.e. devices under the same host IOMMU can share the same stage-2 page table. If
>> there is a conflict, e.g. one device is in non-cache-coherent (non-CC) mode
>> unlike the others, it requires a separate stage-2 page table in non-CC mode.
>>
>> The SPR platform has ERRATA_772415, which requires that there be no readonly
>> mappings in the stage-2 page table. This series supports creating a
>> VTDIOASContainer with no readonly mappings. In the rare case that some IOMMUs
>> on a multiple-IOMMU host have ERRATA_772415 and others do not, this
>> design still works.
>>
>> See below example diagram for a full view:
>>
>>        IntelIOMMUState
>>               |
>>               V
>>      .------------------.    .------------------.    .-------------------.
>>      | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>      | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
>>      .------------------.    .------------------.    .-------------------.
>>               |                       |                              |
>>               |                       .-->...                        |
>>               V                                                      V
>>        .-------------------.    .-------------------.          .---------------.
>>        |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>>        .-------------------.    .-------------------.          .---------------.
>>            |            |               |                            |
>>            |            |               |                            |
>>      .-----------.  .-----------.  .------------.              .------------.
>>      | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>      | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>      | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>      |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>      .-----------.  .-----------.  .------------.              .------------.
>>
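The sharing rule in the diagram above can be sketched roughly as below. This
is only a minimal illustration, not the code in the series: every type and
function name here (IommuState, IOASContainer, S2Hwpt, get_s2_hwpt) is
hypothetical, and compatibility is reduced to a single CC flag as an
assumption. A container is looked up by (iommufd, no-readonly-mappings), a
stage-2 hwpt under it is looked up by CC mode, and either one is created on
demand when no match exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* All names below are hypothetical and exist only for this sketch. */

typedef struct S2Hwpt {
    bool cc;                  /* cache coherent (CC) or non-CC */
    unsigned users;           /* devices sharing this stage-2 page table */
    struct S2Hwpt *next;
} S2Hwpt;

typedef struct IOASContainer {
    int iommufd_id;           /* which iommufd object backs this container */
    bool rw_only;             /* true: no readonly mappings (errata case) */
    S2Hwpt *hwpts;
    struct IOASContainer *next;
} IOASContainer;

typedef struct IommuState {
    IOASContainer *containers;
} IommuState;

/* Find a container matching (iommufd, rw_only), or create one on demand. */
static IOASContainer *get_container(IommuState *s, int iommufd_id, bool rw_only)
{
    assert(s != NULL);
    for (IOASContainer *c = s->containers; c; c = c->next) {
        if (c->iommufd_id == iommufd_id && c->rw_only == rw_only) {
            return c;
        }
    }
    IOASContainer *c = calloc(1, sizeof(*c));
    if (!c) {
        return NULL;
    }
    c->iommufd_id = iommufd_id;
    c->rw_only = rw_only;
    c->next = s->containers;
    s->containers = c;
    return c;
}

/* Share an existing stage-2 hwpt when the CC mode matches, else create one. */
static S2Hwpt *get_s2_hwpt(IommuState *s, int iommufd_id, bool has_errata,
                           bool cc)
{
    IOASContainer *c = get_container(s, iommufd_id, has_errata);
    if (!c) {
        return NULL;
    }
    for (S2Hwpt *h = c->hwpts; h; h = h->next) {
        if (h->cc == cc) {
            h->users++;       /* no conflict: reuse the page table */
            return h;
        }
    }
    S2Hwpt *h = calloc(1, sizeof(*h));
    if (!h) {
        return NULL;
    }
    h->cc = cc;
    h->users = 1;
    h->next = c->hwpts;
    c->hwpts = h;
    return h;
}
```

With this sketch, two CC devices on iommufd0 end up on the same S2Hwpt, while
a non-CC device, a device on iommufd1, or a device behind an IOMMU with the
errata each get a separate one.
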
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable stage-1 translation, just add "x-scalable-mode=on,x-flts=on",
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> A passthrough device should use the iommufd backend to work with stage-1
>> translation,
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If the host doesn't support nested translation, QEMU will fail with an
>> unsupported report.
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> Fault reporting isn't supported in this series; we presume the guest kernel
>> always constructs a correct S1 page table for a passthrough device. For
>> emulated devices, the emulation code already provides S1 fault injection.
>>
>> PATCH1-6:  Add HWPT-based nesting infrastructure support
>
>The first 6 patches are all VFIO or IOMMUFD related. They are
>mostly additions and I didn't see anything wrong. They could
>be merged in advance through the VFIO tree.

OK, I'll soon send a prerequisite series containing only the first 6 patches,
with the suggested changes applied.

Thanks
Zhenzhong
