Hi Donald,

>-----Original Message-----
>From: Donald Dutile <ddut...@redhat.com>
>Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>Zhenzhong,
>
>Hi!
>Eric asked me to review this series.
>Since it's been a while since you posted, I'll summarize review feedback
>below/bottom.
>
>- Don
>
>On 2/19/25 3:22 AM, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is the 2nd part, focusing on passthrough devices. We don't do
>> shadowing of the guest page table for passthrough devices but pass the
>> stage-1 page table to the host side to construct a nested domain. There was
>> some effort to enable this feature in the past; see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, the guest I/O page table pointer in GPA
>> (guest physical address) is passed to the host and used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> by an IOTLB invalidation.
>>
>>          .-------------.  .---------------------------.
>>          |   vIOMMU    |  | Guest I/O page table      |
>>          |             |  '---------------------------'
>>          .----------------/
>>          | PASID Entry |--- PASID cache flush --+
>>          '-------------'                        |
>>          |             |                        V
>>          |             |           I/O page table pointer in GPA
>>          '-------------'
>>      Guest
>>      ------| Shadow |---------------------------|--------
>>            v        v                           v
>>      Host
>>          .-------------.  .------------------------.
>>          |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>          |             |  '------------------------'
>>          .----------------/  |
>>          | PASID Entry |     V (Nested xlate)
>>          '----------------\.----------------------------------.
>>          |             |   | SS for GPA->HPA, unmanaged domain|
>>          |             |   '----------------------------------'
>>          '-------------'
>> Where:
>>   - FS = First stage page tables
>>   - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second 
>stage.
>We don't need different terms for the same technology in the iommu/iommufd
>space(s).

OK, then I'd like to use stage1 and stage2 everywhere, which is a bit more verbose.

>
>> There are some interactions between VFIO and vIOMMU:
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
>>    subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>>    instance with the vIOMMU at vfio device realize time.
>> * vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>    to bind/unbind a device to/from IOMMUFD-backed domains, nested
>>    or not.
>>
>> See below diagram:
>>
>>          VFIO Device                                 Intel IOMMU
>>      .-----------------.                         .-------------------.
>>      |                 |                         |                   |
>>      |       .---------|PCIIOMMUOps              |.-------------.    |
>>      |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>      |       | Device  |------------------------>|| Device list |    |
>>      |       .---------|(unset_iommu_device)     |.-------------.    |
>>      |                 |                         |       |           |
>>      |                 |                         |       V           |
>>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>      |       | link    |<------------------------|  |   Device    |  |
>>      |       .---------|            (detach_hwpt)|  .-------------.  |
>>      |                 |                         |       |           |
>>      |                 |                         |       ...         |
>>      .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, this design shares ioas/hwpt whenever possible
>> and creates new ones on demand; it also supports multiple iommufd
>> objects and ERRATA_772415.
>>
>> E.g., a stage-2 page table could be shared by different devices if there
>> is no conflict and the devices link to the same iommufd object, i.e. devices
>> under the same host IOMMU can share the same stage-2 page table. If there is
>and 'devices under the same guest'.
>Different guests can't share the same stage-2 page tables.

Yes, will update.

>
>> a conflict, i.e. one device is in non-cache-coherency mode while the
>> others are not, it requires a separate stage-2 page table in
>> non-CC mode.
>>
>> The SPR platform has ERRATA_772415, which requires that there be no
>> readonly mappings in the stage-2 page table. This series supports creating
>> a VTDIOASContainer with no readonly mappings. Even in the rare case that
>> some IOMMUs on a multi-IOMMU host have ERRATA_772415 and others do not,
>> this design still works.
>>
>> See below example diagram for a full view:
>>
>>        IntelIOMMUState
>>               |
>>               V
>>      .------------------.    .------------------.    .-------------------.
>>      | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>      | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>      .------------------.    .------------------.    .-------------------.
>>               |                       |                              |
>>               |                       .-->...                        |
>>               V                                                      V
>>        .-------------------.    .-------------------.          .---------------.
>>        |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>>        .-------------------.    .-------------------.          .---------------.
>>            |            |               |                            |
>>            |            |               |                            |
>>      .-----------.  .-----------.  .------------.              .------------.
>>      | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>      | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>      | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>      |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>      .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also prerequisite work for vSVA, i.e. sharing
>> guest application address spaces with passthrough devices.
>>
>> To enable stage-1 translation, one only needs to add
>> "x-scalable-mode=on,x-flts=on",
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> A passthrough device should use the iommufd backend to work with stage-1
>> translation,
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If the host doesn't support nested translation, QEMU will fail with an
>> "unsupported" error report.
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> PATCH1-8:  Add HWPT-based nesting infrastructure support
>> PATCH9-10: Some cleanup work
>> PATCH11:   cap/ecap related compatibility check between vIOMMU and Host
>> IOMMU
>> PATCH12-19:Implement stage-1 page table for passthrough device
>> PATCH20:   Enable stage-1 translation for passthrough device
>>
>> Qemu code can be found at [3]
>>
>> TODO:
>> - RAM discard
>> - dirty tracking on stage-2 page table
>>
>> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
>> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l....@intel.com/
>> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>>
>> Thanks
>> Zhenzhong
>>
>> Changelog:
>> rfcv2:
>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
>> - add two cleanup patches(patch9-10)
>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>>   iommufd/devid/ioas_id
>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>>    iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>
>> Yi Liu (3):
>>    intel_iommu: Replay pasid binds after context cache invalidation
>>    intel_iommu: Propagate PASID-based iotlb invalidation to host
>>    intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>>
>> Zhenzhong Duan (17):
>>    backends/iommufd: Add helpers for invalidating user-managed HWPT
>>    vfio/iommufd: Add properties and handlers to
>>      TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>    HostIOMMUDevice: Introduce realize_late callback
>>    vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>>    vfio/iommufd: Implement [at|de]tach_hwpt handlers
>>    host_iommu_device: Define two new capabilities
>>      HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>    iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>    iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>>    intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>>      vtd_ce_get_pasid_entry
>>    intel_iommu: Optimize context entry cache utilization
>>    intel_iommu: Check for compatibility with IOMMUFD backed device when
>>      x-flts=on
>>    intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>>    intel_iommu: Add PASID cache management infrastructure
>>    intel_iommu: Bind/unbind guest page table to host
>>    intel_iommu: ERRATA_772415 workaround
>>    intel_iommu: Bypass replay in stage-1 page table mode
>>    intel_iommu: Enable host device when x-flts=on in scalable mode
>>
>>   hw/i386/intel_iommu_internal.h     |   56 +
>>   include/hw/i386/intel_iommu.h      |   33 +-
>>   include/system/host_iommu_device.h |   40 +
>>   include/system/iommufd.h           |   53 +
>>   backends/iommufd.c                 |   58 +
>>   hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>>   hw/vfio/common.c                   |   17 +-
>>   hw/vfio/iommufd.c                  |   48 +
>>   backends/trace-events              |    1 +
>>   hw/i386/trace-events               |   13 +
>>   10 files changed, 1776 insertions(+), 203 deletions(-)
>>
>Relative to the patches:
>Patch 1: As Eric alluded to, a proper description for the patch should be
>written,
>and the title should change 'helpers' to 'helper'

This is addressed in [1]

[1] 
https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting_rfcv3.wip/

>
>Patch 2:
>(1) Introduce 'realize_late()' interface, but leave the reader wondering 'ah, 
>why?
>what?' ... after reading farther down the series, I learn more about 
>realize_late(),
>but more on that later...

realize_late() has been removed in [1]

>(2) For my education, can you provide ptrs to VFIO & VDPA code paths that
>demonstrate the need for different [at|de]tach_<>_hwpt()

Google helped me find this link: https://lkml.iu.edu/2309.2/08079.html,
specifically
https://gitlab.com/lulu6/gitlabqemutmp/-/commit/354870ff6bd9bac80c9fc5c7f944331cb24b0331
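
For the VFIO side, the attach boils down to the VFIO cdev uAPI; roughly
something like the sketch below (the wrapper function and its callers are
illustrative, not the actual code in this series):

/*
 * Hedged sketch: attach a VFIO cdev device to a HWPT (nested or stage-2)
 * with VFIO_DEVICE_ATTACH_IOMMUFD_PT.  QEMU's wrappers and error handling
 * are omitted.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int example_vfio_attach_hwpt(int device_fd, uint32_t hwpt_id)
{
    struct vfio_device_attach_iommufd_pt attach = {
        .argsz = sizeof(attach),
        .flags = 0,
        .pt_id = hwpt_id,   /* an IOAS id or a HWPT id from iommufd */
    };

    return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}

As I understand it, vhost-vdpa would bind to an iommufd page table through
its own uAPI (see the second link above), which is why [at|de]tach_hwpt are
per-backend callbacks rather than a single implementation.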

>
>Patch 3: Why can't the realize() be moved to after attach?  Isn't realize()
>supposed to indicate 'all is set up and the object can now be used'? Apologies
>for what could be a dumb question, as that's my understanding of realize().
>If the argument is such that there needs to be two steps, how does the first
>realize() that puts the object into a used state <somehow> wait until
>realize_late()?
>
>Patch 4: Shouldn't the current/existing realize callback just be overwritten 
>with
>the later one, when this is needed?

realize_late() has been removed in [1]

>
>Patch 5: no issues.
>
>Patch 6: ewww -- fs1gp ... we use underscores all over the place for multi-word
>elements; so how about 's1_1g_pg'?
>           -- in how many places is that really used such that multiple
>underscores are an issue?

This is VT-d specific, so I'd like to follow the VT-d spec's naming so that we
can easily find its definition by searching for fs1gp.

>
>Patch 7: intel-iommu-specific callbacks in the common vfio & iommufd-backend
>code; nack. This won't compile w/intel-iommu included with iommufd... I think
>backend, intel-iommu hw-caps should provide the generic 'caps' boolean-type
>values/states; ... and maybe they should be extracted via vfio? .... like
>      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>          return vfio_device_get_aw_bits(hiod->agent);
>
>Patch 8: Again, VTD-specific code in IOMMUFD is a nack; again, maybe via vfio,
>or a direct call into an iommu-device-cap api.

Patch 7&8 addressed in [1]

>
>Patch 9: no issues.
>
>Patch 10: "except it's stale"  likely "except when its entry is stale" ?

addressed in [1]

>           Did you ever put some tracing in to capture avg hits in cache? ... 
> if so, add
>as a comment.
>           Otherwise, looks good.
>
>Patch 11: Apologies, I don't know what 'flts' stands for, and why it is 
>relative to 2-
>stage mapping, or SIOV.  Could you add verbiage to explain the use of it, as the
>rest of this patch doesn't make any sense to me without the background.
>The patch introduces hw-info-type (none or intel), and then goes on to add a
>large set of checks; seems like the caps & this checking should go together 
>(split
>for each cap; put all caps together & the check...).

OK, will do. There are some explanations in the cover letter.
For historical reasons, the old VT-d spec defined stage-1 as "first level" and
later switched to "first stage".
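
Regarding folding the caps and the checks together: conceptually, with
x-flts=on the host IOMMU must report nesting support, roughly as in the
sketch below (helper and field names are illustrative, not the actual
identifiers in the patch):

/*
 * Hedged sketch: fail device realize when x-flts=on but the host IOMMU
 * cannot do nested (stage-1 over stage-2) translation.
 */
static bool example_check_flts_compat(IntelIOMMUState *s,
                                      HostIOMMUDevice *hiod, Error **errp)
{
    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
    int ret;

    if (!s->flts) {
        return true;    /* stage-1 passthrough not requested */
    }

    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp);
    if (ret < 0) {
        return false;   /* errp already set by get_cap() */
    }
    if (!ret) {
        error_setg(errp, "host IOMMU lacks nested translation support "
                   "required by x-flts=on");
        return false;
    }

    return true;
}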

>
>Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-specific
>element, opaque in HostIOMMUDevice, but set to specific IOMMU in use?   e.g.
>void *hostiommustate;

Yes, that's possible, but we want to keep the interface between VFIO/VDPA and
the vIOMMU generic.
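
The shape we are aiming for is roughly the following (approximated from the
cover letter; type and member names may differ slightly from the actual
patches): the vIOMMU only sees the generic class plus the IOMMUFD-specific
hooks, so a VDPA implementation can plug in later without vIOMMU changes.

/* Hedged sketch of the class layout implied by the cover letter. */
typedef struct HostIOMMUDeviceIOMMUFDClass {
    HostIOMMUDeviceClass parent_class;

    /* Called by the vIOMMU to move the device onto a given HWPT
     * (nested or stage-2) and back off it. */
    bool (*attach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, uint32_t hwpt_id,
                        Error **errp);
    bool (*detach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, Error **errp);
} HostIOMMUDeviceIOMMUFDClass;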

>
>Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each
>PASID have its own address space?

Yes, it is.
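
Concretely, the address-space cache key just grows a pasid field next to
bus/devfn, along the lines of the sketch below (simplified; close to what
hw/i386/intel_iommu.c already uses, but not quoted verbatim):

/*
 * Hedged sketch: the AS lookup key widens from (bus, devfn) to
 * (bus, devfn, pasid), so each PASID gets its own VTDAddressSpace just
 * like any other device's address space.
 */
struct example_vtd_as_key {
    PCIBus *bus;
    uint8_t devfn;
    uint32_t pasid;     /* PCI_NO_PASID selects the default AS */
};

So yes, this is effectively the generalized BDF+PASID address-space lookup.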

>So, why isn't it handled via a unique AS cache like 'any other device'?  Maybe
>I'm
>thinking too SMMU-StreamID, which can be varying length, depending on
>subsystem support.  I see what appears to be sid+pasid calls to do the AS 
>lookups;
>hmm, so maybe this is the generalized BDF+pasid AS lookup?  if so, maybe a
>better description stating this transition to a wider stream-id would set the 
>code
>context better.

Sorry, I don't quite get this point.

>As for the rest of the (400 intel-iommu) code, I'm not that in-depth in 
>intel-iommu
>to determine if it's all correct or not.
>
>Patch 14: Define PGTT; the second paragraph seems self-contradicting -- it says
>it
>uses a 2-stage page table in each case, but it implies it should be different. 
> At 580
>lines of code changes, you win! ;-)

Whether the host side uses a nested page table or only a stage-2 page table
depends on the PGTT setting in the guest's PASID table entry.
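
In sketch form (the PGTT encodings follow the VT-d spec; the helpers and
constants here are made up for illustration, not the functions in this
series):

/*
 * Hedged sketch: the PGTT field of the guest's scalable-mode PASID table
 * entry decides which kind of host HWPT the device gets attached to.
 */
enum {
    EXAMPLE_PGTT_FS     = 0x1,  /* first-stage only  */
    EXAMPLE_PGTT_SS     = 0x2,  /* second-stage only */
    EXAMPLE_PGTT_NESTED = 0x3,  /* nested            */
    EXAMPLE_PGTT_PT     = 0x4,  /* pass-through      */
};

static int example_bind_by_pgtt(VTDAddressSpace *vtd_as, VTDPASIDEntry *pe)
{
    switch (example_pe_get_pgtt(pe)) {
    case EXAMPLE_PGTT_FS:
        /* Guest programs a first-stage table: attach a nested HWPT whose
         * stage-1 pointer is the guest page-table GPA. */
        return example_attach_nested_hwpt(vtd_as, pe);
    case EXAMPLE_PGTT_PT:
        /* Guest requests pass-through: the stage-2 HWPT alone is enough. */
        return example_attach_s2_hwpt(vtd_as);
    default:
        return -EINVAL;     /* other PGTT values not expected here */
    }
}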

>
>Patch 15: Read-only and read/write areas have different IOMMUFDs?  Is that an
>intel-iommu requirement?
>           At least this intel-iommu-errata code is only in hw/i386/<> modules.

No. With ERRATA_772415, read-only areas should not be mapped, so we allocate a
new VTDIOASContainer that holds only the read/write mappings.
Different VTDIOASContainers can share the same IOMMUFD.
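
In other words, the filtering is done per VTDIOASContainer, roughly like the
sketch below (member names are illustrative):

/*
 * Hedged sketch: the memory listener feeding an ERRATA_772415 container
 * drops read-only sections, so the stage-2 page table behind it never
 * contains read-only mappings.  Containers with and without the errata
 * can still sit on the same iommufd.
 */
static void example_errata_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
{
    VTDIOASContainer *container = container_of(listener,
                                               VTDIOASContainer, listener);

    if (container->errata && section->readonly) {
        return;     /* skip read-only ranges entirely */
    }

    /* ... map the section into the container's IOAS as usual ... */
}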

>
>Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV",
>"CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?

VTD_PASID_CACHE_DEVSI stands for 'PASID cache device-selective invalidation',
and VTD_PASID_CACHE_DOMSI means 'PASID cache domain-selective invalidation'.
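
For reference, the granularities map roughly to the following (illustrative;
the exact enumerators in the patch may differ):

/* Hedged sketch of the PASID-cache invalidation granularities discussed. */
enum {
    EXAMPLE_PASID_CACHE_DOMSI,      /* domain-selective invalidation  */
    EXAMPLE_PASID_CACHE_PASIDSI,    /* PASID-selective invalidation   */
    EXAMPLE_PASID_CACHE_DEVSI,      /* device-selective invalidation  */
};

So the 'SI' suffix means "selective invalidation" rather than stream
invalidation.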

Thanks
Zhenzhong
