Hi Donald, >-----Original Message----- >From: Donald Dutile <ddut...@redhat.com> >Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for >passthrough device > >Zhenzhong, > >Hi! >Eric asked me to review this series. >Since it's rather late since you posted will summarize review feedback >below/bottom. > >- Don > >On 2/19/25 3:22 AM, Zhenzhong Duan wrote: >> Hi, >> >> Per Jason Wang's suggestion, iommufd nesting series[1] is split into >> "Enable stage-1 translation for emulated device" series and >> "Enable stage-1 translation for passthrough device" series. >> >> This series is 2nd part focusing on passthrough device. We don't do >> shadowing of guest page table for passthrough device but pass stage-1 >> page table to host side to construct a nested domain. There was some >> effort to enable this feature in old days, see [2] for details. >> >> The key design is to utilize the dual-stage IOMMU translation >> (also known as IOMMU nested translation) capability in host IOMMU. >> As the below diagram shows, guest I/O page table pointer in GPA >> (guest physical address) is passed to host and be used to perform >> the stage-1 address translation. Along with it, modifications to >> present mappings in the guest I/O page table should be followed >> with an IOTLB invalidation. >> >> .-------------. .---------------------------. >> | vIOMMU | | Guest I/O page table | >> | | '---------------------------' >> .----------------/ >> | PASID Entry |--- PASID cache flush --+ >> '-------------' | >> | | V >> | | I/O page table pointer in GPA >> '-------------' >> Guest >> ------| Shadow |---------------------------|-------- >> v v v >> Host >> .-------------. .------------------------. >> | pIOMMU | | FS for GIOVA->GPA | >> | | '------------------------' >> .----------------/ | >> | PASID Entry | V (Nested xlate) >> '----------------\.----------------------------------. >> | | | SS for GPA->HPA, unmanaged domain| >> | | '----------------------------------' >> '-------------' >> Where: >> - FS = First stage page tables >> - SS = Second stage page tables >> <Intel VT-d Nested translation> >> >I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second >stage. >We don't need different terms for the same technology in the iommu/iommufd >space(s).
OK, then I'd like to use stage1 and stage2 everywhere which is more verbose. > >> There are some interactions between VFIO and vIOMMU >> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI >> subsystem. VFIO calls them to register/unregister HostIOMMUDevice >> instance to vIOMMU at vfio device realize stage. >> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt >> to bind/unbind device to IOMMUFD backed domains, either nested >> domain or not. >> >> See below diagram: >> >> VFIO Device Intel IOMMU >> .-----------------. .-------------------. >> | | | | >> | .---------|PCIIOMMUOps |.-------------. | >> | | IOMMUFD |(set_iommu_device) || Host IOMMU | | >> | | Device |------------------------>|| Device list | | >> | .---------|(unset_iommu_device) |.-------------. | >> | | | | | >> | | | V | >> | .---------| HostIOMMUDeviceIOMMUFD | .-------------. | >> | | IOMMUFD | (attach_hwpt)| | Host IOMMU | | >> | | link |<------------------------| | Device | | >> | .---------| (detach_hwpt)| .-------------. | >> | | | | | >> | | | ... | >> .-----------------. .-------------------. >> >> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt >> whenever possible and create new one on demand, also supports multiple >> iommufd objects and ERRATA_772415. >> >> E.g., Stage-2 page table could be shared by different devices if there >> is no conflict and devices link to same iommufd object, i.e. devices >> under same host IOMMU can share same stage-2 page table. If there is >and 'devices under the same guest'. >Different guests cant be sharing the same stage-2 page tables. Yes, will update. > >> conflict, i.e. there is one device under non cache coherency mode >> which is different from others, it requires a separate stage-2 page >> table in non-CC mode. >> >> SPR platform has ERRATA_772415 which requires no readonly mappings >> in stage-2 page table. This series supports creating VTDIOASContainer >> with no readonly mappings. If there is a rare case that some IOMMUs >> on a multiple IOMMU host have ERRATA_772415 and others not, this >> design can still survive. >> >> See below example diagram for a full view: >> >> IntelIOMMUState >> | >> V >> .------------------. .------------------. .-------------------. >> | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer |-- >>... >> | (iommufd0,RW&RO) | | (iommufd1,RW&RO) | | (iommufd0,RW only)| >> .------------------. .------------------. .-------------------. >> | | | >> | .-->... | >> V V >> .-------------------. .-------------------. >> .---------------. >> | VTDS2Hwpt(CC) |--->| VTDS2Hwpt(non-CC) |-->... | >> VTDS2Hwpt(CC) |- >->... >> .-------------------. .-------------------. >> .---------------. >> | | | | >> | | | | >> .-----------. .-----------. .------------. .------------. >> | IOMMUFD | | IOMMUFD | | IOMMUFD | | IOMMUFD | >> | Device(CC)| | Device(CC)| | Device | | Device(CC) | >> | (iommufd0)| | (iommufd0)| | (non-CC) | | (errata) | >> | | | | | (iommufd0) | | (iommufd0) | >> .-----------. .-----------. .------------. .------------. >> >> This series is also a prerequisite work for vSVA, i.e. Sharing >> guest application address space with passthrough devices. >> >> To enable stage-1 translation, only need to add >> "x-scalable-mode=on,x-flts=on". >> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on... >> >> Passthrough device should use iommufd backend to work with stage-1 >translation. >> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,... >> >> If host doesn't support nested translation, qemu will fail with an >> unsupported >> report. >> >> Test done: >> - VFIO devices hotplug/unplug >> - different VFIO devices linked to different iommufds >> - vhost net device ping test >> >> PATCH1-8: Add HWPT-based nesting infrastructure support >> PATCH9-10: Some cleanup work >> PATCH11: cap/ecap related compatibility check between vIOMMU and Host >IOMMU >> PATCH12-19:Implement stage-1 page table for passthrough device >> PATCH20: Enable stage-1 translation for passthrough device >> >> Qemu code can be found at [3] >> >> TODO: >> - RAM discard >> - dirty tracking on stage-2 page table >> >> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html >> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645- >1-yi.l....@intel.com/ >> [3] >https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2 >> >> Thanks >> Zhenzhong >> >> Changelog: >> rfcv2: >> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi) >> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase >> - add two cleanup patches(patch9-10) >> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of >iommufd/devid/ioas_id >> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and >> iommu pasid, this is important for dropping VTDPASIDAddressSpace >> >> Yi Liu (3): >> intel_iommu: Replay pasid binds after context cache invalidation >> intel_iommu: Propagate PASID-based iotlb invalidation to host >> intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed >> >> Zhenzhong Duan (17): >> backends/iommufd: Add helpers for invalidating user-managed HWPT >> vfio/iommufd: Add properties and handlers to >> TYPE_HOST_IOMMU_DEVICE_IOMMUFD >> HostIOMMUDevice: Introduce realize_late callback >> vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler >> vfio/iommufd: Implement [at|de]tach_hwpt handlers >> host_iommu_device: Define two new capabilities >> HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] >> iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] >> iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA >> intel_iommu: Rename vtd_ce_get_rid2pasid_entry to >> vtd_ce_get_pasid_entry >> intel_iommu: Optimize context entry cache utilization >> intel_iommu: Check for compatibility with IOMMUFD backed device when >> x-flts=on >> intel_iommu: Introduce a new structure VTDHostIOMMUDevice >> intel_iommu: Add PASID cache management infrastructure >> intel_iommu: Bind/unbind guest page table to host >> intel_iommu: ERRATA_772415 workaround >> intel_iommu: Bypass replay in stage-1 page table mode >> intel_iommu: Enable host device when x-flts=on in scalable mode >> >> hw/i386/intel_iommu_internal.h | 56 + >> include/hw/i386/intel_iommu.h | 33 +- >> include/system/host_iommu_device.h | 40 + >> include/system/iommufd.h | 53 + >> backends/iommufd.c | 58 + >> hw/i386/intel_iommu.c | 1660 ++++++++++++++++++++++++---- >> hw/vfio/common.c | 17 +- >> hw/vfio/iommufd.c | 48 + >> backends/trace-events | 1 + >> hw/i386/trace-events | 13 + >> 10 files changed, 1776 insertions(+), 203 deletions(-) >> >Relative to the patches: >Patch 1: As Eric eluded to, a proper description for the patch should be >written, >and the title should change 'helpers' to 'helper' This is addressed in [1] [1] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting_rfcv3.wip/ > >Patch 2: >(1) Introduce 'realize_late()' interface, but leave the reader wondering 'ah, >why? >what?' ... after reading farther down the series, I learn more about >realize_late(), >but more on that later... realize_late() has been removed in [1] >(2) For my education, can you provide ptrs to VFIO & VPDA code paths that >demonstrate the need for different [at|de]tach_<>_hwpt() Google help me find this link https://lkml.iu.edu/2309.2/08079.html specially https://gitlab.com/lulu6/gitlabqemutmp/-/commit/354870ff6bd9bac80c9fc5c7f944331cb24b0331 > >Patch 3: Why can't the realize() be moved to after attach? isn't realize() >suppose >to indicate 'all is setup and object can now be used' -- apologies for what >could >be a dumb question, as that's my understanding of realize(). If the argument >is >such that there needs to be two steps, how does the first realize() that put >the >object into a used state <somehow> wait until realize_late()? > >Patch 4: Shouldn't the current/existing realize callback just be overwritten >with >the later one, when this is needed? realize_late() has been removed in [1] > >Patch 5: no issues. > >Patch 6: ewww -- fs1gp ... we use underlines all over the place for multi-word >elements; so how about 's1_1g_pg' > -- how many places is that really used that multiple underlines is > an issue? This is vtd specific, so to follow VTD spec's naming so that we can easily find its definition by searching fs1gp. > >Patch 7: intel-iommu-specific callbacks in the common vfio & iommufd-backend >code; nack. This won't compile w/intel-iommu included with iommufd... I think >backend, intel-iommu hw-caps should provide the generic 'caps' boolean-type >values/states; ... and maybe they should be extracted via vfio? .... like > case HOST_IOMMU_DEVICE_CAP_AW_BITS: > return vfio_device_get_aw_bits(hiod->agent); > >Patch 8: Again, VTD-specific code in IOMMUFD is a nack; again, maybe via vfio, >or a direct call into an iommu-device-cap api. Patch 7&8 addressed in [1] > >Patch 9: no issues. > >Patch 10: "except it's stale" likely "except when it's entry is stale" ? addressed in [1] > Did you ever put some tracing in to capture avg hits in cache? ... > if so, add >as a comment. > Otherwise, looks good. > >Patch 11: Apologies, I don't know what 'flts' stands for, and why it is >relative to 2- >stage mapping, or SIOV. Could you add verbage to explain the use of it, as the >rest of this patch doesn't make any sense to me without the background. >The patch introduces hw-info-type (none or intel), and then goes on to add a >large set of checks; seems like the caps & this checking should go together >(split >for each cap; put all caps together & the check...). OK, will do. There are some explanations in cover-letter. For history reason, old vtd spec define stage-1 as first level then switch to first stage. > >Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-specif >element, opaque in HostIOMMUDevice, but set to specific IOMMU in use? e.g. >void *hostiommustate; Yes, that's possible, but we want to make a generic interface between VFIO/VDPA and vIOMMU. > >Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each >PASID have its own address space? Yes, it is. >So, why isn't it handle via a uniqe AS cache like 'any other device'? Maybe >I'm >thinking too SMMU-StreamID, which can be varying length, depending on >subsystem support. I see what appears to be sid+pasid calls to do the AS >lookups; >hmm, so maybe this is the generalized BDF+pasid AS lookup? if so, maybe a >better description stating this transition to a wider stream-id would set the >code >context better. Not quite get.. >As for the rest of the (400 intel-iommu) code, I'm not that in-depth in >intel-iommu >to determine if its all correct or not. > >Patch 14: Define PGTT; the second paragraph seem self-contradicting -- it says >it >uses a 2-stage page table in each case, but it implies it should be different. > At 580 >lines of code changes, you win! ;-) The host side's using nested or only stage-2 page table depends on PGTT's setting in guest. > >Patch 15: Read-only and Read/write areas have different IOMMUFDs? is that an >intel-iommu requriement? > At least this intel-iommu-errata code is only in hw/i386/<> modules. No, if ERRATA_772415, read-only areas should not be mapped, so we allocate a new VTDIOASContainer to hold only read/write areas mapping. We can use same IOMMUFDs for different VTDIOASContainer. > >Patch 16: Looks reasonable. What does the 'SI' mean after "CACHE_DEV", >"CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation? VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation', VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'. Thanks Zhenzhong