Hi Nicolin,

On 11/4/25 6:47 PM, Nicolin Chen wrote:
> On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
>>>>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>>>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
>>>>>>> translated by the IOMMU. In nested mode, this translation happens in
>>>>>>> two stages (gIOVA → gPA → ITS page).
>>>>>>>
>>>>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
>>>>>>> get_address_space() returns the system address space so that VFIO
>>>>>>> can setup stage-2 mappings for system address space.
>>>>>> Sorry but I still don't catch the above. Can you explain (most probably
>>>>>> again) why this is a requirement to return the system as so that VFIO
>>>>>> can setup stage-2 mappings for system address space. I am sorry for
>>>>>> insisting (at the risk of being stubborn or dumb) but I fail to
>>>>>> understand the requirement. As far as I remember the way I integrated it
>>>>>> at the old times did not require that change:
>>>>>> https://lore.kernel.org/all/20210411120912.15770-1-
>>>>>> [email protected]/
>>>>>> I used a vfio_prereg_listener to force the S2 mapping.
>>>>> Yes I remember that.
>>>>>
>>>>>> What has changed that forces us now to have this gym
>>>>> This approach achieves the same outcome, but through a
>>>>> different mechanism. Returning the system address space
>>>>> here ensures that VFIO sets up the Stage-2 mappings for
>>>>> devices behind the accelerated SMMUv3.
>>>>>
>>>>> I think, this makes sense because, in the accelerated case, the
>>>>> device is no longer managed by QEMU’s SMMUv3 model. The
>>>> On the other hand, as we discussed on v4 by returning system as you
>>>> pretend there is no translation in place which is not true. Now we use
>>>> an alias for it but it has not really removed its usage. Also it forces
>>>> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
>>>> Have you assessed the feasability of using vfio_prereg_listener to
>>>> force the S2 mapping. Is it simply not relevant anymore or could it
>>>> be used also with the iommufd be integration? Eric
>>> IIUC, the prereg_listener mechanism just enables us to setup the s2
>>> mappings. For MSI, In your version, I see that smmu_find_add_as()
>>> always returns IOMMU as. How is that supposed to work if the Guest
>>> has s1 bypass mode STE for the device?
>> I need to delve into it again as I forgot the details. Will come back to
>> you ...
> We aligned with Intel previously about this system address space.
> You might know these very well, yet here are the breakdowns:
>
> 1. VFIO core has a container that manages an HWPT. By default, it
>    allocates a stage-1 normal HWPT, unless vIOMMU requests for a

You may want to specify that this stage-1 normal HWPT is used to map GPA
to HPA (so it eventually implements stage 2).

>    nesting parent HWPT for accelerated cases.
> 2. VFIO core adds a listener for that HWPT and sets up a handler
>    vfio_container_region_add() where it checks the memory region
>    whether it is iommu or not.
>    a. In case of !IOMMU as (i.e. system address space), it treats
>       the address space as a RAM region, and handles all stage-2
>       mappings for the core allocated nesting parent HWPT.
>    b. In case of IOMMU as (i.e. a translation type) it sets up
>       the IOTLB notifier and translation replay while bypassing
>       the listener for RAM region.

Yes, S1+S2 are combined through vfio_iommu_map_notify().

>
> In an accelerated case, we need stage-2 mappings to match with the
> nesting parent HWPT. So, returning system address space or an alias
> of that notifies the vfio core to take the 2.a path.
>
> If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
> VFIO core would no longer listen to the RAM region for us, i.e. no
> stage-2 HWPT nor mappings.
> vIOMMU would have to allocate a nesting

except if you change the VFIO common.c as I did in the past to force the
S2 mapping in the nested config. See
https://lore.kernel.org/all/[email protected]/
and vfio_prereg_listener().

Again, I am not saying this is the right way to do it, but using the
system address space is not the "only" implementation choice I think, and
it needs to be properly justified, especially as it has at least 2 side
effects:
- somehow abusing the semantics of the returned address space and
  pretending there is no IOMMU translation in place, and
- also impacting the way MSIs are handled (introduction of a new
  PCIIOMMUOps).

This kind of explanation you wrote is absolutely needed in the commit msg
for reviewers to understand the design choice, I think.
Eric

> parent and manage the stage-2 mappings by adding a listener in its
> own code, which is largely duplicated with the core code.
>
> -------------- so far this works for Intel and ARM--------------
>
> 3. On ARM, vPCI device is programmed with gIOVA, so KVM has to
>    follow what the vPCI is told to inject vIRQs. This requires
>    a translation at the nested stage-1 address space. Note that
>    vSMMU in this case doesn't manage translation as it doesn't
>    need to. But there is no other sane way for KVM to know the
>    vITS page corresponding to the given gIOVA. So, we invented
>    the get_msi_address_space op.
>
> (3) makes sense because there is a complication in the MSI that
> does a 2-stage translation on ARM and KVM must follow the stage-1
> input address, leaving us no choice to have two address spaces.
>
> Thanks
> Nicolin
>
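
As a rough illustration of the 2.a / 2.b branch discussed in this thread, here is a toy Python model. It is not QEMU code: `AddressSpace`, `VfioPath`, `region_add_path()` and this `get_address_space()` are hypothetical stand-ins, and the only behaviour assumed is what the quoted breakdown states, namely that a non-IOMMU (system) address space leads the VFIO core to handle the stage-2 mappings, while an IOMMU address space leads to the IOTLB notifier and replay path.

```python
from dataclasses import dataclass
from enum import Enum


class VfioPath(Enum):
    """The two listener outcomes named in step 2 of the quoted breakdown."""
    RAM_STAGE2 = "2.a: treat as RAM, map stage 2 into the nesting parent HWPT"
    IOMMU_NOTIFIER = "2.b: set up IOTLB notifier and replay, skip RAM listener"


@dataclass(frozen=True)
class AddressSpace:
    """Toy stand-in for the address space a vIOMMU returns for a device."""
    name: str
    is_iommu: bool  # True for a translating region, False for system memory


def get_address_space(accelerated: bool) -> AddressSpace:
    """Hypothetical vIOMMU callback (assumption, not the real signature).

    Accelerated mode returns the system address space (or an alias of it);
    otherwise the emulated, translating IOMMU address space is returned.
    """
    if accelerated:
        return AddressSpace("system", is_iommu=False)
    return AddressSpace("smmu-iommu", is_iommu=True)


def region_add_path(address_space: AddressSpace) -> VfioPath:
    """Mimic the is-it-an-IOMMU-region check attributed to the VFIO listener."""
    if address_space.is_iommu:
        return VfioPath.IOMMU_NOTIFIER
    return VfioPath.RAM_STAGE2


if __name__ == "__main__":
    for accelerated in (True, False):
        chosen = region_add_path(get_address_space(accelerated))
        print(f"accelerated={accelerated}: {chosen.value}")
```

The model only captures which path is selected. It says nothing about the MSI handling in point 3, nor about the vfio_prereg_listener alternative, which would force the stage-2 mapping even when the IOMMU address space is returned.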
