On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
> >>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> >>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> >>>>> translated by the IOMMU. In nested mode, this translation happens in
> >>>>> two stages (gIOVA → gPA → ITS page).
> >>>>>
> >>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> >>>>> get_address_space() returns the system address space so that VFIO
> >>>>> can setup stage-2 mappings for system address space.
> >>>> Sorry but I still don't catch the above. Can you explain (most probably
> >>>> again) why this is a requirement to return the system as so that VFIO
> >>>> can setup stage-2 mappings for system address space. I am sorry for
> >>>> insisting (at the risk of being stubborn or dumb) but I fail to
> >>>> understand the requirement. As far as I remember the way I integrated it
> >>>> at the old times did not require that change:
> >>>> https://lore.kernel.org/all/20210411120912.15770-1-
> >>>> [email protected]/
> >>>> I used a vfio_prereg_listener to force the S2 mapping.
> >>> Yes I remember that.
> >>>
> >>>> What has changed that forces us now to have this gym
> >>> This approach achieves the same outcome, but through a
> >>> different mechanism. Returning the system address space
> >>> here ensures that VFIO sets up the Stage-2 mappings for
> >>> devices behind the accelerated SMMUv3.
> >>>
> >>> I think, this makes sense because, in the accelerated case, the
> >>> device is no longer managed by QEMU’s SMMUv3 model. The
> >> On the other hand, as we discussed on v4 by returning system as you
> >> pretend there is no translation in place which is not true. Now we use
> >> an alias for it but it has not really removed its usage. Also it forces
> >> us to hack around the MSI mapping and introduce new PCIIOMMUOps.
> >> Have you assessed the feasibility of using vfio_prereg_listener to
> >> force the
> >> S2 mapping. Is it simply not relevant anymore or could it be used also
> >> with the iommufd be integration? Eric
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest
> > has s1 bypass mode STE for the device?
>
> I need to delve into it again as I forgot the details. Will come back to
> you ...

We aligned with Intel earlier on this use of the system address space.
You may know all of this very well already, but here is the breakdown:

1. VFIO core has a container that manages an HWPT. By default, it
   allocates a stage-1 normal HWPT, unless the vIOMMU requests a
   nesting parent HWPT for accelerated cases.
2. VFIO core adds a listener for that HWPT and sets up a handler,
   vfio_container_region_add(), which checks whether the memory
   region is an IOMMU region or not.
   a. In the !IOMMU as case (i.e. system address space), it treats
      the address space as a RAM region, and handles all stage-2
      mappings for the core-allocated nesting parent HWPT.
   b. In the IOMMU as case (i.e. a translation type), it sets up
      the IOTLB notifier and translation replay while bypassing
      the listener for the RAM region.

In an accelerated case, we need stage-2 mappings that match the
nesting parent HWPT. So, returning the system address space, or an
alias of it, tells the VFIO core to take the 2.a path.

If we took the 2.b path by returning the IOMMU as in
smmu_find_add_as(), the VFIO core would no longer listen to the RAM
region for us, i.e. no stage-2 HWPT and no mappings. The vIOMMU would
have to allocate a nesting parent and manage the stage-2 mappings by
adding a listener in its own code, which would largely duplicate the
core code.
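
To make the 2.a/2.b split concrete, here is a minimal standalone
sketch of the branch. It is not the QEMU code: SketchRegion,
SketchPath and sketch_region_add() are hypothetical names made up for
illustration, and the only thing modelled is the "is this an IOMMU
region or not" check described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory region; only the one property
 * the listener branches on is modelled here. */
typedef struct SketchRegion {
    const char *name;
    bool is_iommu;   /* true: translation-type region, false: RAM-like */
} SketchRegion;

typedef enum SketchPath {
    PATH_2A_MAP_STAGE2,      /* map RAM into the nesting parent HWPT   */
    PATH_2B_IOTLB_NOTIFIER   /* notifier + replay, RAM listener skipped */
} SketchPath;

/* Models the decision described for vfio_container_region_add(). */
static SketchPath sketch_region_add(const SketchRegion *mr)
{
    assert(mr != NULL);
    return mr->is_iommu ? PATH_2B_IOTLB_NOTIFIER : PATH_2A_MAP_STAGE2;
}
```

With this model, a region coming from the system address space (or an
alias of it) lands on the 2.a path, which is the behaviour the
accelerated case relies on.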

-------------- so far this works for Intel and ARM --------------

3. On ARM, the vPCI device is programmed with a gIOVA, so KVM has to
   follow what the vPCI device is told in order to inject vIRQs. This
   requires a translation in the nested stage-1 address space. Note
   that the vSMMU doesn't manage translation in this case, as it
   doesn't need to. But there is no other sane way for KVM to know
   the vITS page corresponding to the given gIOVA. So, we invented
   the get_msi_address_space op.

(3) makes sense because MSI on ARM is complicated by a 2-stage
translation, and KVM must follow the stage-1 input address, leaving
us no choice but to have two address spaces.
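
As a rough illustration of "two address spaces", the sketch below
shows one device exposing a different address space per purpose. Only
the op names get_address_space and get_msi_address_space come from
this thread; the struct layouts, signatures and the accel_* helpers
are assumptions for illustration and do not match the real
PCIIOMMUOps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address-space handle. */
typedef struct SketchAS { const char *name; } SketchAS;

/* Hypothetical per-device state for an accelerated vSMMU. */
typedef struct SketchDev {
    SketchAS *system_as;  /* RAM view: VFIO builds stage-2 from this */
    SketchAS *iommu_as;   /* stage-1 view: resolves gIOVA doorbells  */
} SketchDev;

/* Assumed shape of the two ops; signatures are illustrative only. */
typedef struct SketchIOMMUOps {
    SketchAS *(*get_address_space)(SketchDev *dev);
    SketchAS *(*get_msi_address_space)(SketchDev *dev);
} SketchIOMMUOps;

static SketchAS *accel_get_as(SketchDev *dev)
{
    assert(dev != NULL);
    return dev->system_as;   /* steers VFIO onto the 2.a path */
}

static SketchAS *accel_get_msi_as(SketchDev *dev)
{
    assert(dev != NULL);
    return dev->iommu_as;    /* lets the MSI gIOVA be translated */
}

static const SketchIOMMUOps accel_ops = {
    .get_address_space = accel_get_as,
    .get_msi_address_space = accel_get_msi_as,
};
```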

Thanks
Nicolin