Hi Nicolin,

On 11/4/25 6:47 PM, Nicolin Chen wrote:
> On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
>>>>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>>>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
>>>>>>> translated by the IOMMU. In nested mode, this translation happens in
>>>>>>> two stages (gIOVA → gPA → ITS page).
>>>>>>>
>>>>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
>>>>>>> get_address_space() returns the system address space so that VFIO
>>>>>>> can set up stage-2 mappings for the system address space.
>>>>>> Sorry, but I still don't get the above. Can you explain (most probably
>>>>>> again) why it is a requirement to return the system AS so that VFIO
>>>>>> can set up stage-2 mappings for the system address space? I am sorry for
>>>>>> insisting (at the risk of being stubborn or dumb), but I fail to
>>>>>> understand the requirement. As far as I remember, the way I integrated
>>>>>> it back then did not require that change:
>>>>>> https://lore.kernel.org/all/20210411120912.15770-1-
>>>>>> [email protected]/
>>>>>> I used a vfio_prereg_listener to force the S2 mapping.
>>>>> Yes I remember that.
>>>>>
>>>>>> What has changed that now forces us into this gymnastics?
>>>>> This approach achieves the same outcome, but through a
>>>>> different mechanism. Returning the system address space
>>>>> here ensures that VFIO sets up the Stage-2 mappings for
>>>>> devices behind the accelerated SMMUv3.
>>>>>
>>>>> I think, this makes sense because, in the accelerated case, the
>>>>> device is no longer managed by QEMU’s SMMUv3 model. The
>>>> On the other hand, as we discussed on v4, by returning the system AS you
>>>> pretend there is no translation in place, which is not true. Now we use
>>>> an alias for it, but that has not really removed its usage. It also
>>>> forces us to hack around the MSI mapping and introduce a new
>>>> PCIIOMMUOps. Have you assessed the feasibility of using
>>>> vfio_prereg_listener to force the S2 mapping? Is it simply not relevant
>>>> anymore, or could it be used with the iommufd backend integration as
>>>> well?
>>>>
>>>> Eric
>>> IIUC, the prereg_listener mechanism just enables us to set up the S2
>>> mappings. For MSI, in your version, I see that smmu_find_add_as()
>>> always returns the IOMMU AS. How is that supposed to work if the guest
>>> has an S1-bypass STE for the device?
>> I need to delve into it again as I forgot the details. Will come back to
>> you ...
> We previously aligned with Intel about this system address space.
> You might know these very well, but here is the breakdown:
>
> 1. VFIO core has a container that manages an HWPT. By default, it
>    allocates a stage-1 normal HWPT, unless vIOMMU requests for a
You might clarify that this stage-1 "normal" HWPT is used to map GPA to HPA
(so it effectively implements stage 2).
>    nesting parent HWPT for accelerated cases.
> 2. VFIO core adds a listener for that HWPT and sets up a handler,
>    vfio_container_region_add(), where it checks whether the memory
>    region is an IOMMU region or not.
>    a. In case of a !IOMMU AS (i.e. the system address space), it
>       treats the address space as a RAM region and handles all
>       stage-2 mappings for the core-allocated nesting parent HWPT.
>    b. In case of an IOMMU AS (i.e. a translation type), it sets up
>       the IOTLB notifier and translation replay while bypassing
>       the listener for the RAM region.
Yes, S1+S2 are combined through vfio_iommu_map_notify().
>
> In an accelerated case, we need stage-2 mappings to match with the
> nesting parent HWPT. So, returning system address space or an alias
> of that notifies the vfio core to take the 2.a path.
>
> If we take the 2.b path by returning an IOMMU AS in smmu_find_add_as(),
> the VFIO core would no longer listen to the RAM region for us, i.e. no
> stage-2 HWPT nor mappings. The vIOMMU would have to allocate a nesting
Except if you change VFIO common.c, as I did in the past, to force the S2
mapping in the nested config. See
https://lore.kernel.org/all/[email protected]/
and vfio_prereg_listener().
Again, I am not saying this is the right way to do it, but using the system
address space is not the "only" implementation choice, I think, and it needs
to be properly justified, especially as it has at least two side effects:
- it somehow abuses the semantics of the returned address space by
pretending there is no IOMMU translation in place, and
- it also impacts the way MSIs are handled (introducing a new
PCIIOMMUOps).
I think this kind of explanation is absolutely needed in the commit
message for reviewers to understand the design choice.

Eric
> parent and manage the stage-2 mappings by adding a listener in its
> own code, which would largely duplicate the core code.
>
> -------------- so far this works for Intel and ARM--------------
>
> 3. On ARM, a vPCI device is programmed with a gIOVA, so KVM has to
>    follow what the vPCI device was told in order to inject vIRQs.
>    This requires a translation in the nested stage-1 address space.
>    Note that the vSMMU in this case doesn't manage translation, as
>    it doesn't need to. But there is no other sane way for KVM to
>    know the vITS page corresponding to the given gIOVA. So, we
>    invented the get_msi_address_space op.
>
> (3) makes sense because there is a complication in that MSI does a
> two-stage translation on ARM, and KVM must follow the stage-1
> input address, leaving us no choice but to have two address spaces.
>
> Thanks
> Nicolin
>

