Hi Eric,

On Wed, Nov 05, 2025 at 08:47:56AM +0100, Eric Auger wrote:
> > We aligned with Intel previously on this system address space.
> > You might know these very well already, but here is the breakdown:
> >
> > 1. VFIO core has a container that manages an HWPT. By default, it
> >    allocates a stage-1 normal HWPT, unless the vIOMMU requests a

> You may clarify that this stage-1 normal HWPT is used to map GPA to
> HPA (so it effectively implements stage 2).

Function-wise, that would work. But it is not as clean as creating
an S2 nesting parent hwpt from the beginning, right?
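
Something like this, sketched against the iommufd uAPI (error
handling and the QEMU plumbing omitted; variable names are
placeholders):

    /* Core allocates the stage-2 nesting parent HWPT up front, out
     * of the IOAS that carries the GPA->HPA mappings, so the VFIO
     * listener can populate stage-2 from day one.
     * (Uses <linux/iommufd.h>.) */
    struct iommu_hwpt_alloc alloc = {
        .size   = sizeof(alloc),
        .flags  = IOMMU_HWPT_ALLOC_NEST_PARENT,
        .dev_id = dev_id,     /* device bound to the iommufd */
        .pt_id  = ioas_id,    /* IOAS holding the GPA mappings */
    };

    ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc);
    s2_hwpt_id = alloc.out_hwpt_id;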

> >    nesting parent HWPT for accelerated cases.
> > 2. VFIO core adds a listener for that HWPT and sets up a handler,
> >    vfio_container_region_add(), where it checks whether the memory
> >    region is an IOMMU region or not.
> >    a. In the !IOMMU AS case (i.e. the system address space), it
> >       treats the address space as a RAM region and handles all
> >       stage-2 mappings for the core-allocated nesting parent HWPT.
> >    b. In the IOMMU AS case (i.e. a translating address space), it
> >       sets up the IOTLB notifier and translation replay, bypassing
> >       the listener for the RAM region.
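
For reference, that check looks roughly like this (heavily
simplified from hw/vfio; treat names and signatures as approximate):

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        VFIOContainerBase *bcontainer =
            container_of(listener, VFIOContainerBase, listener);

        if (memory_region_is_iommu(section->mr)) {
            /* 2.b: register the IOTLB notifier on section->mr and
             * replay translations; skip the RAM mapping below. */
            return;
        }

        /* 2.a: a RAM section becomes a GPA->HVA mapping, i.e. the
         * stage-2 mapping when the container is backed by a nesting
         * parent HWPT. */
        vfio_container_dma_map(bcontainer,
                               section->offset_within_address_space,
                               int128_get64(section->size),
                               memory_region_get_ram_ptr(section->mr) +
                                   section->offset_within_region,
                               section->readonly);
    }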

> Yes, S1+S2 are combined through vfio_iommu_map_notify().

But that map/unmap notifier is useless in the accelerated mode:
we don't need that emulated translation code (MSI is likely to
bypass translation as well), and we don't need the emulated IOTLB
either, since there is no software page table walk.

Also, S1 and S2 are separated following the iommufd design. In
this regard, letting the core manage the S2 hwpt and its mappings
while the vIOMMU handles the S1 hwpt allocation/attach/invalidation
looks much cleaner.
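
E.g. the vIOMMU side would then boil down to something like this (a
sketch; IIRC the final uAPI reaches the S2 parent through a vIOMMU
object on ARM, but the idea is the same):

    /* vIOMMU allocates the nested S1 HWPT with the core-managed S2
     * nesting parent as its parent, feeding in the guest STE. */
    struct iommu_hwpt_arm_smmuv3 ste_data = {
        .ste = { ste_word0, ste_word1 },   /* from the guest STE */
    };
    struct iommu_hwpt_alloc alloc = {
        .size      = sizeof(alloc),
        .dev_id    = dev_id,
        .pt_id     = s2_hwpt_id,           /* core-managed S2 parent */
        .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
        .data_len  = sizeof(ste_data),
        .data_uptr = (uintptr_t)&ste_data,
    };

    ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc);
    /* Guest TLBI commands then get forwarded via
     * IOMMU_HWPT_INVALIDATE; no emulated IOTLB in between. */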

> > In an accelerated case, we need the stage-2 mappings to match the
> > nesting parent HWPT. So, returning the system address space (or an
> > alias of it) tells the VFIO core to take the 2.a path.
> >
> > If we took the 2.b path by returning the IOMMU AS in
> > smmu_find_add_as, the VFIO core would no longer listen to the RAM
> > region for us, i.e. no stage-2 HWPT nor mappings. The vIOMMU would
> > have to allocate a nesting

> except if you change VFIO's common.c as I did in the past to force
> the S2 mapping in the nested config.
> See
> https://lore.kernel.org/all/[email protected]/
> and vfio_prereg_listener()

Yeah, I remember that. But that's somewhat duplicated, IMHO: the
VFIO core already registers a listener on guest RAM for the system
address space, so adding another vfio_prereg_listener on top does
not feel optimal.
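
With the system AS returned, it all boils down to the single
registration the core does anyway (a sketch; field names
approximate -- the prereg one is from your old series, IIRC):

    /* What the core already does at container setup: */
    memory_listener_register(&bcontainer->listener,
                             bcontainer->space->as); /* == system AS */

    /* vs. the prereg approach, which added a second listener over
     * the same guest RAM (field name from the old series): */
    memory_listener_register(&container->prereg_listener,
                             &address_space_memory);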

> Again, I am not saying this is the right way to do it, but using the
> system address space is not the "only" implementation choice, I think.

Oh, I don't mean it's the "only" way either. Sorry I did not make
that clear.

I studied your vfio_prereg_listener approach and Intel's approach
using the system address space, and settled on this "cleaner" way
that works for both architectures.

> and it needs to be properly justified, especially as it has at least
> two side effects:
> - somehow abusing the semantics of the returned address space,
> pretending there is no IOMMU translation in place, and

Perhaps we should say "there is no emulated translation" :)

> - also impacting the way MSIs are handled (introduction of a new
> PCIIOMMUOps).

That is a solid point. Yet I think it's less confusing now per
Jason's remarks -- we will bypass the translation pathway for
MSI in accelerated mode.
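
So the new hook can simply hand the MSI path the untranslated AS in
that mode, e.g. (the callback name is illustrative, not the actual
patch):

    /* Illustrative PCIIOMMUOps hook: in accelerated mode MSI writes
     * are not walked through the emulated SMMU; HW (e.g. the ITS
     * path) deals with them. */
    static AddressSpace *smmu_accel_get_msi_as(PCIBus *bus,
                                               void *opaque, int devfn)
    {
        return &address_space_memory;  /* bypass emulated translation */
    }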

> I think this kind of explanation is absolutely needed in the commit
> message, for reviewers to understand the design choice.

Sure. My bad that I didn't explain it well in the first place.

Thanks
Nicolin
