unbind guest page table to host

Nicolin Chen Fri, 23 May 2025 14:18:52 -0700

Hey,

Thanks for the reply.

Just want to say that I am asking a lot to understand why VT-d is
different than ARM, so as to decide whether ARM should follow VT-d
implementing a separate listener or just use the VFIO listener.

On Fri, May 23, 2025 at 02:22:15PM +0800, Yi Liu wrote:
> Hey Nic,
> 
> On 2025/5/22 06:49, Nicolin Chen wrote:
> > On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
> > > +static const MemoryListener iommufd_s2domain_memory_listener = {
> > > +    .name = "iommufd_s2domain",
> > > +    .priority = 1000,
> > > +    .region_add = iommufd_listener_region_add_s2domain,
> > > +    .region_del = iommufd_listener_region_del_s2domain,
> > > +};
> > 
> > Would you mind elaborating When and how vtd does all S2 mappings?
> > 
> > On ARM, the default vfio_memory_listener could capture the entire
> > guest RAM and add to the address space. So what we do is basically
> > reusing the vfio_memory_listener:
> > https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.th...@huawei.com/
> 
> in concept yes, all the guest ram. but due to an errata, we need
> to skip the RO mappings.

Mind elaborating what are RO mappings? Can those be possible within
the range of the RAM?

> > The thing is that when a VFIO device is attached to the container
> > upon a nesting configuration, the ->get_address_space op should
> > return the system address space as S1 nested HWPT isn't allocated
> > yet. Then all the iommu as routines in vfio_listener_region_add()
> > would be skipped, ending up with mapping the guest RAM in S2 HWPT
> > correctly. Not until the S1 nested HWPT is allocated by the guest
> > OS (after guest boots), can the ->get_address_space op return the
> > iommu address space.
> 
> This seems a bit different between ARM and VT-d emulation. The VT-d
> emulation code returns the iommu address space regardless of what
> translation mode guest configured. But the MR of the address space
> has two overlapped subregions, one is nodmar, another one is iommu.
> As the naming shows, the nodmar is aliased to the system MR.

OK. But why two overlapped subregions v.s. two separate two ASs?

> And before
> the guest enables iommu and set PGTT to a non-PT mode (e.g. S1 or S2),
> the effective MR alias is the nodmar, hence the mapping this address
> space holds are the GPA mappings in the beginning.

I think this is same on ARM, where get_address_space() may return
system address space. And for VT-d, it actually returns the range
of the system address space (just though a sub MR of an iommu AS),
right? 

> If guest set PGTT to S2,
> then the iommu MR is enabled, hence the mapping is gIOVA mappings
> accordingly. So in VT-d emulation, the address space switch is more the MR
> alias switching.

Zhenzhong said that there is no shadow page table for the nesting
setup, i.e. gIOVA=>gPA mappings are entirely done by the guest OS.

Then, why does VT-d need to switch to the iommu MR here?

> In this series, we mainly want to support S1 translation type for guest.
> And it is based on nested translation, which needs a S2 domain that holds
> the GPA mappings. Besides S1 translation type, PT is also supported. Both
> the two types need a S2 domain which already holds GPA mappings. So we have
> this internal listener.

Hmm, the reasoning to the last "so" doesn't sound enough. The VFIO
listener could do the same...

> Also, we want to skip RO mappings on S2, so that's
> another reason for it.  @Zhenzhong, perhaps, it can be described in the
> commit message why an internal listener is introduced.

OK. I think that can be a good reason to have an internal listener,
only if VFIO can't skip the RO mappings.

> > So the second question is:
> > Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
> 
> yes based on the current design. when guest GPTT==PT, attach device
> to S2 hwpt, when it goes to S1, then attach it to a S1 hwpt whose
> parent is the aforementioned S2 hwpt. This S2 hwpt is always there
> for use.

ARM is doing the same thing. And the exact point "this S2 hwpt is
always there for use" has been telling me that the device can just
stay at the S2 address space (system), since the guest kernel will
take care of the S1 address space (iommu).

Overall, the questions here have been two-fold:

1.Why does VT-d need an internal listener?

  I can see the (only) reason is for the RO mappings.

  Yet, Is there anything that we can do to the VFIO listener to
  bypass these RO mappings?

2.Why not return the system AS all the time when nesting is on?
  Why switch to the iommu AS when device attaches to S1 HWPT?

  For ARM, MSI requires a translation so it has to; but MSI on
  VT-d doesn't. So, I couldn't see why VT-d needs to return the
  iommu AS via get_address_space().

  However, combining the question-1, my gut feeling is that, VT-d
  needs to skip RO mappings for errata, while the VFIO listener
  can't do that. So, VT-d has to create its own listener. And to
  avoid duplicated mappings on the same address space, it has to
  bypass the VFIO listener by working it around with an IOMMU AS.
  Is this correct?

Thanks
Nic

Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host

Reply via email to