On Tue, 2014-07-01 at 19:04 +0100, Will Deacon wrote:
> Hi Alex,
>
> Thanks for having a look.
>
> On Tue, Jul 01, 2014 at 06:42:33PM +0100, Alex Williamson wrote:
> > On Tue, 2014-07-01 at 17:10 +0100, Will Deacon wrote:
> > > Some IOMMUs, such as the ARM SMMU, support two stages of translation.
> > > The idea behind such a scheme is to allow a guest operating system to
> > > use the IOMMU for DMA mappings in the first stage of translation, with
> > > the hypervisor then installing mappings in the second stage to provide
> > > isolation of the DMA to the physical range assigned to that virtual
> > > machine.
> > >
> > > In order to allow IOMMU domains to be allocated for second-stage
> > > translation, this patch extends iommu_domain_alloc (and the associated
> > > ->domain_init callback on struct iommu) to take a type parameter
> > > indicating the intended purpose for the domain. The only supported types
> > > at present are IOMMU_DOMAIN_DMA (i.e. what we have at the moment) and
> > > IOMMU_DOMAIN_HYP, which instructs the backend driver to allocate and
> > > initialise a second-stage domain, if possible.
> > >
> > > All IOMMU drivers are updated to take the new type parameter, but it is
> > > ignored at present. All callers of iommu_domain_alloc are also updated
> > > to pass IOMMU_DOMAIN_DMA as the type parameter, apart from
> > > kvm_iommu_map_guest, which passes the new IOMMU_DOMAIN_HYP flag.
> > >
> > > Finally, a new IOMMU capability, IOMMU_CAP_HYP_MAPPING, is added so that
> > > it is possible to check whether or not a domain is able to make use of
> > > nested translation.
> >
> > Why is this necessarily related to HYPervisor use? It seems like this
> > new domain type is effectively just a normal domain that supports some
> > sort of fault handler that can call out to attempt to create missing
> > mappings.
>
> Not quite. The idea of this domain is that it provides isolation for a
> guest, so I'd actually expect these domains to contain pinned mappings most
> of the time (handling guest faults in the hypervisor is pretty error-prone).
Hmm, that's exactly what an existing IOMMU domain does and how we use it
through QEMU and VFIO today.
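For reference, the existing flow is roughly the following (sketch only;
pdev, iova, hpa and size are placeholders, and error handling and the page
pinning step are omitted):

        struct iommu_domain *domain = iommu_domain_alloc(&pci_bus_type);

        iommu_attach_device(domain, &pdev->dev);

        /* map the (pinned) guest memory so the device's DMA is isolated */
        iommu_map(domain, iova, hpa, size, IOMMU_READ | IOMMU_WRITE);

i.e. pinned, static mappings that isolate guest DMA are already the normal
case.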
> Perhaps if I explain how the ARM SMMU works, that might help (and if it
> doesn't, please reply saying so :). The ARM SMMU supports two stages of
> translation:
>
> Stage-1: Guest VA (VA) -> Guest PA (IPA, or intermediate physical address)
> Stage-2: IPA -> Host Physical Address (PA)
>
> These can be glued together to form nested translation, where an incoming VA
> is translated through both stages to get a PA. Page table walks triggered at
> stage-1 expect to see IPAs for the table addresses.
>
> An important thing to note here is that the hardware is configured
> differently at each stage; the page table formats themselves are slightly
> different (e.g. restricted permissions at stage-2) and certain hardware
> contexts are only capable of stage-2 translation.
>
> The way this is supposed to work is that the KVM host would install the VFIO
> DMA mapping (ok, now I see why you don't like the name) at stage-2. This
> allows the SMMU driver to allocate a corresponding stage-1 context for the
> mapping and expose that directly to the guest as part of a virtual,
> stage-1-only SMMU. Then the guest can install its own SMMU mappings at
> stage-1 for
> contiguous DMA (in the guest VA space) without any knowledge of the hypervisor
> mapping.
>
> To do this successfully, we need to communicate the intention of the mapping
> to the SMMU driver (i.e. stage-1 vs stage-2) at domain initialisation time.
> I could just add ARM SMMU-specific properties there, but I thought this
> might potentially be useful to others.
If I understand correctly, this is just nesting (two-dimensional paging)
support, and you need the lowest-level translation to be set up differently
because the context entries for stage-1 vs. stage-2 page tables are
different, right? Is it then also the case that you can support a
traditional one-dimensional configuration where the SMMU is not exposed to
the guest, presumably with a QEMU switch to select which configuration is
used for a given guest?
The code in 3/3 identifies this as a nesting-capable domain, so I'm not
sure why we'd want to assume it's for hypervisor use when a hypervisor can
use either type.
> > IOMMUs supporting PCI PRI (Page Request Interface) could
> > potentially make use of something like that on bare metal or under
> > hypervisor control. If that's true, then could this be some sort of
> > iommu_domain_set_l2_handler() that happens after the domain is
> > allocated?
>
> I'm not sure that's what I was aiming for... see above.
>
> > For this patch, I don't understand why legacy KVM assignment would
> > allocate a HYP domain while VFIO would use a DMA domain. It seems like
> > you're just counting on x86 never making the distinction between the
> > two.
>
> That's true, but I was also trying to indicate the intention of the mapping
> so that other IOMMUs could potentially make use of the flags.
I find that defining an API around intentions is rarely a good idea; stick
with "what does this actually do". In this case, I think we either want an
interface that specifies at allocation time that the domain is nesting
capable, or we want to be able to set a nesting attribute on a domain after
allocation, within the small window before other operations have been
performed on it.
If we go the route of defining it at allocation time, I'd probably think
about adding a new interface, something like:

        iommu_domain_alloc_with_features(struct bus_type *bus, u32 features)

where 'features' is a bitmap, e.g.:

        #define IOMMU_DOMAIN_FEATURE_NESTING    (1U << 0)

iommu_domain_alloc(struct bus_type *bus) would then just be a wrapper with
features = 0. If an IOMMU driver doesn't support a requested feature, it
should fail the allocation, so we don't end up with cases like KVM
requesting a "HYP" domain when it doesn't need it and the IOMMU drivers
don't support it.
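As a sketch of what I mean (just illustrative, none of this exists today):

        /* existing callers keep working, requesting no optional features */
        static inline struct iommu_domain *iommu_domain_alloc(struct bus_type *bus)
        {
                return iommu_domain_alloc_with_features(bus, 0);
        }

and a driver that doesn't recognise a bit in 'features' should fail the
allocation rather than silently ignoring it.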
It would be a lot less intrusive if we could use iommu_domain_set_attr()
to enable nesting after allocation, though. That call could return an error
if made after the point at which nesting can no longer be easily enabled.
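Something along these lines (sketch only; DOMAIN_ATTR_NESTING is the
attribute I'm proposing below, it doesn't exist yet):

        struct iommu_domain *domain;
        int nesting = 1;
        int ret;

        domain = iommu_domain_alloc(bus);
        if (!domain)
                return -ENOMEM;

        /* must happen before any attach/map for drivers that care */
        ret = iommu_domain_set_attr(domain, DOMAIN_ATTR_NESTING, &nesting);
        if (ret) {
                /* driver can't do nesting, or it's too late to enable it */
                iommu_domain_free(domain);
                return ret;
        }

Callers that don't care about nesting (legacy KVM assignment, for example)
would simply never issue the set_attr call.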
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -49,6 +49,10 @@ struct iommu_domain_geometry {
> > > bool force_aperture; /* DMA only allowed in mappable range? */
> > > };
> > >
> > > +/* iommu domain types */
> > > +#define IOMMU_DOMAIN_DMA 0x0
> > > +#define IOMMU_DOMAIN_HYP 0x1
> > > +
> > > struct iommu_domain {
> > > struct iommu_ops *ops;
> > > void *priv;
> > > @@ -59,6 +63,7 @@ struct iommu_domain {
> > >
> > > #define IOMMU_CAP_CACHE_COHERENCY 0x1
> > > #define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
> > > +#define IOMMU_CAP_HYP_MAPPING 0x3 /* isolates guest DMA */
> >
> > This makes no sense, it's exactly what we do with a "DMA" domain. I
> > think the code needs to focus on what is really different about this
> > domain, not what is the expected use case. Thanks,
>
> The use-case is certainly relevant, though. I can do device passthrough
> with a stage-1 mapping for example, but you wouldn't then be able to
> instantiate a virtual SMMU interface in the guest.
I think you've switched the definition of stage-1 here to what we consider
a normal IOMMU domain mapping today, so I'd suggest that a stage-1 vs.
stage-2 naming scheme, where which is which depends on the type of domain
allocated, is far too confusing.
> I could rename these IOMMU_CAP_STAGE{1,2}, but that then sounds very
> ARM-specific. What do you think?
Very much so; how about IOMMU_CAP_NESTING and DOMAIN_ATTR_NESTING? Thanks,
Alex