Re: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-04-13 Thread Jacob Pan
Hi Alex,
Thanks a lot for the feedback, my comments inline.

On Mon, 13 Apr 2020 16:21:29 -0600
Alex Williamson  wrote:

> On Mon, 13 Apr 2020 13:41:57 -0700
> Jacob Pan  wrote:
> 
> > Hi All,
> > 
> > Just a gentle reminder, any feedback on the options I listed below?
> > New ideas will be even better.
> > 
> > Christoph, does the explanation make sense to you? We do have the
> > capability/flag based scheme for IOMMU API extension, the version is
> > mainly used for size lookup. Compatibility checking is another use
> > of the version, it makes checking easy when a vIOMMU is launched.
> > 
> > Thanks,
> > 
> > Jacob
> > 
> > On Thu, 2 Apr 2020 11:36:04 -0700
> > Jacob Pan  wrote:
> >   
> > > On Wed, 1 Apr 2020 05:32:21 +
> > > "Tian, Kevin"  wrote:
> > > 
> > > > > From: Jacob Pan 
> > > > > Sent: Tuesday, March 31, 2020 11:55 PM
> > > > > 
> > > > > On Tue, 31 Mar 2020 06:06:38 +
> > > > > "Tian, Kevin"  wrote:
> > > > > 
> > > > > > > From: Jacob Pan 
> > > > > > > Sent: Tuesday, March 31, 2020 12:08 AM
> > > > > > >
> > > > > > > On Mon, 30 Mar 2020 05:40:40 +
> > > > > > > "Tian, Kevin"  wrote:
> > > > > > >
> > > > > > > > > From: Jacob Pan 
> > > > > > > > > Sent: Saturday, March 28, 2020 7:54 AM
> > > > > > > > >
> > > > > > > > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > > > > > > > Christoph Hellwig  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian,
> > > > > > > > > > Kevin wrote:
> > > > > > > > > > > If those API calls are inter-dependent for
> > > > > > > > > > > composing a feature (e.g. SVA), shouldn't we need
> > > > > > > > > > > a way to check them together before exposing the
> > > > > > > > > > > feature to the guest, e.g. through a
> > > > > > > > > > > iommu_get_uapi_capabilities interface?
> > > > > > > > > >
> > > > > > > > > > Yes, that makes sense.  The important bit is to
> > > > > > > > > > have a capability flags and not version
> > > > > > > > > > numbers.
> > > > > > > > >
> > > > > > > > > The challenge is that there are two consumers in the
> > > > > > > > > kernel for this. 1. VFIO only look for compatibility,
> > > > > > > > > and size of each data struct such that it can
> > > > > > > > > copy_from_user.
> > > > > > > > >
> > > > > > > > > 2. IOMMU driver, the "real consumer" of the content.
> > > > > > > > >
> > > > > > > > > For 2, I agree and we do plan to use the capability
> > > > > > > > > flags to check content and maintain backward
> > > > > > > > > compatibility etc.
> > > > > > > > >
> > > > > > > > > For VFIO, it is difficult to do size look up based on
> > > > > > > > > capability flags.
> > > > > > > >
> > > > > > > > Can you elaborate the difficulty in VFIO? if, as
> > > > > > > > Christoph Hellwig pointed out, version number is
> > > > > > > > already avoided everywhere, it is interesting to know
> > > > > > > > whether this work becomes a real exception or just
> > > > > > > > requires a different mindset. 
> > > > > > > From VFIO p.o.v. the IOMMU UAPI data is opaque, it only
> > > > > > > needs to do two things:
> > > > > > > 1. is the UAPI compatible?
> > > > > > > 2. what is the size to copy?
> > > > > > >
> > > > > > > If you look at the version number, this is really a
> > > > > > > "version as size" lookup, as provided by the helper
> > > > > > > function in this patch. An example can be the newly
> > > > > > > introduced clone3 syscall.
> > > > > > > https://lwn.net/Articles/792628/ In clone3, new version
> > > > > > > must have new size. The slight difference here is that,
> > > > > > > unlike clone3, we have multiple data structures instead
> > > > > > > of a single struct clone_args {}. And each struct has
> > > > > > > flags to enumerate its contents besides size.
> > > > > >
> > > > > > Thanks for providing that link. However clone3 doesn't
> > > > > > include a version field to do "version as size" lookup.
> > > > > > Instead, as you said, it includes a size parameter which
> > > > > > sounds like the option 3 (argsz) listed below.
> > > > > >
> > > > > Right, there is no version in clone3. size = version. I view
> > > > > this as a 1:1 lookup.
> > > > > 
> > > > > > >
> > > > > > > Besides breaching data abstraction, if VFIO has to check
> > > > > > > IOMMU flags to determine the sizes, it has many
> > > > > > > combinations.
> > > > > > >
> > > > > > > We also separate the responsibilities into two parts
> > > > > > > 1. compatibility - version, size by VFIO
> > > > > > > 2. sanity check - capability flags - by IOMMU
> > > > > >
> > > > > > I feel argsz+flags approach can perfectly meet above
> > > > > > requirement. The userspace set the size and flags for
> > > > > > whatever capabilities it uses, and VFIO simply copies the
> > > > > > parameters by size and pass to IOMMU for further sanity
> > > > > > check. Of course the assumption is that we do provide an
> > > > > > interface for 

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, April 14, 2020 11:29 AM
> 
> On Tue, 14 Apr 2020 02:40:58 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson 
> > > Sent: Tuesday, April 14, 2020 3:21 AM
> > >
> > > On Mon, 13 Apr 2020 08:05:33 +
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Monday, April 13, 2020 3:55 PM
> > > > >
> > > > > > From: Raj, Ashok 
> > > > > > Sent: Monday, April 13, 2020 11:11 AM
> > > > > >
> > > > > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > > > > "Raj, Ashok"  wrote:
> > > > > > >
> > > > > > > > Hi Alex
> > > > > > > >
> > > > > > > > + Bjorn
> > > > > > >
> > > > > > >  + Don
> > > > > > >
> > > > > > > > FWIW I can't understand why PCI SIG went different ways with
> ATS,
> > > > > > > > where its enumerated on PF and VF. But for PASID and PRI its
> only
> > > > > > > > in PF.
> > > > > > > >
> > > > > > > > I'm checking with our internal SIG reps to followup on that.
> > > > > > > >
> > > > > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson
> wrote:
> > > > > > > > > > Is there vendor guarantee that hidden registers will locate 
> > > > > > > > > > at
> the
> > > > > > > > > > same offset between PF and VF config space?
> > > > > > > > >
> > > > > > > > > I'm not sure if the spec really precludes hidden registers, 
> > > > > > > > > but
> the
> > > > > > > > > fact that these registers are explicitly outside of the 
> > > > > > > > > capability
> > > > > > > > > chain implies they're only intended for device specific use, 
> > > > > > > > > so
> I'd
> > > say
> > > > > > > > > there are no guarantees about anything related to these
> registers.
> > > > > > > >
> > > > > > > > As you had suggested in the other thread, we could consider
> > > > > > > > using the same offset as in PF, but even that's a better guess
> > > > > > > > still not reliable.
> > > > > > > >
> > > > > > > > The other option is to maybe extend driver ops in the PF to
> expose
> > > > > > > > where the offsets should be. Sort of adding the quirk in the
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > > I'm not sure how prevalent are PASID and PRI in VF devices. If
> SIG is
> > > > > > resisting
> > > > > > > > making VF's first class citizen, we might ask them to add some
> > > verbiage
> > > > > > > > to suggest leave the same offsets as PF open to help emulation
> > > software.
> > > > > > >
> > > > > > > Even if we know where to expose these capabilities on the VF, it's
> not
> > > > > > > clear to me how we can actually virtualize the capability itself. 
> > > > > > >  If
> > > > > > > the spec defines, for example, an enable bit as r/w then software
> that
> > > > > > > interacts with that register expects the bit is settable.  
> > > > > > > There's no
> > > > > > > protocol for "try to set the bit and re-read it to see if the 
> > > > > > > hardware
> > > > > > > accepted it".  Therefore a capability with a fixed enable bit
> > > > > > > representing the state of the PF, not settable by the VF, is
> > > > > > > disingenuous to the spec.
> > > > > >
> > > > > > I think we are all in violent agreement. A lot of times the pci spec
> gets
> > > > > > defined several years ahead of real products and no one
> remembers
> > > > > > the justification on why they restricted things the way they did.
> > > > > >
> > > > > > Maybe someone early product wasn't quite exposing these features
> to
> > > the
> > > > > > VF
> > > > > > and hence the spec is bug compatible :-)
> > > > > >
> > > > > > >
> > > > > > > If what we're trying to do is expose that PASID and PRI are 
> > > > > > > enabled
> on
> > > > > > > the PF to a VF driver, maybe duplicating the PF capabilities on 
> > > > > > > the
> VF
> > > > > > > without the ability to control it is not the right approach.  
> > > > > > > Maybe
> we
> > > > > >
> > > > > > As long as the capability enable is only provided when the PF has
> > > enabled
> > > > > > the feature. Then it seems the hardware seems to do the right thing.
> > > > > >
> > > > > > Assume we expose PASID/PRI only when PF has enabled it. It will be
> the
> > > > > > case since the PF driver needs to exist, and IOMMU would have set
> the
> > > > > > PASID/PRI/ATS on PF.
> > > > > >
> > > > > > If the emulation is purely spoofing the capability. Once vIOMMU
> driver
> > > > > > enables PASID, the context entries for the VF are completely
> > > independent
> > > > > > from the PF context entries.
> > > > > >
> > > > > > vIOMMU would enable PASID, and we just spoof the PASID
> capability.
> > > > > >
> > > > > > If vIOMMU or guest for some reason does disable_pasid(), then the
> > > > > > vIOMMU driver can disaable PASID on the VF context entries. So the
> VF
> > > > > > although the capability is blanket enabled on PF, IOMMU gaurantees
> > > the
> > > > > > transactions are blocked.
> > > > > >
> > > > > >
> > > > > > In the interim, it seems 

Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Alex Williamson
On Tue, 14 Apr 2020 02:40:58 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Tuesday, April 14, 2020 3:21 AM
> > 
> > On Mon, 13 Apr 2020 08:05:33 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Tian, Kevin
> > > > Sent: Monday, April 13, 2020 3:55 PM
> > > >  
> > > > > From: Raj, Ashok 
> > > > > Sent: Monday, April 13, 2020 11:11 AM
> > > > >
> > > > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:  
> > > > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > > > "Raj, Ashok"  wrote:
> > > > > >  
> > > > > > > Hi Alex
> > > > > > >
> > > > > > > + Bjorn  
> > > > > >
> > > > > >  + Don
> > > > > >  
> > > > > > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > > > > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > > > > > in PF.
> > > > > > >
> > > > > > > I'm checking with our internal SIG reps to followup on that.
> > > > > > >
> > > > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:  
> > > > > > > > > Is there vendor guarantee that hidden registers will locate 
> > > > > > > > > at the
> > > > > > > > > same offset between PF and VF config space?  
> > > > > > > >
> > > > > > > > I'm not sure if the spec really precludes hidden registers, but 
> > > > > > > > the
> > > > > > > > fact that these registers are explicitly outside of the 
> > > > > > > > capability
> > > > > > > > chain implies they're only intended for device specific use, so 
> > > > > > > > I'd  
> > say  
> > > > > > > > there are no guarantees about anything related to these 
> > > > > > > > registers.  
> > > > > > >
> > > > > > > As you had suggested in the other thread, we could consider
> > > > > > > using the same offset as in PF, but even that's a better guess
> > > > > > > still not reliable.
> > > > > > >
> > > > > > > The other option is to maybe extend driver ops in the PF to expose
> > > > > > > where the offsets should be. Sort of adding the quirk in the
> > > > > > > implementation.
> > > > > > >
> > > > > > > I'm not sure how prevalent are PASID and PRI in VF devices. If 
> > > > > > > SIG is  
> > > > > resisting  
> > > > > > > making VF's first class citizen, we might ask them to add some  
> > verbiage  
> > > > > > > to suggest leave the same offsets as PF open to help emulation  
> > software.  
> > > > > >
> > > > > > Even if we know where to expose these capabilities on the VF, it's 
> > > > > > not
> > > > > > clear to me how we can actually virtualize the capability itself.  
> > > > > > If
> > > > > > the spec defines, for example, an enable bit as r/w then software 
> > > > > > that
> > > > > > interacts with that register expects the bit is settable.  There's 
> > > > > > no
> > > > > > protocol for "try to set the bit and re-read it to see if the 
> > > > > > hardware
> > > > > > accepted it".  Therefore a capability with a fixed enable bit
> > > > > > representing the state of the PF, not settable by the VF, is
> > > > > > disingenuous to the spec.  
> > > > >
> > > > > I think we are all in violent agreement. A lot of times the pci spec 
> > > > > gets
> > > > > defined several years ahead of real products and no one remembers
> > > > > the justification on why they restricted things the way they did.
> > > > >
> > > > > Maybe someone early product wasn't quite exposing these features to  
> > the  
> > > > > VF
> > > > > and hence the spec is bug compatible :-)
> > > > >  
> > > > > >
> > > > > > If what we're trying to do is expose that PASID and PRI are enabled 
> > > > > > on
> > > > > > the PF to a VF driver, maybe duplicating the PF capabilities on the 
> > > > > > VF
> > > > > > without the ability to control it is not the right approach.  Maybe 
> > > > > > we  
> > > > >
> > > > > As long as the capability enable is only provided when the PF has  
> > enabled  
> > > > > the feature. Then it seems the hardware seems to do the right thing.
> > > > >
> > > > > Assume we expose PASID/PRI only when PF has enabled it. It will be the
> > > > > case since the PF driver needs to exist, and IOMMU would have set the
> > > > > PASID/PRI/ATS on PF.
> > > > >
> > > > > If the emulation is purely spoofing the capability. Once vIOMMU driver
> > > > > enables PASID, the context entries for the VF are completely  
> > independent  
> > > > > from the PF context entries.
> > > > >
> > > > > vIOMMU would enable PASID, and we just spoof the PASID capability.
> > > > >
> > > > > If vIOMMU or guest for some reason does disable_pasid(), then the
> > > > > vIOMMU driver can disaable PASID on the VF context entries. So the VF
> > > > > although the capability is blanket enabled on PF, IOMMU gaurantees  
> > the  
> > > > > transactions are blocked.
> > > > >
> > > > >
> > > > > In the interim, it seems like the intent of the virtual capability
> > > > > can be honored via help from the IOMMU for the controlling aspect..
> > > > >
> > > > > Did i miss anything?  
> > > >
> > > > Above works for emulating the 

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, April 14, 2020 3:21 AM
> 
> On Mon, 13 Apr 2020 08:05:33 +
> "Tian, Kevin"  wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Monday, April 13, 2020 3:55 PM
> > >
> > > > From: Raj, Ashok 
> > > > Sent: Monday, April 13, 2020 11:11 AM
> > > >
> > > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > > "Raj, Ashok"  wrote:
> > > > >
> > > > > > Hi Alex
> > > > > >
> > > > > > + Bjorn
> > > > >
> > > > >  + Don
> > > > >
> > > > > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > > > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > > > > in PF.
> > > > > >
> > > > > > I'm checking with our internal SIG reps to followup on that.
> > > > > >
> > > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > > > > Is there vendor guarantee that hidden registers will locate at 
> > > > > > > > the
> > > > > > > > same offset between PF and VF config space?
> > > > > > >
> > > > > > > I'm not sure if the spec really precludes hidden registers, but 
> > > > > > > the
> > > > > > > fact that these registers are explicitly outside of the capability
> > > > > > > chain implies they're only intended for device specific use, so 
> > > > > > > I'd
> say
> > > > > > > there are no guarantees about anything related to these registers.
> > > > > >
> > > > > > As you had suggested in the other thread, we could consider
> > > > > > using the same offset as in PF, but even that's a better guess
> > > > > > still not reliable.
> > > > > >
> > > > > > The other option is to maybe extend driver ops in the PF to expose
> > > > > > where the offsets should be. Sort of adding the quirk in the
> > > > > > implementation.
> > > > > >
> > > > > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG 
> > > > > > is
> > > > resisting
> > > > > > making VF's first class citizen, we might ask them to add some
> verbiage
> > > > > > to suggest leave the same offsets as PF open to help emulation
> software.
> > > > >
> > > > > Even if we know where to expose these capabilities on the VF, it's not
> > > > > clear to me how we can actually virtualize the capability itself.  If
> > > > > the spec defines, for example, an enable bit as r/w then software that
> > > > > interacts with that register expects the bit is settable.  There's no
> > > > > protocol for "try to set the bit and re-read it to see if the hardware
> > > > > accepted it".  Therefore a capability with a fixed enable bit
> > > > > representing the state of the PF, not settable by the VF, is
> > > > > disingenuous to the spec.
> > > >
> > > > I think we are all in violent agreement. A lot of times the pci spec 
> > > > gets
> > > > defined several years ahead of real products and no one remembers
> > > > the justification on why they restricted things the way they did.
> > > >
> > > > Maybe someone early product wasn't quite exposing these features to
> the
> > > > VF
> > > > and hence the spec is bug compatible :-)
> > > >
> > > > >
> > > > > If what we're trying to do is expose that PASID and PRI are enabled on
> > > > > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > > > > without the ability to control it is not the right approach.  Maybe we
> > > >
> > > > As long as the capability enable is only provided when the PF has
> enabled
> > > > the feature. Then it seems the hardware seems to do the right thing.
> > > >
> > > > Assume we expose PASID/PRI only when PF has enabled it. It will be the
> > > > case since the PF driver needs to exist, and IOMMU would have set the
> > > > PASID/PRI/ATS on PF.
> > > >
> > > > If the emulation is purely spoofing the capability. Once vIOMMU driver
> > > > enables PASID, the context entries for the VF are completely
> independent
> > > > from the PF context entries.
> > > >
> > > > vIOMMU would enable PASID, and we just spoof the PASID capability.
> > > >
> > > > If vIOMMU or guest for some reason does disable_pasid(), then the
> > > > vIOMMU driver can disaable PASID on the VF context entries. So the VF
> > > > although the capability is blanket enabled on PF, IOMMU gaurantees
> the
> > > > transactions are blocked.
> > > >
> > > >
> > > > In the interim, it seems like the intent of the virtual capability
> > > > can be honored via help from the IOMMU for the controlling aspect..
> > > >
> > > > Did i miss anything?
> > >
> > > Above works for emulating the enable bit (under the assumption that
> > > PF driver won't disable pasid when vf is assigned). However, there are
> > > also "Execute permission enable" and "Privileged mode enable" bits in
> > > PASID control registers. I don't know how those bits could be cleanly
> > > emulated when the guest writes a value different from PF's...
> >
> > sent too quick. the IOMMU also includes control bits for allowing/
> > blocking execute requests and supervisor requests. We can rely 

Re: [RFC PATCH] iommu/amd: fix a race in fetch_pte()

2020-04-13 Thread Qian Cai


> On Apr 8, 2020, at 10:19 AM, Joerg Roedel  wrote:
> 
> Hi Qian,
> 
> On Tue, Apr 07, 2020 at 11:36:05AM -0400, Qian Cai wrote:
>> After further testing, the change along is insufficient. What I am chasing 
>> right
>> now is the swap device will go offline after heavy memory pressure below. The
>> symptom is similar to what we have in the commit,
>> 
>> 754265bcab78 (“iommu/amd: Fix race in increase_address_space()”)
>> 
>> Apparently, it is no possible to take the domain->lock in fetch_pte() 
>> because it
>> could sleep.
> 
> Thanks a lot for finding and tracking down another race in the AMD IOMMU
> page-table code.  The domain->lock is a spin-lock and taking it can't
> sleep. But fetch_pte() is a fast-path and must not take any locks.
> 
> I think the best fix is to update the pt_root and mode of the domain
> atomically by storing the mode in the lower 12 bits of pt_root. This way
> they are stored together and can be read/write atomically.

Like this?

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 20cce366e951..b36c6b07cbfd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1397,13 +1397,13 @@ static struct page *free_sub_pt(unsigned long root, int 
mode,
 
 static void free_pagetable(struct protection_domain *domain)
 {
-   unsigned long root = (unsigned long)domain->pt_root;
+   int level = iommu_get_mode(domain->pt_root);
+   unsigned long root = iommu_get_root(domain->pt_root);
struct page *freelist = NULL;
 
-   BUG_ON(domain->mode < PAGE_MODE_NONE ||
-  domain->mode > PAGE_MODE_6_LEVEL);
+   BUG_ON(level < PAGE_MODE_NONE || level > PAGE_MODE_6_LEVEL);
 
-   freelist = free_sub_pt(root, domain->mode, freelist);
+   freelist = free_sub_pt(root, level, freelist);
 
free_page_list(freelist);
 }
@@ -1417,24 +1417,27 @@ static bool increase_address_space(struct 
protection_domain *domain,
   unsigned long address,
   gfp_t gfp)
 {
+   int level;
unsigned long flags;
bool ret = false;
u64 *pte;
 
spin_lock_irqsave(>lock, flags);
 
-   if (address <= PM_LEVEL_SIZE(domain->mode) ||
-   WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
+   level = iommu_get_mode(domain->pt_root);
+
+   if (address <= PM_LEVEL_SIZE(level) ||
+   WARN_ON_ONCE(level == PAGE_MODE_6_LEVEL))
goto out;
 
pte = (void *)get_zeroed_page(gfp);
if (!pte)
goto out;
 
-   *pte = PM_LEVEL_PDE(domain->mode,
-   iommu_virt_to_phys(domain->pt_root));
-   domain->pt_root  = pte;
-   domain->mode+= 1;
+   *pte = PM_LEVEL_PDE(level,
+   iommu_virt_to_phys((void *)iommu_get_root(domain->pt_root)));
+
+   WRITE_ONCE(domain->pt_root, (unsigned long)pte + level + 1);
 
ret = true;
 
@@ -1452,15 +1455,17 @@ static u64 *alloc_pte(struct protection_domain *domain,
  bool *updated)
 {
int level, end_lvl;
-   u64 *pte, *page;
+   u64 *pte, *page, *pt_root, *root;
 
BUG_ON(!is_power_of_2(page_size));
 
-   while (address > PM_LEVEL_SIZE(domain->mode))
+   while (address > PM_LEVEL_SIZE(iommu_get_mode(domain->pt_root)))
*updated = increase_address_space(domain, address, gfp) || 
*updated;
 
-   level   = domain->mode - 1;
-   pte = >pt_root[PM_LEVEL_INDEX(level, address)];
+   pt_root = READ_ONCE(domain->pt_root);
+   root= (void *)iommu_get_root(pt_root);
+   level   = iommu_get_mode(pt_root) - 1;
+   pte = [PM_LEVEL_INDEX(level, address)];
address = PAGE_SIZE_ALIGN(address, page_size);
end_lvl = PAGE_SIZE_LEVEL(page_size);
 
@@ -1536,16 +1541,18 @@ static u64 *fetch_pte(struct protection_domain *domain,
  unsigned long address,
  unsigned long *page_size)
 {
-   int level;
u64 *pte;
+   u64 *pt_root = READ_ONCE(domain->pt_root);
+   u64 *root= (void *)iommu_get_root(pt_root);
+   int level= iommu_get_mode(pt_root);
 
*page_size = 0;
 
-   if (address > PM_LEVEL_SIZE(domain->mode))
+   if (address > PM_LEVEL_SIZE(level))
return NULL;
 
-   level  =  domain->mode - 1;
-   pte= >pt_root[PM_LEVEL_INDEX(level, address)];
+   level--;
+   pte= [PM_LEVEL_INDEX(level, address)];
*page_size =  PTE_LEVEL_PAGE_SIZE(level);
 
while (level > 0) {
@@ -1814,12 +1821,13 @@ static struct protection_domain 
*dma_ops_domain_alloc(void)
if (protection_domain_init(domain))
goto free_domain;
 
-   domain->mode = PAGE_MODE_3_LEVEL;
domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
domain->flags = PD_DMA_OPS_MASK;
if (!domain->pt_root)
goto free_domain;
 
+  

Re: [PATCH 2/2] iommu/arm-smmu: Allow client devices to select direct mapping

2020-04-13 Thread Evan Green
On Wed, Jan 22, 2020 at 3:48 AM Sai Prakash Ranjan
 wrote:
>
> From: Jordan Crouse 
>
> Some client devices want to directly map the IOMMU themselves instead
> of using the DMA domain. Allow those devices to opt in to direct
> mapping by way of a list of compatible strings.
>
> Signed-off-by: Jordan Crouse 
> Co-developed-by: Sai Prakash Ranjan 
> Signed-off-by: Sai Prakash Ranjan 
> ---
>  drivers/iommu/arm-smmu-qcom.c | 39 +++
>  drivers/iommu/arm-smmu.c  |  3 +++
>  drivers/iommu/arm-smmu.h  |  5 +
>  3 files changed, 47 insertions(+)
>
> diff --git a/drivers/iommu/arm-smmu-qcom.c b/drivers/iommu/arm-smmu-qcom.c
> index 64a4ab270ab7..ff746acd1c81 100644
> --- a/drivers/iommu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm-smmu-qcom.c
> @@ -3,6 +3,7 @@
>   * Copyright (c) 2019, The Linux Foundation. All rights reserved.
>   */
>
> +#include 
>  #include 
>
>  #include "arm-smmu.h"
> @@ -11,6 +12,43 @@ struct qcom_smmu {
> struct arm_smmu_device smmu;
>  };
>
> +static const struct arm_smmu_client_match_data qcom_adreno = {
> +   .direct_mapping = true,
> +};
> +
> +static const struct arm_smmu_client_match_data qcom_mdss = {
> +   .direct_mapping = true,

I don't actually see direct_mapping being used. Shouldn't this member
be checked somewhere?

-Evan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] of_device: removed #include that caused a recursion in included headers

2020-04-13 Thread kbuild test robot
Hi Hadar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on stm32/stm32-next]
[also build test ERROR on sunxi/sunxi/for-next tegra/for-next linus/master 
v5.7-rc1 next-20200413]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Hadar-Gat/of_device-removed-include-that-caused-a-recursion-in-included-headers/20200414-032638
base:   https://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32.git 
stm32-next
config: arm-multi_v5_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=9.3.0 make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   drivers/bus/imx-weim.c: In function 'weim_parse_dt':
>> drivers/bus/imx-weim.c:227:9: error: implicit declaration of function 
>> 'of_platform_default_populate' [-Werror=implicit-function-declaration]
 227 |   ret = of_platform_default_populate(pdev->dev.of_node,
 | ^~~~
   cc1: some warnings being treated as errors
--
   drivers/pinctrl/freescale/pinctrl-imx1-core.c: In function 
'imx1_pinctrl_core_probe':
>> drivers/pinctrl/freescale/pinctrl-imx1-core.c:639:8: error: implicit 
>> declaration of function 'of_platform_populate' 
>> [-Werror=implicit-function-declaration]
 639 |  ret = of_platform_populate(pdev->dev.of_node, NULL, NULL, 
>dev);
 |^~~~
   cc1: some warnings being treated as errors

vim +/of_platform_default_populate +227 drivers/bus/imx-weim.c

85bf6d4e4b100e Huang Shijie  2013-05-28  189  
4a92f07816ba30 Sascha Hauer  2019-08-14  190  static int 
weim_parse_dt(struct platform_device *pdev, void __iomem *base)
85bf6d4e4b100e Huang Shijie  2013-05-28  191  {
3f98b6baad63b1 Alexander Shiyan  2013-06-29  192const struct 
of_device_id *of_id = of_match_device(weim_id_table,
3f98b6baad63b1 Alexander Shiyan  2013-06-29  193
   >dev);
3f98b6baad63b1 Alexander Shiyan  2013-06-29  194const struct 
imx_weim_devtype *devtype = of_id->data;
85bf6d4e4b100e Huang Shijie  2013-05-28  195struct device_node 
*child;
52c47b63412b09 Alison Chaiken2015-02-18  196int ret, have_child = 0;
c7995bcb36ef61 Sven Van Asbroeck 2018-12-17  197struct cs_timing_state 
ts = {};
77266e722feabb Sven Van Asbroeck 2019-07-12  198u32 reg;
85bf6d4e4b100e Huang Shijie  2013-05-28  199  
8d9ee21e98205e Shawn Guo 2014-02-11  200if (devtype == 
_weim_devtype) {
8d9ee21e98205e Shawn Guo 2014-02-11  201ret = 
imx_weim_gpr_setup(pdev);
8d9ee21e98205e Shawn Guo 2014-02-11  202if (ret)
8d9ee21e98205e Shawn Guo 2014-02-11  203return 
ret;
8d9ee21e98205e Shawn Guo 2014-02-11  204}
8d9ee21e98205e Shawn Guo 2014-02-11  205  
77266e722feabb Sven Van Asbroeck 2019-07-12  206if 
(of_property_read_bool(pdev->dev.of_node, "fsl,burst-clk-enable")) {
77266e722feabb Sven Van Asbroeck 2019-07-12  207if 
(devtype->wcr_bcm) {
77266e722feabb Sven Van Asbroeck 2019-07-12  208reg = 
readl(base + devtype->wcr_offset);
77266e722feabb Sven Van Asbroeck 2019-07-12  209
writel(reg | devtype->wcr_bcm,
77266e722feabb Sven Van Asbroeck 2019-07-12  210
base + devtype->wcr_offset);
77266e722feabb Sven Van Asbroeck 2019-07-12  211} else {
77266e722feabb Sven Van Asbroeck 2019-07-12  212
dev_err(>dev, "burst clk mode not supported.\n");
77266e722feabb Sven Van Asbroeck 2019-07-12  213return 
-EINVAL;
77266e722feabb Sven Van Asbroeck 2019-07-12  214}
77266e722feabb Sven Van Asbroeck 2019-07-12  215}
77266e722feabb Sven Van Asbroeck 2019-07-12  216  
33b96d2c957921 Fabio Estevam 2016-02-22  217
for_each_available_child_of_node(pdev->dev.of_node, child) {
c7995bcb36ef61 Sven Van Asbroeck 2018-12-17  218ret = 
weim_timing_setup(>dev, child, base, devtype, );
52c47b63412b09 Alison Chaiken2015-02-18  219if (ret)
9c0982d809fd81 Rob Herring   2017-07-18  220
dev_warn(>dev, "%pOF set timing failed.\n",
9c0982d809fd81 Rob Herring   

Re: [PATCH] of_device: removed #include that caused a recursion in included headers

2020-04-13 Thread kbuild test robot
Hi Hadar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on stm32/stm32-next]
[also build test ERROR on linus/master v5.7-rc1 next-20200413]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Hadar-Gat/of_device-removed-include-that-caused-a-recursion-in-included-headers/20200414-032638
base:   https://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32.git 
stm32-next
config: sparc-randconfig-a001-20200413 (attached as .config)
compiler: sparc-linux-gcc (GCC) 9.3.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=9.3.0 make.cross ARCH=sparc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   arch/sparc/mm/iommu.c: In function 'iommu_init':
>> arch/sparc/mm/iommu.c:139:32: error: implicit declaration of function 
>> 'of_find_device_by_node'; did you mean 'bus_find_device_by_fwnode'? 
>> [-Werror=implicit-function-declaration]
 139 |   struct platform_device *op = of_find_device_by_node(dp);
 |^~
 |bus_find_device_by_fwnode
>> arch/sparc/mm/iommu.c:139:32: error: initialization of 'struct 
>> platform_device *' from 'int' makes pointer from integer without a cast 
>> [-Werror=int-conversion]
   cc1: all warnings being treated as errors
--
   arch/sparc/mm/io-unit.c: In function 'iounit_init':
>> arch/sparc/mm/io-unit.c:81:32: error: implicit declaration of function 
>> 'of_find_device_by_node'; did you mean 'bus_find_device_by_fwnode'? 
>> [-Werror=implicit-function-declaration]
  81 |   struct platform_device *op = of_find_device_by_node(dp);
 |^~
 |bus_find_device_by_fwnode
>> arch/sparc/mm/io-unit.c:81:32: error: initialization of 'struct 
>> platform_device *' from 'int' makes pointer from integer without a cast 
>> [-Werror=int-conversion]
   cc1: all warnings being treated as errors

vim +139 arch/sparc/mm/iommu.c

^1da177e4c3f41 Linus Torvalds  2005-04-16  133  
046e26a8ba10b8 David S. Miller 2008-08-27  134  static int __init 
iommu_init(void)
046e26a8ba10b8 David S. Miller 2008-08-27  135  {
046e26a8ba10b8 David S. Miller 2008-08-27  136  struct device_node *dp;
046e26a8ba10b8 David S. Miller 2008-08-27  137  
046e26a8ba10b8 David S. Miller 2008-08-27  138  
for_each_node_by_name(dp, "iommu") {
cd4cd7306a403f Grant Likely2010-07-22 @139  struct 
platform_device *op = of_find_device_by_node(dp);
046e26a8ba10b8 David S. Miller 2008-08-27  140  
046e26a8ba10b8 David S. Miller 2008-08-27  141  
sbus_iommu_init(op);
046e26a8ba10b8 David S. Miller 2008-08-27  142  
of_propagate_archdata(op);
046e26a8ba10b8 David S. Miller 2008-08-27  143  }
046e26a8ba10b8 David S. Miller 2008-08-27  144  
046e26a8ba10b8 David S. Miller 2008-08-27  145  return 0;
046e26a8ba10b8 David S. Miller 2008-08-27  146  }
046e26a8ba10b8 David S. Miller 2008-08-27  147  

:: The code at line 139 was first introduced by commit
:: cd4cd7306a403f62ef3ca783b9d1cf2a03e595ed sparc: remove references to 
of_device and to_of_device

:: TO: Grant Likely 
:: CC: Grant Likely 

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] of_device: removed #include that caused a recursion in included headers

2020-04-13 Thread kbuild test robot
Hi Hadar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on stm32/stm32-next]
[also build test ERROR on sunxi/sunxi/for-next tegra/for-next linus/master 
v5.7-rc1 next-20200413]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Hadar-Gat/of_device-removed-include-that-caused-a-recursion-in-included-headers/20200414-032638
base:   https://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32.git 
stm32-next
config: x86_64-randconfig-h003-20200413 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 
8e2daa0c7f27b5d13b11bff68ae3cd42329abd6c)
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

>> drivers/clk/mediatek/clk-mt7622-aud.c:153:6: error: implicit declaration of 
>> function 'devm_of_platform_populate' 
>> [-Werror,-Wimplicit-function-declaration]
   r = devm_of_platform_populate(>dev);
   ^
   1 error generated.

vim +/devm_of_platform_populate +153 drivers/clk/mediatek/clk-mt7622-aud.c

2fc0a509e4ee85 Sean Wang 2017-10-05  132  
2fc0a509e4ee85 Sean Wang 2017-10-05  133  static int 
clk_mt7622_audiosys_init(struct platform_device *pdev)
2fc0a509e4ee85 Sean Wang 2017-10-05  134  {
2fc0a509e4ee85 Sean Wang 2017-10-05  135struct clk_onecell_data 
*clk_data;
2fc0a509e4ee85 Sean Wang 2017-10-05  136struct device_node *node = 
pdev->dev.of_node;
2fc0a509e4ee85 Sean Wang 2017-10-05  137int r;
2fc0a509e4ee85 Sean Wang 2017-10-05  138  
2fc0a509e4ee85 Sean Wang 2017-10-05  139clk_data = 
mtk_alloc_clk_data(CLK_AUDIO_NR_CLK);
2fc0a509e4ee85 Sean Wang 2017-10-05  140  
2fc0a509e4ee85 Sean Wang 2017-10-05  141mtk_clk_register_gates(node, 
audio_clks, ARRAY_SIZE(audio_clks),
2fc0a509e4ee85 Sean Wang 2017-10-05  142   
clk_data);
2fc0a509e4ee85 Sean Wang 2017-10-05  143  
2fc0a509e4ee85 Sean Wang 2017-10-05  144r = of_clk_add_provider(node, 
of_clk_src_onecell_get, clk_data);
037b21133e5367 Ryder Lee 2018-03-20  145if (r) {
2fc0a509e4ee85 Sean Wang 2017-10-05  146dev_err(>dev,
2fc0a509e4ee85 Sean Wang 2017-10-05  147"could not 
register clock provider: %s: %d\n",
2fc0a509e4ee85 Sean Wang 2017-10-05  148pdev->name, r);
2fc0a509e4ee85 Sean Wang 2017-10-05  149  
037b21133e5367 Ryder Lee 2018-03-20  150goto err_clk_provider;
037b21133e5367 Ryder Lee 2018-03-20  151}
037b21133e5367 Ryder Lee 2018-03-20  152  
037b21133e5367 Ryder Lee 2018-03-20 @153r = 
devm_of_platform_populate(>dev);
037b21133e5367 Ryder Lee 2018-03-20  154if (r)
037b21133e5367 Ryder Lee 2018-03-20  155goto err_plat_populate;
037b21133e5367 Ryder Lee 2018-03-20  156  
037b21133e5367 Ryder Lee 2018-03-20  157return 0;
037b21133e5367 Ryder Lee 2018-03-20  158  
037b21133e5367 Ryder Lee 2018-03-20  159  err_plat_populate:
037b21133e5367 Ryder Lee 2018-03-20  160of_clk_del_provider(node);
037b21133e5367 Ryder Lee 2018-03-20  161  err_clk_provider:
2fc0a509e4ee85 Sean Wang 2017-10-05  162return r;
2fc0a509e4ee85 Sean Wang 2017-10-05  163  }
2fc0a509e4ee85 Sean Wang 2017-10-05  164  

:: The code at line 153 was first introduced by commit
:: 037b21133e5367c833908db0226d77138ba4c5eb clk: mediatek: add 
devm_of_platform_populate() for MT7622 audsys

:: TO: Ryder Lee 
:: CC: Stephen Boyd 

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-04-13 Thread Alex Williamson
On Mon, 13 Apr 2020 13:41:57 -0700
Jacob Pan  wrote:

> Hi All,
> 
> Just a gentle reminder, any feedback on the options I listed below? New
> ideas will be even better.
> 
> Christoph, does the explanation make sense to you? We do have the
> capability/flag based scheme for IOMMU API extension, the version is
> mainly used for size lookup. Compatibility checking is another use of
> the version, it makes checking easy when a vIOMMU is launched.
> 
> Thanks,
> 
> Jacob
> 
> On Thu, 2 Apr 2020 11:36:04 -0700
> Jacob Pan  wrote:
> 
> > On Wed, 1 Apr 2020 05:32:21 +
> > "Tian, Kevin"  wrote:
> >   
> > > > From: Jacob Pan 
> > > > Sent: Tuesday, March 31, 2020 11:55 PM
> > > > 
> > > > On Tue, 31 Mar 2020 06:06:38 +
> > > > "Tian, Kevin"  wrote:
> > > >   
> > > > > > From: Jacob Pan 
> > > > > > Sent: Tuesday, March 31, 2020 12:08 AM
> > > > > >
> > > > > > On Mon, 30 Mar 2020 05:40:40 +
> > > > > > "Tian, Kevin"  wrote:
> > > > > >  
> > > > > > > > From: Jacob Pan 
> > > > > > > > Sent: Saturday, March 28, 2020 7:54 AM
> > > > > > > >
> > > > > > > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > > > > > > Christoph Hellwig  wrote:
> > > > > > > >  
> > > > > > > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin
> > > > > > > > > wrote:  
> > > > > > > > > > If those API calls are inter-dependent for composing a
> > > > > > > > > > feature (e.g. SVA), shouldn't we need a way to check
> > > > > > > > > > them together before exposing the feature to the
> > > > > > > > > > guest, e.g. through a iommu_get_uapi_capabilities
> > > > > > > > > > interface?  
> > > > > > > > >
> > > > > > > > > Yes, that makes sense.  The important bit is to have a
> > > > > > > > > capability flags and not version numbers.  
> > > > > > > >
> > > > > > > > The challenge is that there are two consumers in the
> > > > > > > > kernel for this. 1. VFIO only look for compatibility, and
> > > > > > > > size of each data struct such that it can copy_from_user.
> > > > > > > >
> > > > > > > > 2. IOMMU driver, the "real consumer" of the content.
> > > > > > > >
> > > > > > > > For 2, I agree and we do plan to use the capability flags
> > > > > > > > to check content and maintain backward compatibility etc.
> > > > > > > >
> > > > > > > > For VFIO, it is difficult to do size look up based on
> > > > > > > > capability flags.  
> > > > > > >
> > > > > > > Can you elaborate the difficulty in VFIO? if, as Christoph
> > > > > > > Hellwig pointed out, version number is already avoided
> > > > > > > everywhere, it is interesting to know whether this work
> > > > > > > becomes a real exception or just requires a different
> > > > > > > mindset.   
> > > > > > From VFIO p.o.v. the IOMMU UAPI data is opaque, it only needs
> > > > > > to do two things:
> > > > > > 1. is the UAPI compatible?
> > > > > > 2. what is the size to copy?
> > > > > >
> > > > > > If you look at the version number, this is really a "version
> > > > > > as size" lookup, as provided by the helper function in this
> > > > > > patch. An example can be the newly introduced clone3 syscall.
> > > > > > https://lwn.net/Articles/792628/
> > > > > > In clone3, new version must have new size. The slight
> > > > > > difference here is that, unlike clone3, we have multiple data
> > > > > > structures instead of a single struct clone_args {}. And each
> > > > > > struct has flags to enumerate its contents besides size.  
> > > > >
> > > > > Thanks for providing that link. However clone3 doesn't include a
> > > > > version field to do "version as size" lookup. Instead, as you
> > > > > said, it includes a size parameter which sounds like the option
> > > > > 3 (argsz) listed below.
> > > > >  
> > > > Right, there is no version in clone3. size = version. I view this
> > > > as a 1:1 lookup.
> > > >   
> > > > > >
> > > > > > Besides breaching data abstraction, if VFIO has to check IOMMU
> > > > > > flags to determine the sizes, it has many combinations.
> > > > > >
> > > > > > We also separate the responsibilities into two parts
> > > > > > 1. compatibility - version, size by VFIO
> > > > > > 2. sanity check - capability flags - by IOMMU  
> > > > >
> > > > > I feel argsz+flags approach can perfectly meet above
> > > > > requirement. The userspace set the size and flags for whatever
> > > > > capabilities it uses, and VFIO simply copies the parameters by
> > > > > size and pass to IOMMU for further sanity check. Of course the
> > > > > assumption is that we do provide an interface for userspace to
> > > > > enumerate all supported capabilities. 
> > > > You cannot trust user for argsz. the size to be copied from user
> > > > must be based on knowledge in kernel. That is why we have this
> > > > version to size lookup.
> > > > 
> > > > In VFIO, the size to copy is based on knowledge of each VFIO UAPI
> > > > structures and VFIO flags. But here the flags are IOMMU UAPI
> > > > flags. As you pointed out in another thread, VFIO is 

Re: [RFC PATCH 11/34] iommu: Split off default domain allocation from group assignment

2020-04-13 Thread Derrick, Jonathan
Hi Joerg,

On Tue, 2020-04-07 at 20:37 +0200, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> When a bus is initialized with iommu-ops, all devices on the bus are
> scanned and iommu-groups are allocated for them, and each groups will
> also get a default domain allocated.
> 
> Until now this happened as soon as the group was created and the first
> device added to it. When other devices with different default domain
> requirements were added to the group later on, the default domain was
> re-allocated, if possible.
> 
> This resulted in some back and forth and unnecessary allocations, so
> change the flow to defer default domain allocation until all devices
> have been added to their respective IOMMU groups.
> 
> The default domains are allocated for newly allocated groups after
> each device on the bus is handled and was probed by the IOMMU driver.
> 
> Signed-off-by: Joerg Roedel 
> ---
[snip]


I had to add the following for initial VMD support. The new PCIe domain
added on VMD endpoint probe didn't have the dev_iommu member set on the
VMD subdevices, which I'm guessing is due to probe_iommu_group already
having been run on the VMD endpoint's group prior to those subdevices
being added.

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8a5e1ac328dd..ac1e4fb9bf48 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1577,6 +1577,9 @@ static int iommu_bus_notifier(struct notifier_block *nb,
if (action == BUS_NOTIFY_ADD_DEVICE) {
int ret;
 
+   if (!dev_iommu_get(dev))
+   return -ENOMEM;
+
ret = iommu_probe_device(dev);
return (ret) ? NOTIFY_DONE : NOTIFY_OK;
} else if (action == BUS_NOTIFY_REMOVED_DEVICE) {
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 05/10] iommu/ioasid: Create an IOASID set for host SVA use

2020-04-13 Thread Jacob Pan
Hi Jean,

Sorry for the delay, I have to do some research based on your feedback.
I also plan to document this in the next version.


On Tue, 7 Apr 2020 13:01:46 +0200
Jean-Philippe Brucker  wrote:

> On Mon, Apr 06, 2020 at 08:33:53AM -0700, Jacob Pan wrote:
> > Hi Jean,
> > 
> > On Wed, 1 Apr 2020 15:53:16 +0200
> > Jean-Philippe Brucker  wrote:
> >   
> > > On Wed, Mar 25, 2020 at 10:55:26AM -0700, Jacob Pan wrote:  
> > > > Bare metal SVA allocates IOASIDs for native process addresses.
> > > > This should be separated from VM allocated IOASIDs thus under
> > > > its own set.
> > > > 
> > > > This patch creates a system IOASID set with its quota set to
> > > > PID_MAX. This is a reasonable default in that SVM capable
> > > > devices can only bind to limited user processes.
> > > 
> > > Yes realistically there won't be more than PID_MAX_DEFAULT=0x8000
> > > bound address spaces. My machine uses a PID_MAX of 4 million
> > > though, so in theory more than 0x8000 processes may want a bond.  
> > Got it, I assume we can adjust the system set quota as necessary.
> >   
> > > On Arm the
> > > limit of shared contexts per VM is currently a little less than
> > > 0x1 (which is the number of CPU ASIDs).
> > >   
> > I guess shared contexts means shared address? then it makes sense
> > #IOASID < #ASID.  
> 
> Yes by shared contexts I mean shared address spaces. Theoretically
> #ASID < #IOASID for us, because the max ASID size is 16-bit.
> 
Got it.

> >   
> > > But quotas are only necessary for VMs, when the host shares the
> > > PASID space with them (which isn't a use-case for Arm systems as
> > > far as I know, each VM gets its own PASID space).  
> > Is there a host-guest PASID translation? or the PASID used by the
> > VM is physical PASID? When a page request comes in to SMMU, how
> > does it know the owner of the PASID if PASID range can overlap
> > between host and guest?  
> 
> We assign PCI functions to VMs, so Page Requests are routed with
> RID:PASID, not PASID alone. The SMMU finds the struct device
> associated with the RID, and submits the fault with
> iommu_report_device_fault(). If the VF is assigned to a VM, then the
> page request gets injected into the VM, otherwise it uses the host
> IOPF handler
> 
Got it, VM private PASID space works then.
For VM, the IOASID search is within the VM ioasid_set.
For SVA, the IOASID search is within host default set.
Should be faster than global search once we have per set xarray.
I guess the PASID table is per VM instead of per RID (device)? Sorry if
you already answered it before.


> > > Could we have quota-free IOASID sets for the host?
> > >   
> > Yes, perhaps just add a flag such that the set has its own
> > namespace. You mean have this quota-free IOASID set even co-exist
> > with VMs? I still don't get how PRQ works.
> > 
> > That is not the use case for VT-d in that we have to have
> > system-wide allocation for host PASIDs. We have enqcmd which can
> > take a PASID from the per task MSR and deliver to multiple devices,
> > so even though the PASID table is per device the PASID name space
> > must be global. 
> > > For the SMMU I'd like to allocate two sets, one SVA and one
> > > private for auxiliary domains, and I don't think giving either a
> > > quota makes much sense at the moment.  
> > I agree we don;t need the quota if we don't support guest SVA at the
> > same time.
> > 
> > So the sva set and aux_domain set PASIDs have their own
> > namespaces?  
> 
> They share the same PASID space, but they store different objects
> (mm_struct and context descriptor, respectively) so they need
> different ioasid_set tokens.
> 
Got it.

> >   
> > > There can be systems using only SVA and
> > > systems using only private PASIDs. I think it should be
> > > first-come-first-served until admins want a knob to define a
> > > policy themselves, based on cgroups for example.
> > >   
> > > > Signed-off-by: Jacob Pan 
> > > > ---
> > > >  drivers/iommu/intel-iommu.c | 8 +++-
> > > >  drivers/iommu/ioasid.c  | 9 +
> > > >  include/linux/ioasid.h  | 9 +
> > > >  3 files changed, 25 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/iommu/intel-iommu.c
> > > > b/drivers/iommu/intel-iommu.c index ec3fc121744a..af7a1ef7b31e
> > > > 100644 --- a/drivers/iommu/intel-iommu.c
> > > > +++ b/drivers/iommu/intel-iommu.c
> > > > @@ -3511,8 +3511,14 @@ static int __init init_dmars(void)
> > > > goto free_iommu;
> > > >  
> > > > /* PASID is needed for scalable mode irrespective to
> > > > SVM */
> > > > -   if (intel_iommu_sm)
> > > > +   if (intel_iommu_sm) {
> > > > ioasid_install_capacity(intel_pasid_max_id);
> > > > +   /* We should not run out of IOASIDs at boot */
> > > > +   if (ioasid_alloc_system_set(PID_MAX_DEFAULT)) {
> > > > +   pr_err("Failed to enable host PASID
> > > > allocator\n");
> > > > +   

Re: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

2020-04-13 Thread Jacob Pan
Hi All,

Just a gentle reminder, any feedback on the options I listed below? New
ideas will be even better.

Christoph, does the explanation make sense to you? We do have the
capability/flag based scheme for IOMMU API extension, the version is
mainly used for size lookup. Compatibility checking is another use of
the version, it makes checking easy when a vIOMMU is launched.

Thanks,

Jacob

On Thu, 2 Apr 2020 11:36:04 -0700
Jacob Pan  wrote:

> On Wed, 1 Apr 2020 05:32:21 +
> "Tian, Kevin"  wrote:
> 
> > > From: Jacob Pan 
> > > Sent: Tuesday, March 31, 2020 11:55 PM
> > > 
> > > On Tue, 31 Mar 2020 06:06:38 +
> > > "Tian, Kevin"  wrote:
> > > 
> > > > > From: Jacob Pan 
> > > > > Sent: Tuesday, March 31, 2020 12:08 AM
> > > > >
> > > > > On Mon, 30 Mar 2020 05:40:40 +
> > > > > "Tian, Kevin"  wrote:
> > > > >
> > > > > > > From: Jacob Pan 
> > > > > > > Sent: Saturday, March 28, 2020 7:54 AM
> > > > > > >
> > > > > > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > > > > > Christoph Hellwig  wrote:
> > > > > > >
> > > > > > > > On Fri, Mar 27, 2020 at 02:49:55AM +, Tian, Kevin
> > > > > > > > wrote:
> > > > > > > > > If those API calls are inter-dependent for composing a
> > > > > > > > > feature (e.g. SVA), shouldn't we need a way to check
> > > > > > > > > them together before exposing the feature to the
> > > > > > > > > guest, e.g. through a iommu_get_uapi_capabilities
> > > > > > > > > interface?
> > > > > > > >
> > > > > > > > Yes, that makes sense.  The important bit is to have a
> > > > > > > > capability flags and not version numbers.
> > > > > > >
> > > > > > > The challenge is that there are two consumers in the
> > > > > > > kernel for this. 1. VFIO only look for compatibility, and
> > > > > > > size of each data struct such that it can copy_from_user.
> > > > > > >
> > > > > > > 2. IOMMU driver, the "real consumer" of the content.
> > > > > > >
> > > > > > > For 2, I agree and we do plan to use the capability flags
> > > > > > > to check content and maintain backward compatibility etc.
> > > > > > >
> > > > > > > For VFIO, it is difficult to do size look up based on
> > > > > > > capability flags.
> > > > > >
> > > > > > Can you elaborate the difficulty in VFIO? if, as Christoph
> > > > > > Hellwig pointed out, version number is already avoided
> > > > > > everywhere, it is interesting to know whether this work
> > > > > > becomes a real exception or just requires a different
> > > > > > mindset. 
> > > > > From VFIO p.o.v. the IOMMU UAPI data is opaque, it only needs
> > > > > to do two things:
> > > > > 1. is the UAPI compatible?
> > > > > 2. what is the size to copy?
> > > > >
> > > > > If you look at the version number, this is really a "version
> > > > > as size" lookup, as provided by the helper function in this
> > > > > patch. An example can be the newly introduced clone3 syscall.
> > > > > https://lwn.net/Articles/792628/
> > > > > In clone3, new version must have new size. The slight
> > > > > difference here is that, unlike clone3, we have multiple data
> > > > > structures instead of a single struct clone_args {}. And each
> > > > > struct has flags to enumerate its contents besides size.
> > > >
> > > > Thanks for providing that link. However clone3 doesn't include a
> > > > version field to do "version as size" lookup. Instead, as you
> > > > said, it includes a size parameter which sounds like the option
> > > > 3 (argsz) listed below.
> > > >
> > > Right, there is no version in clone3. size = version. I view this
> > > as a 1:1 lookup.
> > > 
> > > > >
> > > > > Besides breaching data abstraction, if VFIO has to check IOMMU
> > > > > flags to determine the sizes, it has many combinations.
> > > > >
> > > > > We also separate the responsibilities into two parts
> > > > > 1. compatibility - version, size by VFIO
> > > > > 2. sanity check - capability flags - by IOMMU
> > > >
> > > > I feel argsz+flags approach can perfectly meet above
> > > > requirement. The userspace set the size and flags for whatever
> > > > capabilities it uses, and VFIO simply copies the parameters by
> > > > size and pass to IOMMU for further sanity check. Of course the
> > > > assumption is that we do provide an interface for userspace to
> > > > enumerate all supported capabilities.   
> > > You cannot trust user for argsz. the size to be copied from user
> > > must be based on knowledge in kernel. That is why we have this
> > > version to size lookup.
> > > 
> > > In VFIO, the size to copy is based on knowledge of each VFIO UAPI
> > > structures and VFIO flags. But here the flags are IOMMU UAPI
> > > flags. As you pointed out in another thread, VFIO is one user.
> > 
> > If that is the case, can we let VFIO only copy its own UAPI fields
> > while simply passing the user pointer of IOMMU UAPI structure to
> > IOMMU driver for further size check and copy? Otherwise we are
> > entering a dead end that VFIO doesn't want to parse a structure
> > 

Re: [PATCH 25/28] mm: remove vmalloc_user_node_flags

2020-04-13 Thread Johannes Weiner
On Thu, Apr 09, 2020 at 03:25:03PM -0700, Andrii Nakryiko wrote:
> cc Johannes who suggested this API call originally

I forgot why we did it this way - probably just cruft begetting more
cruft. Either way, Christoph's cleanup makes this look a lot better.

> On Wed, Apr 8, 2020 at 5:03 AM Christoph Hellwig  wrote:
> >
> > Open code it in __bpf_map_area_alloc, which is the only caller.  Also
> > clean up __bpf_map_area_alloc to have a single vmalloc call with
> > slightly different flags instead of the current two different calls.
> >
> > For this to compile for the nommu case add a __vmalloc_node_range stub
> > to nommu.c.
> >
> > Signed-off-by: Christoph Hellwig 

Acked-by: Johannes Weiner 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Alex Williamson
On Thu, 9 Apr 2020 09:35:33 +0200
Jean-Philippe Brucker  wrote:

> On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > On Tue, 7 Apr 2020 21:00:21 -0700
> > "Raj, Ashok"  wrote:
> >   
> > > Hi Alex
> > > 
> > > + Bjorn  
> > 
> >  + Don
> >   
> > > FWIW I can't understand why PCI SIG went different ways with ATS, 
> > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > in PF. 
> > > 
> > > I'm checking with our internal SIG reps to followup on that.
> > > 
> > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:  
> > > > > Is there vendor guarantee that hidden registers will locate at the
> > > > > same offset between PF and VF config space? 
> > > > 
> > > > I'm not sure if the spec really precludes hidden registers, but the
> > > > fact that these registers are explicitly outside of the capability
> > > > chain implies they're only intended for device specific use, so I'd say
> > > > there are no guarantees about anything related to these registers.
> > > 
> > > As you had suggested in the other thread, we could consider
> > > using the same offset as in PF, but even that's a better guess
> > > still not reliable.
> > > 
> > > The other option is to maybe extend driver ops in the PF to expose
> > > where the offsets should be. Sort of adding the quirk in the 
> > > implementation. 
> > > 
> > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is 
> > > resisting 
> > > making VF's first class citizen, we might ask them to add some verbiage
> > > to suggest leave the same offsets as PF open to help emulation software.  
> > 
> > Even if we know where to expose these capabilities on the VF, it's not
> > clear to me how we can actually virtualize the capability itself.  If
> > the spec defines, for example, an enable bit as r/w then software that
> > interacts with that register expects the bit is settable.  There's no
> > protocol for "try to set the bit and re-read it to see if the hardware
> > accepted it".  Therefore a capability with a fixed enable bit
> > representing the state of the PF, not settable by the VF, is
> > disingenuous to the spec.  
> 
> Would it be OK to implement a lock down mechanism for the PF PASID
> capability, preventing changes to the PF cap when the VF is in use by
> VFIO?  The emulation would still break the spec: since the PF cap would
> always be enabled the VF configuration bits would have no effect, but it
> seems preferable to having the Enable bit not enable anything.

I think we absolutely need some mechanism to make sure the PF driver
doesn't change the PASID enable bit while we're using it.  A PASID user
registration perhaps.  And yes, that doesn't necessarily map to being
able to actually disable PASID, but it sounds like Kevin and Ashok have
some ideas how the emulation could use the IOMMU settings to achieve
that more precisely.

> > If what we're trying to do is expose that PASID and PRI are enabled on
> > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > without the ability to control it is not the right approach.  Maybe we
> > need new capabilities exposing these as slave features that cannot be
> > controlled?  We could define our own vendor capability for this, but of
> > course we have both the where to put it in config space issue, as well
> > as the issue of trying to push an ad-hoc standard.  vfio could expose
> > these as device features rather than emulating capabilities, but that
> > still leaves a big gap between vfio in the hypervisor and the driver in
> > the guest VM.  That might still help push the responsibility and policy
> > for how to expose it to the VM as a userspace problem though.  
> 
> Userspace might have more difficulty working around the issues mentioned
> in this thread. They would still need a guarantee that the PF PASID
> configuration doesn't change at runtime, and they wouldn't have the
> ability to talk to a vendor driver to figure out where to place the fake
> PASID capability.

Couldn't we do this with the DEVICE_FEATURE ioctl we just added?  A
user could set a PASID user or get a capability offset, where vfio-pci
would plumb these through to whatever PF/vendor driver provides that
interface, just as if it was implemented in the vfio-pci driver.  But
that still lets us leave the policy of inserting a capability and using
the IOMMU to implement the bits to the user/QEMU.  Thanks,

Alex

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Alex Williamson
On Mon, 13 Apr 2020 08:05:33 +
"Tian, Kevin"  wrote:

> > From: Tian, Kevin
> > Sent: Monday, April 13, 2020 3:55 PM
> >   
> > > From: Raj, Ashok 
> > > Sent: Monday, April 13, 2020 11:11 AM
> > >
> > > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:  
> > > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > > "Raj, Ashok"  wrote:
> > > >  
> > > > > Hi Alex
> > > > >
> > > > > + Bjorn  
> > > >
> > > >  + Don
> > > >  
> > > > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > > > in PF.
> > > > >
> > > > > I'm checking with our internal SIG reps to followup on that.
> > > > >
> > > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:  
> > > > > > > Is there vendor guarantee that hidden registers will locate at the
> > > > > > > same offset between PF and VF config space?  
> > > > > >
> > > > > > I'm not sure if the spec really precludes hidden registers, but the
> > > > > > fact that these registers are explicitly outside of the capability
> > > > > > chain implies they're only intended for device specific use, so I'd 
> > > > > > say
> > > > > > there are no guarantees about anything related to these registers.  
> > > > >
> > > > > As you had suggested in the other thread, we could consider
> > > > > using the same offset as in PF, but even that's a better guess
> > > > > still not reliable.
> > > > >
> > > > > The other option is to maybe extend driver ops in the PF to expose
> > > > > where the offsets should be. Sort of adding the quirk in the
> > > > > implementation.
> > > > >
> > > > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is 
> > > > >  
> > > resisting  
> > > > > making VF's first class citizen, we might ask them to add some 
> > > > > verbiage
> > > > > to suggest leave the same offsets as PF open to help emulation 
> > > > > software.  
> > > >
> > > > Even if we know where to expose these capabilities on the VF, it's not
> > > > clear to me how we can actually virtualize the capability itself.  If
> > > > the spec defines, for example, an enable bit as r/w then software that
> > > > interacts with that register expects the bit is settable.  There's no
> > > > protocol for "try to set the bit and re-read it to see if the hardware
> > > > accepted it".  Therefore a capability with a fixed enable bit
> > > > representing the state of the PF, not settable by the VF, is
> > > > disingenuous to the spec.  
> > >
> > > I think we are all in violent agreement. A lot of times the pci spec gets
> > > defined several years ahead of real products and no one remembers
> > > the justification on why they restricted things the way they did.
> > >
> > > Maybe someone early product wasn't quite exposing these features to the
> > > VF
> > > and hence the spec is bug compatible :-)
> > >  
> > > >
> > > > If what we're trying to do is expose that PASID and PRI are enabled on
> > > > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > > > without the ability to control it is not the right approach.  Maybe we  
> > >
> > > As long as the capability enable is only provided when the PF has enabled
> > > the feature. Then it seems the hardware seems to do the right thing.
> > >
> > > Assume we expose PASID/PRI only when PF has enabled it. It will be the
> > > case since the PF driver needs to exist, and IOMMU would have set the
> > > PASID/PRI/ATS on PF.
> > >
> > > If the emulation is purely spoofing the capability. Once vIOMMU driver
> > > enables PASID, the context entries for the VF are completely independent
> > > from the PF context entries.
> > >
> > > vIOMMU would enable PASID, and we just spoof the PASID capability.
> > >
> > > If vIOMMU or guest for some reason does disable_pasid(), then the
> > > vIOMMU driver can disaable PASID on the VF context entries. So the VF
> > > although the capability is blanket enabled on PF, IOMMU gaurantees the
> > > transactions are blocked.
> > >
> > >
> > > In the interim, it seems like the intent of the virtual capability
> > > can be honored via help from the IOMMU for the controlling aspect..
> > >
> > > Did i miss anything?  
> > 
> > Above works for emulating the enable bit (under the assumption that
> > PF driver won't disable pasid when vf is assigned). However, there are
> > also "Execute permission enable" and "Privileged mode enable" bits in
> > PASID control registers. I don't know how those bits could be cleanly
> > emulated when the guest writes a value different from PF's...  
> 
> sent too quick. the IOMMU also includes control bits for allowing/
> blocking execute requests and supervisor requests. We can rely on 
> IOMMU to block those requests to emulate the disabled cases of
> all three control bits in the pasid cap.


So if the emulation of the PASID capability takes into account the
IOMMU configuration to back that emulation, shouldn't we do that
emulation in the 

Re: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Alex Williamson
On Sun, 12 Apr 2020 20:29:31 -0700
"Raj, Ashok"  wrote:

> Hi Alex
> 
> Going through the PCIe Spec, there seems a lot of such capabilities
> that are different between PF and VF. Some that make sense
> and some don't.
> 
> 
> On Sun, Apr 12, 2020 at 08:10:43PM -0700, Raj, Ashok wrote:
> >   
> > > 
> > > I agree though, I don't know why the SIG would preclude implementing
> > > per VF control of these features.  Thanks,
> > >   
> 
> For e.g. 
> 
> VF doesn't have I/O and Mem space enables, but has BME

VFs don't have I/O, so I/O enable is irrelevant.  The memory enable bit
is emulated, so it doesn't really do anything from the VM perspective.
The hypervisor could provide more emulation around this, but it hasn't
proven necessary.

> Interrupt Status

VFs don't have INTx, so this is irrelevant.

> Correctable Error Reporting
> Almost all of Device Control Register.

Are we doing anything to virtualize these for VFs?  I think we've
addressed access control to these for PFs, but I don't see that we try
to virtualize them for the VF.

> So it seems like there is a ton of them we have to deal with today for 
> VF's. How do we manage to emulate them without any support for them 
> in VF's? 

The memory enable bit is just access to the MMIO space of the device,
the hypervisor could choose to do more, but currently emulating the bit
itself is sufficient.  This doesn't really affect the device, just
access to the device.  The device control registers, I don't think
we've had a need to virtualize them yet and I think we'd run into many
of the same questions.  If your point is that there exists gaps in the
spec that make things difficult to virtualize, I won't argue with you
there.  MPS is a nearby one that's difficult to virtualize on the PF
since its setting needs to take entire communication channels into
account.

So far though we aren't inventing new capabilities to add to VF config
space and pretending they work, we're just stumbling on what the VF
exposes whether on bare metal or in a VM.  Thanks,

Alex

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: use DMA domain for real DMA devices and subdevices

2020-04-13 Thread Derrick, Jonathan
On Sun, 2020-04-12 at 11:50 +0800, Daniel Drake wrote:
> Hi Jon,
> 
> Thanks for picking this up. Apologies for my absence here - I wasn't
> able to work on this recently, but I'm back again now.
> 
> On Fri, Apr 10, 2020 at 3:32 AM Jon Derrick  
> wrote:
> > This becomes problematic if the real DMA device and the subdevices have
> > different addressing capabilities and some require translation. Instead we 
> > can
> > put the real DMA dev and any subdevices on the DMA domain. This change 
> > assigns
> > subdevices to the DMA domain, and moves the real DMA device to the DMA 
> > domain
> > if necessary.
> 
> Have you tested this with the real DMA device in identity mode?
> It is not quite working for me. (Again, I'm not using VMD here, but
> have looked closely and believe we're working under the same
> constraints)

It does work for me when real DMA device starts in Identity, but my
'real DMA device' doesn't do the DMA. It just provides the source-id.

Does your 'real DMA device' do DMA?
I suppose that could be the reason. You wouldn't want to change the
domain on the live device using the method I proposed.

> 
> First, the real DMA device gets added to the group:
>  pci :00:17.0: Adding to iommu group 9
> (it's in IDENTITY mode here)
> 
> Then later, the first subdevice comes along, and these are the results:
>  pci 1:00:00.0: [8086:02d7] type 00 class 0x010601
>  pci 1:00:00.0: reg 0x10: [mem 0xae1a-0xae1a7fff]
>  pci 1:00:00.0: reg 0x14: [mem 0xae1a8000-0xae1a80ff]
>  pci 1:00:00.0: reg 0x18: [io  0x3090-0x3097]
>  pci 1:00:00.0: reg 0x1c: [io  0x3080-0x3083]
>  pci 1:00:00.0: reg 0x20: [io  0x3060-0x307f]
>  pci 1:00:00.0: reg 0x24: [mem 0xae10-0xae103fff]
>  pci 1:00:00.0: PME# supported from D3hot
>  pci 1:00:00.0: Adding to iommu group 9
>  pci 1:00:00.0: DMAR: Failed to get a private domain.
> 
> That final message is added by your patch and indicates that it's not working.
> 
> This is because the subdevice got added to the iommu group before the
> code you added tried to change to the DMA domain.
> 
> It first gets added to the group through this call path:
> intel_iommu_add_device
> -> iommu_group_get_for_dev
> -> iommu_group_add_device
> 
> Then, continuing within intel_iommu_add_device we get to the code you
> added, which tries to move the real DMA dev to DMA mode instead. It
> calls:
> 
>intel_iommu_request_dma_domain_for_dev
> -> iommu_request_dma_domain_for_dev
> -> request_default_domain_for_dev
> 
> Which fails here:
> /* Don't change mappings of existing devices */
> ret = -EBUSY;
> if (iommu_group_device_count(group) != 1)
> goto out;
> 
> because we already have 2 devices in the group (the real DMA dev, plus
> the subdevice we're in the process of handling now).
> 

You're right. I see the message too, but it still works for me.

> Next I'll look into the iommu group rework that Baolu mentioned.
> 
> Thanks,
> Daniel
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: use DMA domain for real DMA devices and subdevices

2020-04-13 Thread Derrick, Jonathan
On Mon, 2020-04-13 at 10:48 +0800, Lu Baolu wrote:
> Hi Daniel,
> 
> On 2020/4/13 10:25, Daniel Drake wrote:
> > On Fri, Apr 10, 2020 at 9:22 AM Lu Baolu  wrote:
> > > This is caused by the fragile private domain implementation. We are in
> > > process of removing it by enhancing the iommu subsystem with per-group
> > > default domain.
> > > 
> > > https://www.spinics.net/lists/iommu/msg42976.html
> > > 
> > > So ultimately VMD subdevices should have their own per-device iommu data
> > > and support per-device dma ops.
> > 
> > Interesting. There's also this patchset you posted:
> > [PATCH 00/19] [PULL REQUEST] iommu/vt-d: patches for v5.7
> > https://lists.linuxfoundation.org/pipermail/iommu/2020-April/042967.html
> > (to be pushed out to 5.8)
> 
> Both are trying to solve a same problem.
> 
> I have sync'ed with Joerg. This patch set will be replaced with Joerg's
> proposal due to a race concern between domain switching and driver
> binding. I will rebase all vt-d patches in this set on top of Joerg's
> change.
> 
> Best regards,
> baolu
> 
Thanks Baolu. I'll pick this back up on top of the for-5.8 changes.


> > In there you have:
> > > iommu/vt-d: Don't force 32bit devices to uses DMA domain
> > which seems to clash with the approach being explored in this thread.
> > 
> > And:
> > > iommu/vt-d: Apply per-device dma_ops
> > This effectively solves the trip point that caused me to open these
> > discussions, where intel_map_page() -> iommu_need_mapping() would
> > incorrectly determine that a intel-iommu DMA mapping was needed for a
> > PCI subdevice running in identity mode. After this patch, a PCI
> > subdevice in identity mode uses the default system dma_ops and
> > completely avoids intel-iommu.
> > 
> > So that solves the issues I was looking at. Jon, you might want to
> > check if the problems you see are likewise solved for you by these
> > patches.
> > 
> > I didn't try Joerg's iommu group rework yet as it conflicts with those
> > patches above.
> > 
> > Daniel
> > 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 0/2] iommu/arm-smmu: Allow client devices to select direct mapping

2020-04-13 Thread Jordan Crouse
On Thu, Apr 09, 2020 at 04:31:24PM -0700, Matthias Kaehlcke wrote:
> On Tue, Feb 04, 2020 at 11:12:17PM +0530, Sai Prakash Ranjan wrote:
> > Hello Robin, Will
> > 
> > On 2020-01-22 17:18, Sai Prakash Ranjan wrote:
> > > This series allows drm devices to set a default identity
> > > mapping using iommu_request_dm_for_dev(). First patch is
> > > a cleanup to support other SoCs to call into QCOM specific
> > > implementation and preparation for second patch.
> > > Second patch sets the default identity domain for drm devices.
> > > 
> > > Jordan Crouse (1):
> > >   iommu/arm-smmu: Allow client devices to select direct mapping
> > > 
> > > Sai Prakash Ranjan (1):
> > >   iommu: arm-smmu-impl: Convert to a generic reset implementation
> > > 
> > >  drivers/iommu/arm-smmu-impl.c |  8 +++--
> > >  drivers/iommu/arm-smmu-qcom.c | 55 +--
> > >  drivers/iommu/arm-smmu.c  |  3 ++
> > >  drivers/iommu/arm-smmu.h  |  5 
> > >  4 files changed, 65 insertions(+), 6 deletions(-)
> > 
> > Any review comments?
> 
> Ping
> 
> What is the status of this series, is it ready to land or are any changes
> needed?
> 
> Thanks
> 
> Matthias

I think this is up in the air following the changes that Joerg suggested:
https://lists.linuxfoundation.org/pipermail/iommu/2020-April/043017.html

Jordan
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] of_device: removed #include that caused a recursion in included headers

2020-04-13 Thread Hadar Gat
Both of_platform.h and of_device.h were included each other.
In of_device.h, removed unneeded #include to of_platform.h
and added include to of_platform.h in the files that needs it.

Signed-off-by: Hadar Gat 
---
 drivers/base/platform.c   | 1 +
 drivers/bus/vexpress-config.c | 1 +
 drivers/dma/at_hdmac.c| 1 +
 drivers/dma/stm32-dmamux.c| 1 +
 drivers/dma/ti/dma-crossbar.c | 1 +
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 1 +
 drivers/gpu/drm/msm/hdmi/hdmi.c   | 1 +
 drivers/gpu/drm/msm/msm_drv.c | 1 +
 drivers/gpu/drm/rockchip/dw-mipi-dsi-rockchip.c   | 1 +
 drivers/gpu/drm/sun4i/sun4i_tcon.c| 1 +
 drivers/iio/adc/stm32-adc-core.c  | 1 +
 drivers/iio/adc/stm32-dfsdm-adc.c | 1 +
 drivers/iio/adc/stm32-dfsdm-core.c| 1 +
 drivers/iommu/tegra-smmu.c| 1 +
 drivers/memory/atmel-ebi.c| 1 +
 drivers/mfd/palmas.c  | 1 +
 drivers/mfd/ssbi.c| 1 +
 drivers/mtd/nand/raw/omap2.c  | 1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 1 +
 drivers/net/ethernet/ti/cpsw.c| 1 +
 drivers/phy/tegra/xusb.c  | 1 +
 drivers/pinctrl/nomadik/pinctrl-nomadik.c | 1 +
 drivers/soc/samsung/exynos-pmu.c  | 1 +
 drivers/soc/sunxi/sunxi_sram.c| 1 +
 include/linux/of_device.h | 2 --
 lib/genalloc.c| 1 +
 26 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 520..f549274b 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/bus/vexpress-config.c b/drivers/bus/vexpress-config.c
index ff70575..12b8b0b 100644
--- a/drivers/bus/vexpress-config.c
+++ b/drivers/bus/vexpress-config.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
diff --git a/drivers/dma/at_hdmac.c b/drivers/dma/at_hdmac.c
index 672c73b..51337b4 100644
--- a/drivers/dma/at_hdmac.c
+++ b/drivers/dma/at_hdmac.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "at_hdmac_regs.h"
diff --git a/drivers/dma/stm32-dmamux.c b/drivers/dma/stm32-dmamux.c
index 3c89bd3..74695b9 100644
--- a/drivers/dma/stm32-dmamux.c
+++ b/drivers/dma/stm32-dmamux.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/dma/ti/dma-crossbar.c b/drivers/dma/ti/dma-crossbar.c
index f255056..10c6d23 100644
--- a/drivers/dma/ti/dma-crossbar.c
+++ b/drivers/dma/ti/dma-crossbar.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define TI_XBAR_DRA7   0
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index c4e71ab..f523254 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "a6xx_gpu.h"
diff --git a/drivers/gpu/drm/msm/hdmi/hdmi.c b/drivers/gpu/drm/msm/hdmi/hdmi.c
index 737453b..5034d40 100644
--- a/drivers/gpu/drm/msm/hdmi/hdmi.c
+++ b/drivers/gpu/drm/msm/hdmi/hdmi.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 
 #include 
 #include "hdmi.h"
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 29295de..ddc9e85 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
diff --git a/drivers/gpu/drm/rockchip/dw-mipi-dsi-rockchip.c 
b/drivers/gpu/drm/rockchip/dw-mipi-dsi-rockchip.c
index 6e1270e..d038bae 100644
--- a/drivers/gpu/drm/rockchip/dw-mipi-dsi-rockchip.c
+++ b/drivers/gpu/drm/rockchip/dw-mipi-dsi-rockchip.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/sun4i/sun4i_tcon.c 
b/drivers/gpu/drm/sun4i/sun4i_tcon.c
index 624437b..aa35757 100644
--- a/drivers/gpu/drm/sun4i/sun4i_tcon.c
+++ b/drivers/gpu/drm/sun4i/sun4i_tcon.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iio/adc/stm32-adc-core.c b/drivers/iio/adc/stm32-adc-core.c
index 2df88d2..3dc3453 100644
--- a/drivers/iio/adc/stm32-adc-core.c
+++ b/drivers/iio/adc/stm32-adc-core.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iio/adc/stm32-dfsdm-adc.c 
b/drivers/iio/adc/stm32-dfsdm-adc.c
index 76a60d9..e83848cb 100644
--- a/drivers/iio/adc/stm32-dfsdm-adc.c
+++ 

Re: [PATCH v2 1/3] iommu/virtio: Add topology description to virtio-iommu config space

2020-04-13 Thread Michael S. Tsirkin
On Fri, Feb 28, 2020 at 06:25:36PM +0100, Jean-Philippe Brucker wrote:
> diff --git a/include/uapi/linux/virtio_iommu.h 
> b/include/uapi/linux/virtio_iommu.h
> index 237e36a280cb..ec57d215086a 100644
> --- a/include/uapi/linux/virtio_iommu.h
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -16,6 +16,7 @@
>  #define VIRTIO_IOMMU_F_BYPASS3
>  #define VIRTIO_IOMMU_F_PROBE 4
>  #define VIRTIO_IOMMU_F_MMIO  5
> +#define VIRTIO_IOMMU_F_TOPOLOGY  6
>  
>  struct virtio_iommu_range_64 {
>   __le64  start;
> @@ -27,6 +28,12 @@ struct virtio_iommu_range_32 {
>   __le32  end;
>  };
>  
> +struct virtio_iommu_topo_config {
> + __le32  offset;

Any restrictions on offset? E.g. alignment?

> + __le32  num_items;
> + __le32  item_length;
> +};
> +
>  struct virtio_iommu_config {
>   /* Supported page sizes */
>   __le64  page_size_mask;
> @@ -36,6 +43,25 @@ struct virtio_iommu_config {
>   struct virtio_iommu_range_32domain_range;
>   /* Probe buffer size */
>   __le32  probe_size;
> + struct virtio_iommu_topo_config topo_config;
> +};
> +
> +#define VIRTIO_IOMMU_TOPO_PCI_RANGE  0x1
> +#define VIRTIO_IOMMU_TOPO_ENDPOINT   0x2
> +
> +struct virtio_iommu_topo_pci_range {
> + __le16  type;
> + __le16  hierarchy;
> + __le16  requester_start;
> + __le16  requester_end;
> + __le32  endpoint_start;
> +};
> +
> +struct virtio_iommu_topo_endpoint {
> + __le16  type;
> + __le16  reserved;
> + __le32  endpoint;
> + __le64  address;
>  };
>  
>  /* Request types */

As any UAPI change, this needs to be copied to virtio TC.

I believe an old version of QEMU patches was published there
but I don't think it was the latest one you tested against.

Description should preferably be added to spec too.

In partucular please add comments (in this header, too)
documenting the new fields, values and structures.

> -- 
> 2.25.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] dt-bndings: iommu: renesas, ipmmu-vmsa: convert to json-schema

2020-04-13 Thread Yoshihiro Shimoda
Convert Renesas VMSA-Compatible IOMMU bindings documentation
to json-schema.

Signed-off-by: Yoshihiro Shimoda 
---
 .../bindings/iommu/renesas,ipmmu-vmsa.txt  | 73 --
 .../bindings/iommu/renesas,ipmmu-vmsa.yaml | 90 ++
 2 files changed, 90 insertions(+), 73 deletions(-)
 delete mode 100644 
Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.txt
 create mode 100644 
Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.yaml

diff --git a/Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.txt 
b/Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.txt
deleted file mode 100644
index 020d6f2..
--- a/Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.txt
+++ /dev/null
@@ -1,73 +0,0 @@
-* Renesas VMSA-Compatible IOMMU
-
-The IPMMU is an IOMMU implementation compatible with the ARM VMSA page tables.
-It provides address translation for bus masters outside of the CPU, each
-connected to the IPMMU through a port called micro-TLB.
-
-
-Required Properties:
-
-  - compatible: Must contain SoC-specific and generic entry below in case
-the device is compatible with the R-Car Gen2 VMSA-compatible IPMMU.
-
-- "renesas,ipmmu-r8a73a4" for the R8A73A4 (R-Mobile APE6) IPMMU.
-- "renesas,ipmmu-r8a7743" for the R8A7743 (RZ/G1M) IPMMU.
-- "renesas,ipmmu-r8a7744" for the R8A7744 (RZ/G1N) IPMMU.
-- "renesas,ipmmu-r8a7745" for the R8A7745 (RZ/G1E) IPMMU.
-- "renesas,ipmmu-r8a774a1" for the R8A774A1 (RZ/G2M) IPMMU.
-- "renesas,ipmmu-r8a774b1" for the R8A774B1 (RZ/G2N) IPMMU.
-- "renesas,ipmmu-r8a774c0" for the R8A774C0 (RZ/G2E) IPMMU.
-- "renesas,ipmmu-r8a7790" for the R8A7790 (R-Car H2) IPMMU.
-- "renesas,ipmmu-r8a7791" for the R8A7791 (R-Car M2-W) IPMMU.
-- "renesas,ipmmu-r8a7793" for the R8A7793 (R-Car M2-N) IPMMU.
-- "renesas,ipmmu-r8a7794" for the R8A7794 (R-Car E2) IPMMU.
-- "renesas,ipmmu-r8a7795" for the R8A7795 (R-Car H3) IPMMU.
-- "renesas,ipmmu-r8a7796" for the R8A7796 (R-Car M3-W) IPMMU.
-- "renesas,ipmmu-r8a77965" for the R8A77965 (R-Car M3-N) IPMMU.
-- "renesas,ipmmu-r8a77970" for the R8A77970 (R-Car V3M) IPMMU.
-- "renesas,ipmmu-r8a77980" for the R8A77980 (R-Car V3H) IPMMU.
-- "renesas,ipmmu-r8a77990" for the R8A77990 (R-Car E3) IPMMU.
-- "renesas,ipmmu-r8a77995" for the R8A77995 (R-Car D3) IPMMU.
-- "renesas,ipmmu-vmsa" for generic R-Car Gen2 or RZ/G1 VMSA-compatible
-  IPMMU.
-
-  - reg: Base address and size of the IPMMU registers.
-  - interrupts: Specifiers for the MMU fault interrupts. For instances that
-support secure mode two interrupts must be specified, for non-secure and
-secure mode, in that order. For instances that don't support secure mode a
-single interrupt must be specified. Not required for cache IPMMUs.
-
-  - #iommu-cells: Must be 1.
-
-Optional properties:
-
-  - renesas,ipmmu-main: reference to the main IPMMU instance in two cells.
-The first cell is a phandle to the main IPMMU and the second cell is
-the interrupt bit number associated with the particular cache IPMMU device.
-The interrupt bit number needs to match the main IPMMU IMSSTR register.
-Only used by cache IPMMU instances.
-
-
-Each bus master connected to an IPMMU must reference the IPMMU in its device
-node with the following property:
-
-  - iommus: A reference to the IPMMU in two cells. The first cell is a phandle
-to the IPMMU and the second cell the number of the micro-TLB that the
-device is connected to.
-
-
-Example: R8A7791 IPMMU-MX and VSP1-D0 bus master
-
-   ipmmu_mx: mmu@fe951000 {
-   compatible = "renasas,ipmmu-r8a7791", "renasas,ipmmu-vmsa";
-   reg = <0 0xfe951000 0 0x1000>;
-   interrupts = <0 222 IRQ_TYPE_LEVEL_HIGH>,
-<0 221 IRQ_TYPE_LEVEL_HIGH>;
-   #iommu-cells = <1>;
-   };
-
-   vsp@fe928000 {
-   ...
-   iommus = <_mx 13>;
-   ...
-   };
diff --git a/Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.yaml 
b/Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.yaml
new file mode 100644
index ..3820b10
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.yaml
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/renesas,ipmmu-vmsa.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas VMSA-Compatible IOMMU
+
+maintainers:
+  - Yoshihiro Shimoda 
+
+description:
+  The IPMMU is an IOMMU implementation compatible with the ARM VMSA page 
tables.
+  It provides address translation for bus masters outside of the CPU, each
+  connected to the IPMMU through a port called micro-TLB.
+
+properties:
+  compatible:
+oneOf:
+  - items:
+  - enum:
+  - renesas,ipmmu-r8a7743  # RZ/G1M
+ 

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Monday, April 13, 2020 3:55 PM
> 
> > From: Raj, Ashok 
> > Sent: Monday, April 13, 2020 11:11 AM
> >
> > On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > > On Tue, 7 Apr 2020 21:00:21 -0700
> > > "Raj, Ashok"  wrote:
> > >
> > > > Hi Alex
> > > >
> > > > + Bjorn
> > >
> > >  + Don
> > >
> > > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > > in PF.
> > > >
> > > > I'm checking with our internal SIG reps to followup on that.
> > > >
> > > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > > Is there vendor guarantee that hidden registers will locate at the
> > > > > > same offset between PF and VF config space?
> > > > >
> > > > > I'm not sure if the spec really precludes hidden registers, but the
> > > > > fact that these registers are explicitly outside of the capability
> > > > > chain implies they're only intended for device specific use, so I'd 
> > > > > say
> > > > > there are no guarantees about anything related to these registers.
> > > >
> > > > As you had suggested in the other thread, we could consider
> > > > using the same offset as in PF, but even that's a better guess
> > > > still not reliable.
> > > >
> > > > The other option is to maybe extend driver ops in the PF to expose
> > > > where the offsets should be. Sort of adding the quirk in the
> > > > implementation.
> > > >
> > > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is
> > resisting
> > > > making VF's first class citizen, we might ask them to add some verbiage
> > > > to suggest leave the same offsets as PF open to help emulation software.
> > >
> > > Even if we know where to expose these capabilities on the VF, it's not
> > > clear to me how we can actually virtualize the capability itself.  If
> > > the spec defines, for example, an enable bit as r/w then software that
> > > interacts with that register expects the bit is settable.  There's no
> > > protocol for "try to set the bit and re-read it to see if the hardware
> > > accepted it".  Therefore a capability with a fixed enable bit
> > > representing the state of the PF, not settable by the VF, is
> > > disingenuous to the spec.
> >
> > I think we are all in violent agreement. A lot of times the pci spec gets
> > defined several years ahead of real products and no one remembers
> > the justification on why they restricted things the way they did.
> >
> > Maybe someone early product wasn't quite exposing these features to the
> > VF
> > and hence the spec is bug compatible :-)
> >
> > >
> > > If what we're trying to do is expose that PASID and PRI are enabled on
> > > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > > without the ability to control it is not the right approach.  Maybe we
> >
> > As long as the capability enable is only provided when the PF has enabled
> > the feature. Then it seems the hardware seems to do the right thing.
> >
> > Assume we expose PASID/PRI only when PF has enabled it. It will be the
> > case since the PF driver needs to exist, and IOMMU would have set the
> > PASID/PRI/ATS on PF.
> >
> > If the emulation is purely spoofing the capability. Once vIOMMU driver
> > enables PASID, the context entries for the VF are completely independent
> > from the PF context entries.
> >
> > vIOMMU would enable PASID, and we just spoof the PASID capability.
> >
> > If vIOMMU or guest for some reason does disable_pasid(), then the
> > vIOMMU driver can disaable PASID on the VF context entries. So the VF
> > although the capability is blanket enabled on PF, IOMMU gaurantees the
> > transactions are blocked.
> >
> >
> > In the interim, it seems like the intent of the virtual capability
> > can be honored via help from the IOMMU for the controlling aspect..
> >
> > Did i miss anything?
> 
> Above works for emulating the enable bit (under the assumption that
> PF driver won't disable pasid when vf is assigned). However, there are
> also "Execute permission enable" and "Privileged mode enable" bits in
> PASID control registers. I don't know how those bits could be cleanly
> emulated when the guest writes a value different from PF's...

sent too quick. the IOMMU also includes control bits for allowing/
blocking execute requests and supervisor requests. We can rely on 
IOMMU to block those requests to emulate the disabled cases of
all three control bits in the pasid cap.

Thanks
Kevin

> 
> Similar problem also exists when talking about PRI emulation, e.g.
> to enable PRI the software usually waits until the 'stopped' bit
> is set (indicating all previously issued requests have completed). How
> to emulate this bit accurately when one guest toggles the enable bit
> while the PF and other VFs are actively issuing page requests through
> the shared page request interface? from pcie spec I didn't find a way
> to catch when all previously-issued 

RE: [PATCH v1 2/2] vfio/pci: Emulate PASID/PRI capability for VFs

2020-04-13 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Monday, April 13, 2020 11:11 AM
> 
> On Wed, Apr 08, 2020 at 10:19:40AM -0600, Alex Williamson wrote:
> > On Tue, 7 Apr 2020 21:00:21 -0700
> > "Raj, Ashok"  wrote:
> >
> > > Hi Alex
> > >
> > > + Bjorn
> >
> >  + Don
> >
> > > FWIW I can't understand why PCI SIG went different ways with ATS,
> > > where its enumerated on PF and VF. But for PASID and PRI its only
> > > in PF.
> > >
> > > I'm checking with our internal SIG reps to followup on that.
> > >
> > > On Tue, Apr 07, 2020 at 09:58:01AM -0600, Alex Williamson wrote:
> > > > > Is there vendor guarantee that hidden registers will locate at the
> > > > > same offset between PF and VF config space?
> > > >
> > > > I'm not sure if the spec really precludes hidden registers, but the
> > > > fact that these registers are explicitly outside of the capability
> > > > chain implies they're only intended for device specific use, so I'd say
> > > > there are no guarantees about anything related to these registers.
> > >
> > > As you had suggested in the other thread, we could consider
> > > using the same offset as in PF, but even that's a better guess
> > > still not reliable.
> > >
> > > The other option is to maybe extend driver ops in the PF to expose
> > > where the offsets should be. Sort of adding the quirk in the
> > > implementation.
> > >
> > > I'm not sure how prevalent are PASID and PRI in VF devices. If SIG is
> resisting
> > > making VF's first class citizen, we might ask them to add some verbiage
> > > to suggest leave the same offsets as PF open to help emulation software.
> >
> > Even if we know where to expose these capabilities on the VF, it's not
> > clear to me how we can actually virtualize the capability itself.  If
> > the spec defines, for example, an enable bit as r/w then software that
> > interacts with that register expects the bit is settable.  There's no
> > protocol for "try to set the bit and re-read it to see if the hardware
> > accepted it".  Therefore a capability with a fixed enable bit
> > representing the state of the PF, not settable by the VF, is
> > disingenuous to the spec.
> 
> I think we are all in violent agreement. A lot of times the pci spec gets
> defined several years ahead of real products and no one remembers
> the justification on why they restricted things the way they did.
> 
> Maybe someone early product wasn't quite exposing these features to the
> VF
> and hence the spec is bug compatible :-)
> 
> >
> > If what we're trying to do is expose that PASID and PRI are enabled on
> > the PF to a VF driver, maybe duplicating the PF capabilities on the VF
> > without the ability to control it is not the right approach.  Maybe we
> 
> As long as the capability enable is only provided when the PF has enabled
> the feature. Then it seems the hardware seems to do the right thing.
> 
> Assume we expose PASID/PRI only when PF has enabled it. It will be the
> case since the PF driver needs to exist, and IOMMU would have set the
> PASID/PRI/ATS on PF.
> 
> If the emulation is purely spoofing the capability. Once vIOMMU driver
> enables PASID, the context entries for the VF are completely independent
> from the PF context entries.
> 
> vIOMMU would enable PASID, and we just spoof the PASID capability.
> 
> If vIOMMU or guest for some reason does disable_pasid(), then the
> vIOMMU driver can disaable PASID on the VF context entries. So the VF
> although the capability is blanket enabled on PF, IOMMU gaurantees the
> transactions are blocked.
> 
> 
> In the interim, it seems like the intent of the virtual capability
> can be honored via help from the IOMMU for the controlling aspect..
> 
> Did i miss anything?

Above works for emulating the enable bit (under the assumption that 
PF driver won't disable pasid when vf is assigned). However, there are 
also "Execute permission enable" and "Privileged mode enable" bits in 
PASID control registers. I don't know how those bits could be cleanly 
emulated when the guest writes a value different from PF's...

Similar problem also exists when talking about PRI emulation, e.g. 
to enable PRI the software usually waits until the 'stopped' bit
is set (indicating all previously issued requests have completed). How
to emulate this bit accurately when one guest toggles the enable bit
while the PF and other VFs are actively issuing page requests through
the shared page request interface? from pcie spec I didn't find a way
to catch when all previously-issued requests from a specific VF have
completed. Can a conservative big-enough timeout value help here?
I don't know... similar puzzle also exists for emulating the 'reset'
control bit which is supposed to clear the pending request state for
the whole page request interface.

I feel the main problem in pcie spec is that, while they invent SR-IOV
to address I/O virtualization requirement (where strict isolation is
required), they blurred the boundary by leaving some shared resource 
cross PF and VFs which imply