Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Peter Xu
On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> Isolation between applications is preserved but there is no isolation
> between the device and the application itself. The application needs to
> trust the device.
> 
> Examples:
> 
> 1. The device can snoop secret data from readable pages in the
>application's virtual memory space.
> 
> 2. The device can gain arbitrary execution on the CPU by overwriting
>control flow addresses (e.g. function pointers, stack return
>addresses) in writable pages.

To me, SVA seems to be a "middle layer" of security: it's not as safe as
VFIO_IOMMU_MAP_DMA, which gives buffer-level granularity of control (though
of course we pay overhead on buffer setup and on-the-fly translations), but
it's still far better than DMA with no IOMMU at all, which can ruin the whole
host/guest, because after all we do a lot of isolation at the process level.

IMHO it's the same as when we view a VM (or the QEMU process) as a whole
along with the guest code.  In some cases we don't care if the guest does
bad things that mess up its own QEMU process.  It would still be ideal if we
could even stop the guest from doing so, but when it's not easy to do it the
ideal way, we just lower the requirement to not spreading the damage to the
host and other VMs.

Thanks,

-- 
Peter Xu



Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-28 Thread Peter Xu
On Sat, Sep 28, 2019 at 04:23:16PM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> On 9/27/19 1:34 PM, Peter Xu wrote:
> > Hi, Baolu,
> > 
> > On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
> > > > > > > + spin_lock(&(domain)->page_table_lock);  
> > > > > > > \
> > > > > > 
> > > > > > Is this intended to lock here instead of taking the lock during the
> > > > > > whole page table walk?  Is it safe?
> > > > > > 
> > > > > > Taking the example where nm==PTE: when we reach here how do we
> > > > > > guarantee that the PMD page that has this PTE is still valid?
> > > > > 
> > > > > We will always keep the non-leaf pages in the table,
> > > > 
> > > > I see.  Though, could I ask why?  It seems to me that the existing 2nd
> > > > level page table does not keep these when unmapping, and it doesn't
> > > > even use locking at all, by leveraging cmpxchg()?
> > > 
> > > I still need some time to understand how cmpxchg() solves the race issue
> > > when reclaims pages. For example.
> > > 
> > > Thread A  Thread B
> > > -A1: check all PTE's empty-B1: up-level PDE valid
> > > -A2: clear the up-level PDE
> > > -A3: reclaim the page -B2: populate the PTEs
> > > 
> > > Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen.
> > 
> > I'm not sure of this, but IMHO it is similarly because we need to
> > allocate the iova ranges from iova allocator first, so thread A (who's
> > going to unmap pages) and thread B (who's going to map new pages)
> > should never have collapsed regions if happening concurrently.  I'm
> 
> Although they don't collapse, they might share a same pmd entry. If A
> cleared the pmd entry and B goes ahead with populating the pte's. It
> will crash.

My understanding is that if A does not own all the pages under that PMD
entry, then it will never free the page backing that PMD
entry.  Please refer to the code in dma_pte_clear_level() where it
has:

	/* If range covers entire pagetable, free it */
	if (start_pfn <= level_pfn &&
	    last_pfn >= level_pfn + level_size(level) - 1) {
		...
	} else {
		...
	}

Note that when going into the else block, the PMD won't be freed; only
the PTEs under the PMD will be cleared.

In the case you mentioned above, IMHO it should go into that else
block.  That is, thread A's range must not cover the whole PMD,
otherwise thread B could never have been allocated IOVA within the
range covered by that same PMD.
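
To put that reasoning in code form, here is the check I mean as a minimal
sketch (my own restatement, not a copy of dma_pte_clear_level()):

static bool covers_whole_level(unsigned long start_pfn, unsigned long last_pfn,
			       unsigned long level_pfn, unsigned long level_size)
{
	/*
	 * A directory page is only queued for freeing when the unmap range
	 * covers that whole level, so a concurrent mapper that owns part of
	 * the same PMD implies the PMD page itself is never reclaimed.
	 */
	return start_pfn <= level_pfn &&
	       last_pfn >= level_pfn + level_size - 1;
}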

Thanks,

-- 
Peter Xu


Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-26 Thread Peter Xu
Hi, Baolu,

On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
> > > > > + spin_lock(&(domain)->page_table_lock);  
> > > > > \
> > > > 
> > > > Is this intended to lock here instead of taking the lock during the
> > > > whole page table walk?  Is it safe?
> > > > 
> > > > Taking the example where nm==PTE: when we reach here how do we
> > > > guarantee that the PMD page that has this PTE is still valid?
> > > 
> > > We will always keep the non-leaf pages in the table,
> > 
> > I see.  Though, could I ask why?  It seems to me that the existing 2nd
> > level page table does not keep these when unmapping, and it doesn't even
> > use locking at all, by leveraging cmpxchg()?
> 
> I still need some time to understand how cmpxchg() solves the race issue
> when reclaims pages. For example.
> 
> Thread A  Thread B
> -A1: check all PTE's empty-B1: up-level PDE valid
> -A2: clear the up-level PDE
> -A3: reclaim the page -B2: populate the PTEs
> 
> Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen.

I'm not sure of this, but IMHO it is similarly because we need to
allocate the iova ranges from iova allocator first, so thread A (who's
going to unmap pages) and thread B (who's going to map new pages)
should never have collapsed regions if happening concurrently.  I'm
referring to intel_unmap(), in which we don't free the iova region
before domain_unmap() completes (which should cover the whole A1-A3
process), so the iova range being unmapped won't be allocated to
any new pages in some other thread.

There's also a hint in domain_unmap():

  /* we don't need lock here; nobody else touches the iova range */
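
So the ordering I'm relying on is roughly this (paraphrasing intel_unmap()
from memory, not a verbatim copy):

static void unmap_ordering_sketch(struct dmar_domain *domain,
				  unsigned long start_pfn, unsigned long last_pfn)
{
	struct page *freelist;

	/* A1-A3: clear PTEs and collect pages to free; iova still reserved */
	freelist = domain_unmap(domain, start_pfn, last_pfn);

	/* ... IOTLB flush for the range ... */

	/* only now can another thread be handed the same iova range */
	free_iova_fast(&domain->iovad, start_pfn, last_pfn - start_pfn + 1);
	dma_free_pagelist(freelist);
}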

> 
> Actually, the iova allocator always packs IOVA ranges close to the top
> of the address space. This results in requiring a minimal number of
> pages to map the allocated IOVA ranges, which makes memory consumption
> by IOMMU page tables tolerable. Hence, we don't need to reclaim the
> pages until the whole page table is about to tear down. The real data
> on my test machine also improves this.

Do you mean you have run the code on IOMMU hardware that supports the
1st level?  IMHO any data points would be good to include in the cover
letter as a reference.

[...]

> > > > > +static struct page *
> > > > > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> > > > > +   unsigned long addr, unsigned long end,
> > > > > +   struct page *freelist, bool reclaim)
> > > > > +{
> > > > > + int i;
> > > > > + unsigned long start;
> > > > > + pte_t *pte, *first_pte;
> > > > > +
> > > > > + start = addr;
> > > > > + pte = pte_offset_kernel(pmd, addr);
> > > > > + first_pte = pte;
> > > > > + do {
> > > > > + set_pte(pte, __pte(0));
> > > > > + } while (pte++, addr += PAGE_SIZE, addr != end);
> > > > > +
> > > > > + domain_flush_cache(domain, first_pte, (void *)pte - (void 
> > > > > *)first_pte);
> > > > > +
> > > > > + /* Add page to free list if all entries are empty. */
> > > > > + if (reclaim) {
> > > > 
> > > > Shouldn't we be able to tell whether to reclaim from (addr, end) alone,
> > > > as long as they cover the whole range of this PMD?
> > > 
> > > Current policy is that we don't reclaim any pages until the whole page
> > > table will be torn down.
> > 
> > Ah OK.  But I saw that you're passing in reclaim==!start_addr.
> > Shouldn't that erroneously trigger if one wants to unmap a range that
> > includes the 1st page but not the whole address space?
> 
> IOVA 0 is assumed to be reserved by the allocator. Otherwise, we have no
> means to check whether a IOVA is valid.

Is this an assumption of the allocator?  Could that change in the future?

IMHO that assumption isn't necessary anyway; after all, it's as simple as
replacing (!start_addr) with (start == 0 && end == END).  I see that
domain_unmap() has a similar check when freeing the pgd:

  if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw))
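
So the reclaim decision could be expressed in the same spirit, e.g. (just a
sketch; "end_of_table" stands for however the table's top address is
derived, cf. DOMAIN_MAX_PFN() above):

	bool reclaim = (start == 0 && end == end_of_table);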

Thanks,

-- 
Peter Xu


Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-25 Thread Peter Xu
LE));\
> > 
> > It seems to me that PV could trap calls to set_pte().  Then these
> > could also be trapped by e.g. Xen?  Are these traps needed?  Is there
> > side effect?  I'm totally not familiar with this, but just ask aloud...
> 
> Good catch. But I don't think a vIOMMU could get a chance to run in a PV
> environment. I might miss something?

I don't know... Is there a reason not to allow a Xen guest to use 1st
level mapping?

On the other side, if the PV interface will never be used, could
native_set_##nm() be used directly?
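
E.g. something along these lines (only a sketch; iommu_fl_set_pte() is a
made-up wrapper name):

/*
 * set_pte() goes through the x86 pvops layer and can be intercepted by a
 * PV hypervisor; native_set_pte() writes the entry directly.  If the PV
 * path is never wanted for these tables, a direct wrapper could be used.
 */
static inline void iommu_fl_set_pte(pte_t *ptep, pte_t pte)
{
	native_set_pte(ptep, pte);
}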

[...]

> > > +static struct page *
> > > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> > > +   unsigned long addr, unsigned long end,
> > > +   struct page *freelist, bool reclaim)
> > > +{
> > > + int i;
> > > + unsigned long start;
> > > + pte_t *pte, *first_pte;
> > > +
> > > + start = addr;
> > > + pte = pte_offset_kernel(pmd, addr);
> > > + first_pte = pte;
> > > + do {
> > > + set_pte(pte, __pte(0));
> > > + } while (pte++, addr += PAGE_SIZE, addr != end);
> > > +
> > > + domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> > > +
> > > + /* Add page to free list if all entries are empty. */
> > > + if (reclaim) {
> > 
> > Shouldn't we be able to tell whether to reclaim from (addr, end) alone,
> > as long as they cover the whole range of this PMD?
> 
> Current policy is that we don't reclaim any pages until the whole page
> table will be torn down.

Ah OK.  But I saw that you're passing in reclaim==!start_addr.
Shouldn't that erroneously trigger if one wants to unmap a range that
includes the 1st page but not the whole address space?

> The gain is that we don't have to use a
> spinlock when map/unmap a pmd entry anymore.

So this question is also related to the locking above - have you
thought about using the same approach (IIUC) as the 2nd level page
table, i.e. cmpxchg()?  AFAIU that does not need any lock.

For me it's perfectly fine to use a lock at least for the initial
version; I just want to know the considerations behind it in case I
missed anything important.
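
For reference, the lock-free pattern I have in mind is roughly what
pfn_to_dma_pte() does today.  A sketch (not code from this series;
alloc_pgtable_page()/free_pgtable_page() stand for whatever allocator the
1st level code would end up using):

static void install_subtable_sketch(struct dmar_domain *domain, u64 *entry)
{
	void *page = alloc_pgtable_page(domain->nid);
	u64 newval;

	if (!page)
		return;
	newval = virt_to_phys(page) | DMA_PTE_READ | DMA_PTE_WRITE;
	if (cmpxchg64(entry, 0ULL, newval))
		/* lost the race; drop ours and use the winner's table */
		free_pgtable_page(page);
	else
		domain_flush_cache(domain, entry, sizeof(*entry));
}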

Thanks,

-- 
Peter Xu


Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest

2019-09-25 Thread Peter Xu
On Wed, Sep 25, 2019 at 08:02:23AM +, Tian, Kevin wrote:
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, September 25, 2019 3:45 PM
> > 
> > On Wed, Sep 25, 2019 at 07:21:51AM +, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > Sent: Wednesday, September 25, 2019 2:57 PM
> > > >
> > > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > > > Hi Kevin,
> > > > >
> > > > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > > > >   '---'
> > > > > > > > >   '---'
> > > > > > > > >
> > > > > > > > > This patch series only aims to achieve the first goal, a.k.a 
> > > > > > > > > using
> > > > > > first goal? then what are other goals? I didn't spot such 
> > > > > > information.
> > > > > >
> > > > >
> > > > > The overall goal is to use IOMMU nested mode to avoid shadow page
> > > > table
> > > > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe
> > not
> > > > > accurate, but you could get the point.)
> > > > >
> > > > > 1) GIOVA mappings over 1st-level page table;
> > > > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > > > 3) using pIOMMU second level for GPA->HPA translation;
> > > > > 4) enable nested (a.k.a. dual stage) translation in host.
> > > > >
> > > > > This patch set aims to achieve 1).
> > > >
> > > > Would it make sense to use 1st level even for bare-metal to replace
> > > > the 2nd level?
> > > >
> > > > What I'm thinking is the DPDK apps - they have MMU page table already
> > > > there for the huge pages, then if they can use 1st level as the
> > > > default device page table then it even does not need to map, because
> > > > it can simply bind the process root page table pointer to the 1st
> > > > level page root pointer of the device contexts that it uses.
> > > >
> > >
> > > Then you need bear with possible page faults from using CPU page
> > > table, while most devices don't support it today.
> > 
> > Right, I was just thinking aloud.  After all neither do we have IOMMU
> > hardware to support 1st level (or am I wrong?)...  It's just that when
> 
> You are right. Current VT-d supports only 2nd level.
> 
> > the 1st level is ready it should sound doable because IIUC PRI should
> > be always with the 1st level support no matter on IOMMU side or the
> > device side?
> 
> No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's
> just a protocol to trigger page fault, but the device doesn't care whether
> the page fault is on 1st or 2nd level in the IOMMU side. The only
> relevant part is that a PRI request can have PASID tagged or cleared.
> When it's tagged with PASID, the IOMMU will locate the translation
> table under the given PASID (either 1st or 2nd level is fine, according
> to PASID entry setting). When no PASID is included, the IOMMU locates
> the translation from default entry (e.g. PASID#0 or any PASID contained
> in RID2PASID in VT-d).
> 
> Your knowledge happened to be correct in deprecated ECS mode. At
> that time, there is only one 2nd level per context entry which doesn't
> support page fault, and there is only one 1st level per PASID entry which
> supports page fault. Then PRI could be indirectly connected to 1st level,
> but this just changed with new scalable mode.
> 
> Another note is that the PRI capability only indicates that a device is
> capable of handling page faults, but not that a device can tolerate
> page fault for any of its DMA accesses. If the latter is false, using CPU 
> page table for DPDK usage is still risky (and specific to device behavior)
> 
> > 
> > I'm actually not sure about whether my understanding here is
> > correct... I thought the pasid binding previously was only for some
> > vendor kernel drivers but not a general thing to userspace.  I feel
> > like that should be doable in the future once we've got some new
> > syscall interface ready to deliver 1st level page table (e.g., via
> > vfio?) then applications like DPDK seems to be able to use that too
> > even directly via bare metal.
> > 
> 
> using 1st level for userspace is different from supporting DMA page
> fault in userspace. The former is purely about which structure to
> keep the mapping. I think we may do the same thing for both bare
> metal and guest (using 2nd level only for GPA when nested is enabled
> on the IOMMU). But reusing CPU page table for userspace is more
> tricky. :-)

Yes, I must have mixed up the 1st level page table and PRI a bit, and
after all my initial question is irrelevant to this series anyway, so
it's already a bit off topic (sorry for that).

And, thanks for explaining these. :)

-- 
Peter Xu


Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-25 Thread Peter Xu
On Wed, Sep 25, 2019 at 07:32:48AM +, Tian, Kevin wrote:
> > From: Lu Baolu [mailto:baolu...@linux.intel.com]
> > Sent: Wednesday, September 25, 2019 2:52 PM
> > 
> > Hi Peter and Kevin,
> > 
> > On 9/25/19 1:24 PM, Peter Xu wrote:
> > > On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote:
> > >>> From: Peter Xu [mailto:pet...@redhat.com]
> > >>> Sent: Wednesday, September 25, 2019 12:31 PM
> > >>>
> > >>> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > >>>>>> intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > >>>>>
> > >>>>> Maybe think of a different name..? mmmap seems a bit weird :-)
> > >>>>
> > >>>> Yes. I don't like it either. I've thought about it and haven't
> > >>>> figured out a satisfied one. Do you have any suggestions?
> > >>>
> > >>> How about at least split the word using "_"?  Like "mm_map", then
> > >>> apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
> > >>> misread as mmap() which is totally irrelevant to this...
> > >>>
> > >>
> > >> what is the point of keeping 'mm' here? replace it with 'iommu'?
> > >
> > > I'm not sure of what Baolu thought, but to me "mm" makes sense itself
> > > to identify this from real IOMMU page tables (because IIUC these will
> > > be MMU page tables).  We can come up with better names, but IMHO
> > > "iommu" can be a bit misleading to let people refer to the 2nd level
> > > page table.
> > 
> > "mm" represents a CPU (first level) page table;
> > 
> > vs.
> > 
> > "io" represents an IOMMU (second level) page table.
> > 
> 
> IOMMU first level is not equivalent to CPU page table, though you can
> use the latter as the first level (e.g. in SVA). Especially here you are
> making IOVA->GPA as the first level, which is not CPU page table.
> 
> btw both levels are for "io" i.e. DMA purposes from VT-d p.o.v. They
> are just hierarchical structures implemented by VT-d, with slightly
> different format.

Regarding the "slightly different format", do you mean the
extended-accessed bit?

Even if there are differences, they do look very similar.  If you look at
the same table in chapter 9.7, the elements are literally called PTE, PDE,
PDPE, and so on - they're named exactly the same as MMU page table entries.
With that, IMHO it still sounds reasonable to relate this "1st level
iommu page table" to the existing MMU page table using the "mm"
prefix...

Regards,

-- 
Peter Xu


Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest

2019-09-25 Thread Peter Xu
On Wed, Sep 25, 2019 at 07:21:51AM +, Tian, Kevin wrote:
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, September 25, 2019 2:57 PM
> > 
> > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> > > Hi Kevin,
> > >
> > > On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > > > >   '---'
> > > > > > >   '---'
> > > > > > >
> > > > > > > This patch series only aims to achieve the first goal, a.k.a using
> > > > first goal? then what are other goals? I didn't spot such information.
> > > >
> > >
> > > The overall goal is to use IOMMU nested mode to avoid shadow page
> > table
> > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not
> > > accurate, but you could get the point.)
> > >
> > > 1) GIOVA mappings over 1st-level page table;
> > > 2) binding vIOMMU 1st level page table to the pIOMMU;
> > > 3) using pIOMMU second level for GPA->HPA translation;
> > > 4) enable nested (a.k.a. dual stage) translation in host.
> > >
> > > This patch set aims to achieve 1).
> > 
> > Would it make sense to use 1st level even for bare-metal to replace
> > the 2nd level?
> > 
> > What I'm thinking is the DPDK apps - they have MMU page table already
> > there for the huge pages, then if they can use 1st level as the
> > default device page table then it even does not need to map, because
> > it can simply bind the process root page table pointer to the 1st
> > level page root pointer of the device contexts that it uses.
> > 
> 
> Then you need bear with possible page faults from using CPU page
> table, while most devices don't support it today. 

Right, I was just thinking aloud.  After all, we don't have IOMMU
hardware that supports the 1st level yet either (or am I wrong?)...  It's
just that when the 1st level is ready it sounds doable, because IIUC PRI
should always come with 1st level support, no matter whether on the IOMMU
side or the device side?

I'm actually not sure whether my understanding here is
correct... I thought the pasid binding was previously only for some
vendor kernel drivers and not a general thing for userspace.  I feel
like it should be doable in the future once we've got some new
syscall interface ready to deliver the 1st level page table (e.g., via
vfio?); then applications like DPDK seem to be able to use that too,
even directly on bare metal.

Regards,

-- 
Peter Xu


Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest

2019-09-25 Thread Peter Xu
On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote:
> Hi Kevin,
> 
> On 9/24/19 3:00 PM, Tian, Kevin wrote:
> > > > >   '---'
> > > > >   '---'
> > > > > 
> > > > > This patch series only aims to achieve the first goal, a.k.a using
> > first goal? then what are other goals? I didn't spot such information.
> > 
> 
> The overall goal is to use IOMMU nested mode to avoid shadow page table
> and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not
> accurate, but you could get the point.)
> 
> 1) GIOVA mappings over 1st-level page table;
> 2) binding vIOMMU 1st level page table to the pIOMMU;
> 3) using pIOMMU second level for GPA->HPA translation;
> 4) enable nested (a.k.a. dual stage) translation in host.
> 
> This patch set aims to achieve 1).

Would it make sense to use the 1st level even on bare metal, to replace
the 2nd level?

What I'm thinking of is DPDK apps - they already have an MMU page table
there for the huge pages, so if they could use the 1st level as the
default device page table then they wouldn't even need to map anything,
because they could simply bind the process root page table pointer to the
1st level page table root pointer of the device contexts they use.
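
Something like the below is the picture in my head (purely a sketch,
assuming an SVA bind API along the lines of iommu_sva_bind_device(), whose
exact signature has changed across kernel versions):

	struct iommu_sva *handle;

	/*
	 * Bind the current process mm to the device: DMA then walks the
	 * CPU page table, so no per-buffer map/unmap calls are needed.
	 */
	handle = iommu_sva_bind_device(dev, current->mm);
	if (IS_ERR(handle))
		return PTR_ERR(handle);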

Regards,

-- 
Peter Xu


Re: [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table

2019-09-25 Thread Peter Xu
On Mon, Sep 23, 2019 at 08:24:54PM +0800, Lu Baolu wrote:
> +/*
> + * Check and return whether first level is used by default for
> + * DMA translation.
> + */
> +static bool first_level_by_default(void)
> +{
> + struct dmar_drhd_unit *drhd;
> + struct intel_iommu *iommu;
> +
> + rcu_read_lock();
> + for_each_active_iommu(iommu, drhd)
> + if (!sm_supported(iommu) ||
> + !ecap_flts(iommu->ecap) ||
> + !cap_caching_mode(iommu->cap))
> + return false;
> + rcu_read_unlock();
> +
> + return true;
> +}

"If no caching mode, then we will not use 1st level."

Hmm, does the vIOMMU need to support caching mode at all with the
solution you proposed here?  Caching mode is only necessary for
shadowing AFAICT, and after all you're going to use full nesting,
so I would think it's not needed.  And if so, with this patch the
1st level will be disabled.  Sounds like a paradox...

I'm thinking about the big picture for this to work now: for the
vIOMMU, instead of exposing caching mode, maybe we should expose it
with ecap.FLTS=1 while keeping ecap.SLTS=0; then it's natural that we
can only use 1st level translation in the guest for all the domains
(and I assume such an ecap combination should never happen on real
hardware, am I right?).
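
In code terms, what I'm picturing on the guest side is something like this
(a sketch only, not part of this series):

static bool guest_first_level_only(struct intel_iommu *iommu)
{
	/*
	 * A vIOMMU that only implements 1st level translation would
	 * advertise scalable mode with FLTS set and SLTS clear.
	 */
	return sm_supported(iommu) &&
	       ecap_flts(iommu->ecap) &&
	       !ecap_slts(iommu->ecap);
}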

Regards,

-- 
Peter Xu


Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-24 Thread Peter Xu
On Wed, Sep 25, 2019 at 04:38:31AM +, Tian, Kevin wrote:
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, September 25, 2019 12:31 PM
> > 
> > On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > > > > intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > > >
> > > > Maybe think of a different name..? mmmap seems a bit weird :-)
> > >
> > > Yes. I don't like it either. I've thought about it and haven't
> > > figured out a satisfied one. Do you have any suggestions?
> > 
> > How about at least split the word using "_"?  Like "mm_map", then
> > apply it to all the "mmm*" prefixes.  Otherwise it'll be easily
> > misread as mmap() which is totally irrelevant to this...
> > 
> 
> what is the point of keeping 'mm' here? replace it with 'iommu'?

I'm not sure what Baolu thought, but to me "mm" makes sense by itself
to distinguish this from the real IOMMU page tables (because IIUC these
will be MMU page tables).  We can come up with better names, but IMHO
"iommu" can be a bit misleading in that it may make people think of the
2nd level page table.

Regards,

-- 
Peter Xu


Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-24 Thread Peter Xu
set_pte(pte, pfn_pte(pfn, prot));
> + pfn++;
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> +
> + return 0;
> +}
> +
> +static int
> +mmmap_pmd_range(struct dmar_domain *domain, pud_t *pud, unsigned long addr,
> + unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> + unsigned long next;
> + pmd_t *pmd;
> +
> + if (unlikely(pud_none(*pud)))
> + pgtable_populate(domain, pud);
> + pmd = pmd_offset(pud, addr);
> +
> + phys_addr -= addr;
> + do {
> + next = pmd_addr_end(addr, end);
> + if (mmmap_pte_range(domain, pmd, addr, next,
> + phys_addr + addr, prot))
> + return -ENOMEM;

How about returning the error code of mmmap_pte_range() directly?
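E.g. (just a sketch of what I mean):

		int ret = mmmap_pte_range(domain, pmd, addr, next,
					  phys_addr + addr, prot);

		if (ret)
			return ret;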

> + } while (pmd++, addr = next, addr != end);
> +
> + return 0;
> +}
> +
> +static int
> +mmmap_pud_range(struct dmar_domain *domain, p4d_t *p4d, unsigned long addr,
> + unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> + unsigned long next;
> + pud_t *pud;
> +
> + if (unlikely(p4d_none(*p4d)))
> + pgtable_populate(domain, p4d);
> +
> + pud = pud_offset(p4d, addr);
> +
> + phys_addr -= addr;
> + do {
> + next = pud_addr_end(addr, end);
> + if (mmmap_pmd_range(domain, pud, addr, next,
> + phys_addr + addr, prot))
> + return -ENOMEM;

Same.

> + } while (pud++, addr = next, addr != end);
> +
> + return 0;
> +}
> +
> +static int
> +mmmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd, unsigned long addr,
> + unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
> +{
> + unsigned long next;
> + p4d_t *p4d;
> +
> + if (cpu_feature_enabled(X86_FEATURE_LA57) && unlikely(pgd_none(*pgd)))
> + pgtable_populate(domain, pgd);
> +
> + p4d = p4d_offset(pgd, addr);
> +
> + phys_addr -= addr;
> + do {
> + next = p4d_addr_end(addr, end);
> + if (mmmap_pud_range(domain, p4d, addr, next,
> + phys_addr + addr, prot))
> + return -ENOMEM;

Same.

> + } while (p4d++, addr = next, addr != end);
> +
> + return 0;
> +}
> +
> +int intel_mmmap_range(struct dmar_domain *domain, unsigned long addr,
> +   unsigned long end, phys_addr_t phys_addr, int dma_prot)
> +{
> + unsigned long next;
> + pgprot_t prot;
> + pgd_t *pgd;
> +
> + trace_domain_mm_map(domain, addr, end, phys_addr);
> +
> + /*
> +  * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW
> +  * for a pte that requires write operation.
> +  */
> + prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO;
> + BUG_ON(addr >= end);
> +
> + phys_addr -= addr;
> + pgd = pgd_offset_pgd(domain->pgd, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (mmmap_p4d_range(domain, pgd, addr, next,
> + phys_addr + addr, prot))
> + return -ENOMEM;

Same.

> + } while (pgd++, addr = next, addr != end);
> +
> + return 0;
> +}
> +
> +/*
> + * mmunmap: Unmap an existing mapping between a range of IO vitual address
> + *  and physical addresses.
> + */
> +static struct page *
> +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> +   unsigned long addr, unsigned long end,
> +   struct page *freelist, bool reclaim)
> +{
> + int i;
> + unsigned long start;
> + pte_t *pte, *first_pte;
> +
> + start = addr;
> + pte = pte_offset_kernel(pmd, addr);
> + first_pte = pte;
> + do {
> + set_pte(pte, __pte(0));
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> +
> + /* Add page to free list if all entries are empty. */
> + if (reclaim) {

Shouldn't we be able to tell whether to reclaim from (addr, end) alone,
as long as they cover the whole range of this PMD?

Also, I noticed that this value is currently passed in from the very
top of the unmap() call.  I didn't really catch the point of that...

I'll have a similar question in a few places below, but I'll hold off
commenting on those until I understand this one.
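
Just to be concrete about what I mean (sketch; "start" and "end" are the
same variables this function already has):

	/* reclaimable iff this call spans the whole PMD */
	bool covers_whole_pmd = ((start & ~PMD_MASK) == 0) &&
				(end == start + PMD_SIZE);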

> + struct page *pte_page;
> +
> + pte = (pte_t *)pmd_page_vaddr(*pmd);
> + for (i = 0; i < PTRS_PER_PTE; i++)
> + if (!pte || !pte_none(pte[i]))
> + goto pte_out;
> +
> + pte_page = pmd_page(*pmd);
> + pte_page->freelist = freelist;
> + freelist = pte_page;
> + pmd_clear(pmd);
> + domain_flush_cache(domain, pmd, sizeof(pmd_t));
> + }
> +
> +pte_out:
> + return freelist;
> +}

Regards,

-- 
Peter Xu


Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

2019-09-24 Thread Peter Xu
On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote:
> > > intel_mmmap_range(domain, addr, end, phys_addr, prot)
> > 
> > Maybe think of a different name..? mmmap seems a bit weird :-)
> 
> Yes. I don't like it either. I've thought about it and haven't
> figured out a satisfied one. Do you have any suggestions?

How about at least splitting the word with "_"?  Like "mm_map", then
apply that to all the "mmm*" prefixes.  Otherwise it'll easily be
misread as mmap(), which is totally irrelevant to this...

Regards,

-- 
Peter Xu


[PATCH] Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"

2019-06-20 Thread Peter Xu
This reverts commit 7560cc3ca7d9d11555f80c830544e463fcdb28b8.

With 5.2.0-rc5 I can easily trigger this with lockdep and iommu=pt:

==
WARNING: possible circular locking dependency detected
5.2.0-rc5 #78 Not tainted
--
swapper/0/1 is trying to acquire lock:
ea2b3beb (&(&iommu->lock)->rlock){+.+.}, at: domain_context_mapping_one+0xa5/0x4e0
but task is already holding lock:
a681907b (device_domain_lock){}, at: domain_context_mapping_one+0x8d/0x4e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (device_domain_lock){}:
   _raw_spin_lock_irqsave+0x3c/0x50
   dmar_insert_one_dev_info+0xbb/0x510
   domain_add_dev_info+0x50/0x90
   dev_prepare_static_identity_mapping+0x30/0x68
   intel_iommu_init+0xddd/0x1422
   pci_iommu_init+0x16/0x3f
   do_one_initcall+0x5d/0x2b4
   kernel_init_freeable+0x218/0x2c1
   kernel_init+0xa/0x100
   ret_from_fork+0x3a/0x50
-> #0 (&(&iommu->lock)->rlock){+.+.}:
   lock_acquire+0x9e/0x170
   _raw_spin_lock+0x25/0x30
   domain_context_mapping_one+0xa5/0x4e0
   pci_for_each_dma_alias+0x30/0x140
   dmar_insert_one_dev_info+0x3b2/0x510
   domain_add_dev_info+0x50/0x90
   dev_prepare_static_identity_mapping+0x30/0x68
   intel_iommu_init+0xddd/0x1422
   pci_iommu_init+0x16/0x3f
   do_one_initcall+0x5d/0x2b4
   kernel_init_freeable+0x218/0x2c1
   kernel_init+0xa/0x100
   ret_from_fork+0x3a/0x50

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(device_domain_lock);
                               lock(&(&iommu->lock)->rlock);
                               lock(device_domain_lock);
  lock(&(&iommu->lock)->rlock);

 *** DEADLOCK ***
2 locks held by swapper/0/1:
 #0: 033eb13d (dmar_global_lock){}, at: intel_iommu_init+0x1e0/0x1422
 #1: a681907b (device_domain_lock){}, at: domain_context_mapping_one+0x8d/0x4e0

stack backtrace:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5 #78
Hardware name: LENOVO 20KGS35G01/20KGS35G01, BIOS N23ET50W (1.25 ) 06/25/2018
Call Trace:
 dump_stack+0x85/0xc0
 print_circular_bug.cold.57+0x15c/0x195
 __lock_acquire+0x152a/0x1710
 lock_acquire+0x9e/0x170
 ? domain_context_mapping_one+0xa5/0x4e0
 _raw_spin_lock+0x25/0x30
 ? domain_context_mapping_one+0xa5/0x4e0
 domain_context_mapping_one+0xa5/0x4e0
 ? domain_context_mapping_one+0x4e0/0x4e0
 pci_for_each_dma_alias+0x30/0x140
 dmar_insert_one_dev_info+0x3b2/0x510
 domain_add_dev_info+0x50/0x90
 dev_prepare_static_identity_mapping+0x30/0x68
 intel_iommu_init+0xddd/0x1422
 ? printk+0x58/0x6f
 ? lockdep_hardirqs_on+0xf0/0x180
 ? do_early_param+0x8e/0x8e
 ? e820__memblock_setup+0x63/0x63
 pci_iommu_init+0x16/0x3f
 do_one_initcall+0x5d/0x2b4
 ? do_early_param+0x8e/0x8e
 ? rcu_read_lock_sched_held+0x55/0x60
 ? do_early_param+0x8e/0x8e
 kernel_init_freeable+0x218/0x2c1
 ? rest_init+0x230/0x230
 kernel_init+0xa/0x100
 ret_from_fork+0x3a/0x50

domain_context_mapping_one() is taking device_domain_lock first then
iommu lock, while dmar_insert_one_dev_info() is doing the reverse.

That should be introduced by commit:

7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and
  device_domain_lock", 2019-05-27)

So far I still cannot figure out how the previous deadlock was
triggered (I cannot find a path where the iommu lock is taken before the
call to iommu_flush_dev_iotlb()).  However, I'm pretty sure that change
is incomplete at the very least, because it does not fix all the places,
so we're still taking the locks in different orders.  Reverting the
commit is the cleaner option to me for now, so that we always take
device_domain_lock first and then the iommu lock.

We can continue to look for the real culprit mentioned in
7560cc3ca7d9, but for now I think we should revert it to fix the current
breakage.

CC: Joerg Roedel 
CC: Lu Baolu 
CC: dave.ji...@intel.com
Signed-off-by: Peter Xu 
---
 drivers/iommu/intel-iommu.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 56297298d6ee..162b3236e72c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2504,7 +2504,6 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
 		}
 	}
 
-	spin_lock(&iommu->lock);
 	spin_lock_irqsave(&device_domain_lock, flags);
 	if (dev)

Re: VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)

2019-06-20 Thread Peter Xu
On Fri, Jun 21, 2019 at 09:58:28AM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> I agree with you that 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between
> iommu->lock and device_domain_lock") isn't a good fix. There
> is also another thread, https://lkml.org/lkml/2019/6/18/996, which
> reported this.
> 
> I think we can revert this patch now. I will try to reproduce the
> original issue and try to find a new fix.
> 
> Can you please submit the revert patch?

Sure, I'll post a formal patch soon.  Thanks,

-- 
Peter Xu


VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)

2019-06-20 Thread Peter Xu
2
Jun 20 14:37:37 xz-x1 kernel:  ? printk+0x58/0x6f
Jun 20 14:37:37 xz-x1 kernel:  ? lockdep_hardirqs_on+0xf0/0x180
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  ? e820__memblock_setup+0x63/0x63
Jun 20 14:37:37 xz-x1 kernel:  pci_iommu_init+0x16/0x3f
Jun 20 14:37:37 xz-x1 kernel:  do_one_initcall+0x5d/0x2b4
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  ? rcu_read_lock_sched_held+0x55/0x60
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  kernel_init_freeable+0x218/0x2c1
Jun 20 14:37:37 xz-x1 kernel:  ? rest_init+0x230/0x230
Jun 20 14:37:37 xz-x1 kernel:  kernel_init+0xa/0x100
Jun 20 14:37:37 xz-x1 kernel:  ret_from_fork+0x3a/0x50


domain_context_mapping_one() is taking device_domain_lock first then
iommu lock, while dmar_insert_one_dev_info() is doing the reverse.
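
To spell out the inversion, the two paths reduce to roughly this (a sketch
of just the lock order, not the real function bodies):

static void order_in_domain_context_mapping_one(struct intel_iommu *iommu)
{
	unsigned long flags;

	spin_lock_irqsave(&device_domain_lock, flags);
	spin_lock(&iommu->lock);
	/* ... */
	spin_unlock(&iommu->lock);
	spin_unlock_irqrestore(&device_domain_lock, flags);
}

static void order_in_dmar_insert_one_dev_info(struct intel_iommu *iommu)
{
	unsigned long flags;

	spin_lock(&iommu->lock);
	spin_lock_irqsave(&device_domain_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&device_domain_lock, flags);
	spin_unlock(&iommu->lock);
}

Run concurrently, each side can take its first lock and then wait forever
on the lock the other side already holds.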

I dug a bit and saw this commit, which seems suspicious:

7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and
  device_domain_lock", 2019-05-27)

More interestingly, it was trying to fix the inverted deadlock...

The thing is that even if the above commit is correct on the ordering (I
still find it strange that we would need to take a per-iommu lock before
another global lock), that commit seems to be an incomplete fix,
because there are still other places taking the locks the other way round.

When I read deeper into that commit message, it seems to be telling me
that before reaching iommu_flush_dev_iotlb() we already hold the iommu
lock somewhere, but I cannot really understand how that happens because I
cannot find a path where the iommu lock is taken when reaching
iommu_flush_dev_iotlb().  So I cannot understand how that lockdep
warning could trigger.

I reverted that commit and now everything is good here (no long runs yet,
but at least the previous deadlock is fixed).  And with that, IMHO
we actually have consistent ordering across the whole tree: we always
take device_domain_lock before the per-iommu lock.

Is there anything I've missed about why we needed 7560cc3ca7d9?

Thanks,

-- 
Peter Xu


Re: What is the meaning of PASID_MIN?

2019-02-11 Thread Peter Xu
On Tue, Feb 12, 2019 at 10:44:23AM +0800, Lu Baolu wrote:
> Hi,
> 
> On 2/12/19 3:28 AM, Matthew Wilcox wrote:
> > 
> > I'm looking at commit 562831747f6299abd481b5b00bd4fa19d5c8a259
> > which fails to adequately explain why we can't use PASID 0.  Commit
> > af39507305fb83a5d3c475c2851f4d59545d8a18 also doesn't explain why PASID
> > 0 is no longer usable for the intel-svm driver.
> 
> Sorry that we didn't make it clear.
> 
> > 
> > There are a load of simplifications that could be made to this, but I
> > don't know which ones to suggest without a clear understanding of the
> > problem you're actually trying to solve.
> > 
> 
> PASID 0 has been reserved by Intel IOMMU driver for RID_PASID purpose.
> 
> VT-d scalable mode treats all address translation as PASID granularity.
> For DMA requests-with-PASID, the PASID value in the DMA request will be
> used. For DMA requests-without-PASID, VT-d will use a static PASID value
> specified in the RID_PASID field of the context entry. PASID 0 has been
> reserved for this usage for all devices.
> 
> (Please refer to 9.4 of the spec 3.0 for more details.)

Hi, Baolu,

I have a similar confusion.

If PASID==0 is reserved for requests-without-PASID, does it mean
that for each scalable mode context entry the RID_PASID field will
always be zero?  Or, put differently: since we already have the
per-context-entry RID_PASID field, which seems to be configurable, why
do we still need to reserve PASID==0?

Thanks,

-- 
Peter Xu


Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier

2019-02-01 Thread Peter Xu
On Fri, Feb 01, 2019 at 11:46:00AM +, Jean-Philippe Brucker wrote:
> On 01/02/2019 03:51, Peter Xu wrote:
> > On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote:
> >> On 31/01/2019 07:59, Peter Xu wrote:
> >>> On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote:
> >>>> Hi Peter,
> >>>
> >>> Hi, Jean,
> >>>
> >>>>
> >>>> On 30/01/2019 05:57, Peter Xu wrote:
> >>>>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> >>>>> but that's actually already covered by invalidate_range().  Remove the
> >>>>> extra notifier and the chunks.
> >>>>
> >>>> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> >>>> I understood correctly, when doing reclaim the kernel clears the young
> >>>> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> >>>> new accesses from devices will go through the IOMMU, set the young bit
> >>>> again and the kernel can accurately track page use. I didn't see
> >>>> invalidate_range() being called by rmap or vmscan in this case, but
> >>>> might just be missing it.
> >>>>
> >>>> Two MMU notifiers are used for the young bit, clear_flush_young() and
> >>>> clear_young(). I believe the former should invalidate ATCs so that DMA
> >>>> accesses participate in PTE aging. Otherwise the kernel can't know that
> >>>> the device is still using entries marked 'old'. The latter,
> >>>> clear_young() is exempted from invalidating the secondary TLBs by
> >>>> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
> >>>
> >>> Yes I agree most of your analysis, but IMHO the problem is that the
> >>> AMD IOMMU is not really participating in the PTE aging after all.
> >>> Please have a look at mn_clear_flush_young() below at [1] - it's
> >>> always returning zero, no matter whether the page has been accessed by
> >>> device or not.  A real user of it could be someone like KVM (please
> >>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
> >>> shadow PTEs and do test-and-clear on that, then return the result to
> >>> the core mm.  That's why I think currently the AMD driver was only
> >>> trying to use that as a way to do an extra flush however IMHO it's
> >>> redundant.
> >>
> >> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
> >> flush, since the level-1 page table is shared with the CPU. But removing
> >> the flush gets you in the following situation:
> >>
> >>   (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
> >> young and the entry is cached in the ATC.
> >>
> >>   (2) The CPU does ptep_clear_flush_young_notify(), clears young,
> >> notices that the page is being used.
> >>
> >>   (3) Device accesses $addr again. Given that we didn't invalidate the
> >> ATC in (2) it accesses the page directly, without going through the IOMMU.
> >>
> >>   (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
> >> have the young bit, which means the page isn't being used and can be
> >> reclaimed.
> >>
> >> That's not accurate since the page is being used by the device. At step
> >> (2) we should invalidate the ATC, so that (3) fetches the PTE again and
> >> marks it young.
> >>
> >> I can agree that the clear_flush_young() notifier is too brutal for this
> >> purpose, since we send spurious ATC invalidation even when the PTE
> >> wasn't young (and ATC inv is expensive). There should be another MMU
> >> notifier "flush_young()" that is called by
> >> ptep_clear_flush_young_notify() only when the page was actually young.
> >> But for the moment it's the best we have to avoid the situation above.
> >>
> >> I don't know enough about mm to understand exactly how the
> >> page_referenced() information is used, but I believe the problem is only
> >> about accuracy and not correctness. So applying this patch might not
> >> break anything (after all, intel-svm.c never implemented the notifier)
> >> but I think I'll keep the notifier in my SVA rework [1].
> > 
> > I see your point.  I'm not an expert of mm either, but I'd say AFAIU
> > this happens in CPU TLB too.

Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier

2019-01-31 Thread Peter Xu
On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote:
> On 31/01/2019 07:59, Peter Xu wrote:
> > On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote:
> >> Hi Peter,
> > 
> > Hi, Jean,
> > 
> >>
> >> On 30/01/2019 05:57, Peter Xu wrote:
> >>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> >>> but that's actually already covered by invalidate_range().  Remove the
> >>> extra notifier and the chunks.
> >>
> >> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> >> I understood correctly, when doing reclaim the kernel clears the young
> >> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> >> new accesses from devices will go through the IOMMU, set the young bit
> >> again and the kernel can accurately track page use. I didn't see
> >> invalidate_range() being called by rmap or vmscan in this case, but
> >> might just be missing it.
> >>
> >> Two MMU notifiers are used for the young bit, clear_flush_young() and
> >> clear_young(). I believe the former should invalidate ATCs so that DMA
> >> accesses participate in PTE aging. Otherwise the kernel can't know that
> >> the device is still using entries marked 'old'. The latter,
> >> clear_young() is exempted from invalidating the secondary TLBs by
> >> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
> > 
> > Yes I agree most of your analysis, but IMHO the problem is that the
> > AMD IOMMU is not really participating in the PTE aging after all.
> > Please have a look at mn_clear_flush_young() below at [1] - it's
> > always returning zero, no matter whether the page has been accessed by
> > device or not.  A real user of it could be someone like KVM (please
> > see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
> > shadow PTEs and do test-and-clear on that, then return the result to
> > the core mm.  That's why I think currently the AMD driver was only
> > trying to use that as a way to do an extra flush however IMHO it's
> > redundant.
> 
> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
> flush, since the level-1 page table is shared with the CPU. But removing
> the flush gets you in the following situation:
> 
>   (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
> young and the entry is cached in the ATC.
> 
>   (2) The CPU does ptep_clear_flush_young_notify(), clears young,
> notices that the page is being used.
> 
>   (3) Device accesses $addr again. Given that we didn't invalidate the
> ATC in (2) it accesses the page directly, without going through the IOMMU.
> 
>   (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
> have the young bit, which means the page isn't being used and can be
> reclaimed.
> 
> That's not accurate since the page is being used by the device. At step
> (2) we should invalidate the ATC, so that (3) fetches the PTE again and
> marks it young.
> 
> I can agree that the clear_flush_young() notifier is too brutal for this
> purpose, since we send spurious ATC invalidation even when the PTE
> wasn't young (and ATC inv is expensive). There should be another MMU
> notifier "flush_young()" that is called by
> ptep_clear_flush_young_notify() only when the page was actually young.
> But for the moment it's the best we have to avoid the situation above.
> 
> I don't know enough about mm to understand exactly how the
> page_referenced() information is used, but I believe the problem is only
> about accuracy and not correctness. So applying this patch might not
> break anything (after all, intel-svm.c never implemented the notifier)
> but I think I'll keep the notifier in my SVA rework [1].

I see your point.  I'm not an expert in mm either, but I'd say AFAIU
the same thing happens with the CPU TLB too.  Please have a look at
ptep_clear_flush_young():

int ptep_clear_flush_young(struct vm_area_struct *vma,
   unsigned long address, pte_t *ptep)
{
/*
 * On x86 CPUs, clearing the accessed bit without a TLB flush
 * doesn't cause data corruption. [ It could cause incorrect
 * page aging and the (mistaken) reclaim of hot pages, but the
 * chance of that should be relatively low. ]
 *
 * So as a performance optimization don't flush the TLB when
 * clearing the accessed bit, it will eventually be flushed by
 * a context switch or a VM operation anyway. [ In the rare
 * event of it not getting flushed for a long time the delay
 * shouldn't really matter bec

Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier

2019-01-31 Thread Peter Xu
On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote:
> Hi Peter,

Hi, Jean,

> 
> On 30/01/2019 05:57, Peter Xu wrote:
> > AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> > but that's actually already covered by invalidate_range().  Remove the
> > extra notifier and the chunks.
> 
> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> I understood correctly, when doing reclaim the kernel clears the young
> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> new accesses from devices will go through the IOMMU, set the young bit
> again and the kernel can accurately track page use. I didn't see
> invalidate_range() being called by rmap or vmscan in this case, but
> might just be missing it.
> 
> Two MMU notifiers are used for the young bit, clear_flush_young() and
> clear_young(). I believe the former should invalidate ATCs so that DMA
> accesses participate in PTE aging. Otherwise the kernel can't know that
> the device is still using entries marked 'old'. The latter,
> clear_young() is exempted from invalidating the secondary TLBs by
> mmu_notifier.h and the IOMMU driver doesn't need to implement it.

Yes, I agree with most of your analysis, but IMHO the problem is that the
AMD IOMMU is not really participating in the PTE aging after all.
Please have a look at mn_clear_flush_young() below at [1] - it
always returns zero, no matter whether the page has been accessed by the
device or not.  A real user of this hook would be someone like KVM (please
see kvm_mmu_notifier_clear_flush_young), where we look up the
shadow PTEs, do a test-and-clear on them, and then return the result to
the core mm.  That's why I think the AMD driver is currently only
using it as a way to do an extra flush, which IMHO is
redundant.
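
For comparison, a clear_flush_young() that really participated in aging
would look more like the sketch below, using the same callback signature
(iommu_test_and_clear_accessed() is a made-up helper for whatever hardware
accessed-bit query would be needed):

static int mn_clear_flush_young_sketch(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	int young = 0;

	for (; start < end; start += PAGE_SIZE) {
		/* test-and-clear whatever accessed state the device set */
		young |= iommu_test_and_clear_accessed(mn, start);
		__mn_flush_page(mn, start);
	}

	/* non-zero tells the core mm the device recently used the range */
	return young;
}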

Thanks,

> > -static int mn_clear_flush_young(struct mmu_notifier *mn,
> > -   struct mm_struct *mm,
> > -   unsigned long start,
> > -   unsigned long end)
> > -{
> > -   for (; start < end; start += PAGE_SIZE)
> > -   __mn_flush_page(mn, start);
> > -
> > -   return 0;

[1]

-- 
Peter Xu


[PATCH 1/2] iommu/vt-d: Remove change_pte notifier

2019-01-29 Thread Peter Xu
The change_pte() interface is tailored for PFN updates, while the
other notifier invalidate_range() should be enough for Intel IOMMU
cache flushing.  Actually we already did a similar thing for the AMD
IOMMU in 8301da53fbc1 ("iommu/amd: Remove change_pte mmu_notifier
call-back", 2014-07-30), but the Intel IOMMU driver still has it.

Signed-off-by: Peter Xu 
---
 drivers/iommu/intel-svm.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index a2a2aa4439aa..e9fd3ca057ac 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -180,14 +180,6 @@ static void intel_flush_svm_range(struct intel_svm *svm, 
unsigned long address,
rcu_read_unlock();
 }
 
-static void intel_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
-unsigned long address, pte_t pte)
-{
-   struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
-
-   intel_flush_svm_range(svm, address, 1, 1, 0);
-}
-
 /* Pages have been freed at this point */
 static void intel_invalidate_range(struct mmu_notifier *mn,
   struct mm_struct *mm,
@@ -227,7 +219,6 @@ static void intel_mm_release(struct mmu_notifier *mn, 
struct mm_struct *mm)
 
 static const struct mmu_notifier_ops intel_mmuops = {
.release = intel_mm_release,
-   .change_pte = intel_change_pte,
.invalidate_range = intel_invalidate_range,
 };
 
-- 
2.17.1



[PATCH 2/2] iommu/amd: Remove clear_flush_young notifier

2019-01-29 Thread Peter Xu
The AMD IOMMU driver is using clear_flush_young() to do cache flushing,
but that's actually already covered by invalidate_range().  Remove the
extra notifier and the related chunks.

Signed-off-by: Peter Xu 
---
 drivers/iommu/amd_iommu_v2.c | 24 
 1 file changed, 24 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 23dae9348ace..5d7ef750e4a0 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -370,29 +370,6 @@ static struct pasid_state *mn_to_state(struct mmu_notifier 
*mn)
return container_of(mn, struct pasid_state, mn);
 }
 
-static void __mn_flush_page(struct mmu_notifier *mn,
-   unsigned long address)
-{
-   struct pasid_state *pasid_state;
-   struct device_state *dev_state;
-
-   pasid_state = mn_to_state(mn);
-   dev_state   = pasid_state->device_state;
-
-   amd_iommu_flush_page(dev_state->domain, pasid_state->pasid, address);
-}
-
-static int mn_clear_flush_young(struct mmu_notifier *mn,
-   struct mm_struct *mm,
-   unsigned long start,
-   unsigned long end)
-{
-   for (; start < end; start += PAGE_SIZE)
-   __mn_flush_page(mn, start);
-
-   return 0;
-}
-
 static void mn_invalidate_range(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end)
@@ -430,7 +407,6 @@ static void mn_release(struct mmu_notifier *mn, struct 
mm_struct *mm)
 
 static const struct mmu_notifier_ops iommu_mn = {
.release= mn_release,
-   .clear_flush_young  = mn_clear_flush_young,
.invalidate_range   = mn_invalidate_range,
 };
 
-- 
2.17.1



[PATCH 0/2] Some MMU notifier cleanups for Intel/AMD IOMMU

2019-01-29 Thread Peter Xu
Recently when I'm reading the mmu notifiers I noticed that both
Intel/AMD IOMMU drivers seem to have redundancies in using the MMU
notifiers.  It can also be seen as a follow up of commit 8301da53fbc1
("iommu/amd: Remove change_pte mmu_notifier call-back", 2014-07-30).

I don't have the hardware to test them, but they compile fine.

Please have a look, thanks.

Peter Xu (2):
  iommu/vt-d: Remove change_pte notifier
  iommu/amd: Remove clear_flush_young notifier

 drivers/iommu/amd_iommu_v2.c | 24 
 drivers/iommu/intel-svm.c|  9 -
 2 files changed, 33 deletions(-)

-- 
2.17.1



Re: [PATCH v5 3/9] iommu/vt-d: Apply global PASID in SVA

2018-07-15 Thread Peter Xu
On Sat, Jul 14, 2018 at 03:46:56PM +0800, Lu Baolu wrote:
> This patch applies the global pasid name space in the shared
> virtual address (SVA) implementation.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Liu Yi L 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Kevin Tian 
> Reviewed-by: Liu Yi L 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v5 2/9] iommu/vt-d: Avoid using idr_for_each_entry()

2018-07-15 Thread Peter Xu
On Sat, Jul 14, 2018 at 03:46:55PM +0800, Lu Baolu wrote:
> idr_for_each_entry() is used to iteratte over idr elements
> of a given type. It isn't suitable for the globle pasid idr
> since the pasid idr consumer could specify different types
> of pointers to bind with a pasid.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Liu Yi L 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Kevin Tian 
> Reviewed-by: Liu Yi L 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v5 1/9] iommu/vt-d: Global PASID name space

2018-07-15 Thread Peter Xu
On Sat, Jul 14, 2018 at 03:46:54PM +0800, Lu Baolu wrote:
> This adds the system wide PASID name space for the PASID
> allocation. Currently we are using per IOMMU PASID name
> spaces which are not suitable for some use cases. For an
> example, one application (associated with a PASID) might
> talk to two physical devices simultaneously while the two
> devices could reside behind two different IOMMU units.
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Kevin Tian 
> Cc: Liu Yi L 
> Suggested-by: Ashok Raj 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Kevin Tian 
> Reviewed-by: Liu Yi L 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v4 9/9] iommu/vt-d: Remove the obsolete per iommu pasid tables

2018-07-12 Thread Peter Xu
On Fri, Jul 13, 2018 at 09:34:30AM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> Do you have further comments for this series? I will work out a
> new version if there are no ones.

Not for now.  Though since you mentioned the work to remove the
pasid state table, I'm not sure whether it would be nice to put that into
this series, or even to add more patches that are directly related
(e.g. the two-level pasid table work and anything similar),
so the series could be more complete.

> 
> Thank you for reviewing my patches. Do you allow me to add your
> reviewed-by to the patches which you've reviewed?

I'm glad to read your work.  I didn't leave r-bs because I'm not that
familiar with the code, so I wasn't confident enough to leave any and
only raised questions.  Please feel free to post a new version as you
wish; I'll try to leave r-bs there if possible.

Thanks,

-- 
Peter Xu


Re: [PATCH v4 6/9] iommu/vt-d: Per PCI device pasid table interfaces

2018-07-11 Thread Peter Xu
On Wed, Jul 11, 2018 at 03:26:21PM +0800, Lu Baolu wrote:

[...]

> >> +int intel_pasid_alloc_table(struct device *dev)
> >> +{
> >> +  struct device_domain_info *info;
> >> +  struct pasid_table *pasid_table;
> >> +  struct pasid_table_opaque data;
> >> +  struct page *pages;
> >> +  size_t size, count;
> >> +  int ret, order;
> >> +
> >> +  info = dev->archdata.iommu;
> >> +  if (WARN_ON(!info || !dev_is_pci(dev) ||
> >> +  !info->pasid_supported || info->pasid_table))
> >> +  return -EINVAL;
> >> +
> >> +  /* DMA alias device already has a pasid table, use it: */
> >> +  data.pasid_table = &pasid_table;
> >> +  ret = pci_for_each_dma_alias(to_pci_dev(dev),
> >> +   &get_alias_pasid_table, &data);
> >> +  if (ret)
> >> +  goto attach_out;
> >> +
> >> +  pasid_table = kzalloc(sizeof(*pasid_table), GFP_ATOMIC);
> > Do we need to take some lock here (e.g., the pasid lock)?  Otherwise
> > what if two devices (that are sharing the same DMA alias) call the
> > function intel_pasid_alloc_table() concurrently, then could it be
> > possible that we create one table for each of the devices, while AFAIU
> > we should let them share a single pasid table?
> 
> The only place where this function is called is in a single-thread context
> (protected by the device_domain_lock spinlock with local interrupts disabled).
> 
> So we don't need an extra lock here. But anyway, I should put a comment
> here.

Yeah, that would be nice too!  Or add a comment for both of the
functions:

  /* Must be with device_domain_lock held */
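
Or, purely as an illustration of the same idea (not something from the
patch; assert_spin_locked() is the existing kernel helper), the
requirement could also be asserted at runtime:

    int intel_pasid_alloc_table(struct device *dev)
    {
            /* caller must hold device_domain_lock */
            assert_spin_locked(&device_domain_lock);
            /* ... existing allocation logic ... */
            return 0;
    }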

Regards,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 1/9] iommu/vt-d: Global PASID name space

2018-07-10 Thread Peter Xu
On Mon, Jul 09, 2018 at 01:22:50PM +0800, Lu Baolu wrote:

[...]

> +#ifndef __INTEL_PASID_H
> +#define __INTEL_PASID_H
> +
> +#define PASID_MIN  0x1
> +#define PASID_MAX  0x20000

Could I ask whether there's a reason to explicitly use 0x20000 for the
max value?  Asked since I saw that the example in the spec gave 20
bits for PASID (please refer to spec ver 3.0 section 3.4.3 figure
3-8).  Also I believe that's what I was told by Kevin.

I saw that the old per-iommu max value is set to 0x20000, though I'm
not sure whether that's still needed, since if we're going to have a
two-level pasid table then AFAIU we don't need physically contiguous
memory any more (though I saw that we don't yet have the two-level
pasid table implemented):

/* Eventually I'm promised we will get a multi-level PASID table
 * and it won't have to be physically contiguous. Until then,
 * limit the size because 8MiB contiguous allocations can be hard
 * to come by. The limit of 0x20000, which is 1MiB for each of
 * the PASID and PASID-state tables, is somewhat arbitrary. */
if (iommu->pasid_max > 0x20000)
        iommu->pasid_max = 0x20000;
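
(Just to spell the arithmetic out, my own numbers: a full 20-bit PASID
space would be 1 << 20 = 0x100000 entries, i.e. 0x100000 * 8 bytes =
8MiB of contiguous table, which matches the "8MiB contiguous
allocations" worry in that comment; the 0x20000 cap gives 0x20000 * 8
bytes = 1MiB per table.)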

Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 9/9] iommu/vt-d: Remove the obsolete per iommu pasid tables

2018-07-10 Thread Peter Xu
ach_device_domain(int (*fn)(struct 
> device_domain_info *info,
>void *data), void *data);
>  
>  #ifdef CONFIG_INTEL_IOMMU_SVM
> -extern int intel_svm_alloc_pasid_tables(struct intel_iommu *iommu);
> -extern int intel_svm_free_pasid_tables(struct intel_iommu *iommu);
> +int intel_svm_init(struct intel_iommu *iommu);
> +int intel_svm_exit(struct intel_iommu *iommu);
>  extern int intel_svm_enable_prq(struct intel_iommu *iommu);
>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
>  
> -- 
> 2.7.4
> 
> ___
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 6/9] iommu/vt-d: Per PCI device pasid table interfaces

2018-07-10 Thread Peter Xu
val;
> -};
> -
>  struct pasid_state_entry {
>   u64 val;
>  };
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 4fd4c6f..e7901d4 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -476,6 +476,7 @@ struct intel_iommu {
>  struct device_domain_info {
>   struct list_head link;  /* link to domain siblings */
>   struct list_head global; /* link to global list */
> + struct list_head table; /* link to pasid table */
>   u8 bus; /* PCI bus number */
>   u8 devfn;   /* PCI devfn number */
>   u16 pfsid;  /* SRIOV physical function source ID */
> @@ -489,6 +490,7 @@ struct device_domain_info {
>   struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
>   struct intel_iommu *iommu; /* IOMMU used by this device */
>   struct dmar_domain *domain; /* pointer to domain */
> + struct pasid_table *pasid_table; /* pasid table */
>  };
>  
>  static inline void __iommu_flush_cache(
> -- 
> 2.7.4
> 
> ___
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 2/2] iommu/vt-d: Fix iotlb psi missing for mappings

2018-05-03 Thread Peter Xu
When caching mode is enabled for IOMMU, we should send explicit IOTLB
PSIs even for newly created mappings.  However these events are missing
for all intel_iommu_map() callers, e.g., iommu_map().  One direct user
is the vfio-pci driver.

To make sure we'll send the PSIs always when necessary, this patch
firstly introduces a domain_mapping() helper for page mappings, then fixes
the problem by generalizing the explicit map IOTLB PSI logic into that
new helper. With that, we let iommu_domain_identity_map() use the
simplified version to avoid sending the notifications, while for all the
rest of the cases we always send the notifications.

For VM case, we send the PSIs to all the backend IOMMUs for the domain.

This patch allows the nested device assignment to work with QEMU (assign
device firstly to L1 guest, then assign it again to L2 guest).

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 43 +
 1 file changed, 34 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 13190a54aba2..601d3789211f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2352,18 +2352,47 @@ static int __domain_mapping(struct dmar_domain *domain, 
unsigned long iov_pfn,
return 0;
 }
 
+static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
+ struct scatterlist *sg, unsigned long phys_pfn,
+ unsigned long nr_pages, int prot)
+{
+   int ret;
+   struct intel_iommu *iommu;
+
+   /* Do the real mapping first */
+   ret = __domain_mapping(domain, iov_pfn, sg, phys_pfn, nr_pages, prot);
+   if (ret)
+   return ret;
+
+   /* Notify about the new mapping */
+   if (domain_type_is_vm(domain)) {
+  /* VM typed domains can have more than one IOMMUs */
+  int iommu_id;
+  for_each_domain_iommu(iommu_id, domain) {
+  iommu = g_iommus[iommu_id];
+  __mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
+  }
+   } else {
+  /* General domains only have one IOMMU */
+  iommu = domain_get_iommu(domain);
+  __mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
+   }
+
+   return 0;
+}
+
 static inline int domain_sg_mapping(struct dmar_domain *domain, unsigned long 
iov_pfn,
struct scatterlist *sg, unsigned long 
nr_pages,
int prot)
 {
-   return __domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot);
+   return domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot);
 }
 
 static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long 
iov_pfn,
 unsigned long phys_pfn, unsigned long 
nr_pages,
 int prot)
 {
-   return __domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, 
prot);
+   return domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, prot);
 }
 
 static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 
devfn)
@@ -2668,9 +2697,9 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 */
dma_pte_clear_range(domain, first_vpfn, last_vpfn);
 
-   return domain_pfn_mapping(domain, first_vpfn, first_vpfn,
- last_vpfn - first_vpfn + 1,
- DMA_PTE_READ|DMA_PTE_WRITE);
+   return __domain_mapping(domain, first_vpfn, NULL,
+   first_vpfn, last_vpfn - first_vpfn + 1,
+   DMA_PTE_READ|DMA_PTE_WRITE);
 }
 
 static int domain_prepare_identity_map(struct device *dev,
@@ -3637,8 +3666,6 @@ static dma_addr_t __intel_map_single(struct device *dev, 
phys_addr_t paddr,
if (ret)
goto error;
 
-   __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size);
-
start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT;
start_paddr += paddr & ~PAGE_MASK;
return start_paddr;
@@ -3825,8 +3852,6 @@ static int intel_map_sg(struct device *dev, struct 
scatterlist *sglist, int nele
return 0;
}
 
-   __mapping_notify_one(iommu, domain, start_vpfn, size);
-
return nelems;
 }
 
-- 
2.17.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 0/2] iommu/vt-d: Fix mapping PSI missing for iommu_map()

2018-05-03 Thread Peter Xu
v3:
- drop the pr_debug patch [Joerg]
- rename all the subjects as suggested [Joerg]
- rebase

v2:
- cc correct people and iommu list

(PSI stands for: Page Selective Invalidations)

Intel IOMMU has the caching mode to ease emulation of the device.
When that bit is set, we need to send PSIs even for newly mapped
pages.  However current driver is not fully obey the rule.  E.g.,
iommu_map() API will only do the mapping but it never sent the PSIs
before.  That can be problematic to emulated IOMMU devices since
they'll never be able to build up the shadow page tables if without
such information.  This patchset tries to fix the problem.

Patch 1 introduces a helper to notify the MAP PSIs.

Patch 2 fixes the real problem by making sure every domain mapping
will trigger the MAP PSI notifications.

Without the patchset, nested device assignment (assign one device
firstly to L1 guest, then to L2 guest) won't work for QEMU.  After
applying the patchset, it works.

Please review.  Thanks.

Peter Xu (2):
  iommu/vt-d: Introduce __mapping_notify_one()
  iommu/vt-d: Fix iotlb psi missing for mappings

 drivers/iommu/intel-iommu.c | 65 ++---
 1 file changed, 46 insertions(+), 19 deletions(-)

-- 
2.17.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 1/2] iommu/vt-d: Introduce __mapping_notify_one()

2018-05-03 Thread Peter Xu
Introduce this new helper to notify one newly created mapping on one
single IOMMU.  We can further leverage this helper in the next patch.

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 749d8f235346..13190a54aba2 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1606,6 +1606,18 @@ static void iommu_flush_iotlb_psi(struct intel_iommu 
*iommu,
iommu_flush_dev_iotlb(domain, addr, mask);
 }
 
+/* Notification for newly created mappings */
+static inline void __mapping_notify_one(struct intel_iommu *iommu,
+   struct dmar_domain *domain,
+   unsigned long pfn, unsigned int pages)
+{
+   /* It's a non-present to present mapping. Only flush if caching mode */
+   if (cap_caching_mode(iommu->cap))
+   iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1);
+   else
+   iommu_flush_write_buffer(iommu);
+}
+
 static void iommu_flush_iova(struct iova_domain *iovad)
 {
struct dmar_domain *domain;
@@ -3625,13 +3637,7 @@ static dma_addr_t __intel_map_single(struct device *dev, 
phys_addr_t paddr,
if (ret)
goto error;
 
-   /* it's a non-present to present mapping. Only flush if caching mode */
-   if (cap_caching_mode(iommu->cap))
-   iommu_flush_iotlb_psi(iommu, domain,
- mm_to_dma_pfn(iova_pfn),
- size, 0, 1);
-   else
-   iommu_flush_write_buffer(iommu);
+   __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size);
 
start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT;
start_paddr += paddr & ~PAGE_MASK;
@@ -3819,11 +3825,7 @@ static int intel_map_sg(struct device *dev, struct 
scatterlist *sglist, int nele
return 0;
}
 
-   /* it's a non-present to present mapping. Only flush if caching mode */
-   if (cap_caching_mode(iommu->cap))
-   iommu_flush_iotlb_psi(iommu, domain, start_vpfn, size, 0, 1);
-   else
-   iommu_flush_write_buffer(iommu);
+   __mapping_notify_one(iommu, domain, start_vpfn, size);
 
return nelems;
 }
-- 
2.17.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 2/3] intel-iommu: generalize __mapping_notify_one()

2018-05-03 Thread Peter Xu
On Thu, May 03, 2018 at 03:14:24PM +0200, Joerg Roedel wrote:
> On Wed, Apr 18, 2018 at 04:39:52PM +0800, Peter Xu wrote:
> > Generalize this new helper to notify one newly created mapping on one
> > single IOMMU.  We can further leverage this helper in the next patch.
> 
> You introduce the function, you do not generalize it. Please fix that in
> the subject and description.
> 
> Also, please drop patch 1, I don't like this pr_debug stuff in the
> code. And please also fix the subject line in general to match the form:
> 
>   iommu/vt-d: _I_ntroduce __mapping_notify_one()
> 
> When that is done, please re-send.

Will do.  Thanks, Joerg.

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 2/3] intel-iommu: generalize __mapping_notify_one()

2018-04-18 Thread Peter Xu
Generalize this new helper to notify one newly created mapping on one
single IOMMU.  We can further leverage this helper in the next patch.

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index a64da83e867c..bf111e60857c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1607,6 +1607,18 @@ static void iommu_flush_iotlb_psi(struct intel_iommu 
*iommu,
iommu_flush_dev_iotlb(domain, addr, mask);
 }
 
+/* Notification for newly created mappings */
+static inline void __mapping_notify_one(struct intel_iommu *iommu,
+   struct dmar_domain *domain,
+   unsigned long pfn, unsigned int pages)
+{
+   /* It's a non-present to present mapping. Only flush if caching mode */
+   if (cap_caching_mode(iommu->cap))
+   iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1);
+   else
+   iommu_flush_write_buffer(iommu);
+}
+
 static void iommu_flush_iova(struct iova_domain *iovad)
 {
struct dmar_domain *domain;
@@ -3626,13 +3638,7 @@ static dma_addr_t __intel_map_single(struct device *dev, 
phys_addr_t paddr,
if (ret)
goto error;
 
-   /* it's a non-present to present mapping. Only flush if caching mode */
-   if (cap_caching_mode(iommu->cap))
-   iommu_flush_iotlb_psi(iommu, domain,
- mm_to_dma_pfn(iova_pfn),
- size, 0, 1);
-   else
-   iommu_flush_write_buffer(iommu);
+   __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size);
 
start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT;
start_paddr += paddr & ~PAGE_MASK;
@@ -3851,11 +3857,7 @@ static int intel_map_sg(struct device *dev, struct 
scatterlist *sglist, int nele
return 0;
}
 
-   /* it's a non-present to present mapping. Only flush if caching mode */
-   if (cap_caching_mode(iommu->cap))
-   iommu_flush_iotlb_psi(iommu, domain, start_vpfn, size, 0, 1);
-   else
-   iommu_flush_write_buffer(iommu);
+   __mapping_notify_one(iommu, domain, start_vpfn, size);
 
return nelems;
 }
-- 
2.14.3

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 3/3] intel-iommu: fix iotlb psi missing for mappings

2018-04-18 Thread Peter Xu
When caching mode is enabled for IOMMU, we should send explicit IOTLB
PSIs even for newly created mappings.  However these events are missing
for all intel_iommu_map() callers, e.g., iommu_map().  One direct user
is the vfio-pci driver.

To make sure we'll send the PSIs always when necessary, this patch
firstly introduces a domain_mapping() helper for page mappings, then fixes
the problem by generalizing the explicit map IOTLB PSI logic into that
new helper. With that, we let iommu_domain_identity_map() use the
simplified version to avoid sending the notifications, while for all the
rest of the cases we always send the notifications.

For VM case, we send the PSIs to all the backend IOMMUs for the domain.

This patch allows the nested device assignment to work with QEMU (assign
device firstly to L1 guest, then assign it again to L2 guest).

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 43 ++-
 1 file changed, 34 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index bf111e60857c..eb0f0911342f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2353,18 +2353,47 @@ static int __domain_mapping(struct dmar_domain *domain, 
unsigned long iov_pfn,
return 0;
 }
 
+static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
+ struct scatterlist *sg, unsigned long phys_pfn,
+ unsigned long nr_pages, int prot)
+{
+   int ret;
+   struct intel_iommu *iommu;
+
+   /* Do the real mapping first */
+   ret = __domain_mapping(domain, iov_pfn, sg, phys_pfn, nr_pages, prot);
+   if (ret)
+   return ret;
+
+   /* Notify about the new mapping */
+   if (domain_type_is_vm(domain)) {
+  /* VM typed domains can have more than one IOMMUs */
+  int iommu_id;
+  for_each_domain_iommu(iommu_id, domain) {
+  iommu = g_iommus[iommu_id];
+  __mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
+  }
+   } else {
+  /* General domains only have one IOMMU */
+  iommu = domain_get_iommu(domain);
+  __mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
+   }
+
+   return 0;
+}
+
 static inline int domain_sg_mapping(struct dmar_domain *domain, unsigned long 
iov_pfn,
struct scatterlist *sg, unsigned long 
nr_pages,
int prot)
 {
-   return __domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot);
+   return domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot);
 }
 
 static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long 
iov_pfn,
 unsigned long phys_pfn, unsigned long 
nr_pages,
 int prot)
 {
-   return __domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, 
prot);
+   return domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, prot);
 }
 
 static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 
devfn)
@@ -2669,9 +2698,9 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 */
dma_pte_clear_range(domain, first_vpfn, last_vpfn);
 
-   return domain_pfn_mapping(domain, first_vpfn, first_vpfn,
- last_vpfn - first_vpfn + 1,
- DMA_PTE_READ|DMA_PTE_WRITE);
+   return __domain_mapping(domain, first_vpfn, NULL,
+   first_vpfn, last_vpfn - first_vpfn + 1,
+   DMA_PTE_READ|DMA_PTE_WRITE);
 }
 
 static int domain_prepare_identity_map(struct device *dev,
@@ -3638,8 +3667,6 @@ static dma_addr_t __intel_map_single(struct device *dev, 
phys_addr_t paddr,
if (ret)
goto error;
 
-   __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size);
-
start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT;
start_paddr += paddr & ~PAGE_MASK;
return start_paddr;
@@ -3857,8 +3884,6 @@ static int intel_map_sg(struct device *dev, struct 
scatterlist *sglist, int nele
return 0;
}
 
-   __mapping_notify_one(iommu, domain, start_vpfn, size);
-
return nelems;
 }
 
-- 
2.14.3

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 1/3] intel-iommu: add some traces for PSIs

2018-04-18 Thread Peter Xu
It is helpful for debugging and triaging missing PSI notifications.

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/dmar.c| 3 +++
 drivers/iommu/intel-iommu.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 9a7ffd13c7f0..62ae26c3f7b7 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1325,6 +1325,9 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, 
u64 addr,
struct qi_desc desc;
int ih = 0;
 
+   pr_debug("%s: iommu %d did %u addr 0x%llx order %u type %llx\n",
+__func__, iommu->seq_id, did, addr, size_order, type);
+
if (cap_write_drain(iommu->cap))
dw = 1;
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 582fd01cb7d1..a64da83e867c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1396,6 +1396,9 @@ static void __iommu_flush_iotlb(struct intel_iommu 
*iommu, u16 did,
u64 val = 0, val_iva = 0;
unsigned long flag;
 
+   pr_debug("%s: iommu %d did %u addr 0x%llx order %u type %llx\n",
+__func__, iommu->seq_id, did, addr, size_order, type);
+
switch (type) {
case DMA_TLB_GLOBAL_FLUSH:
/* global flush doesn't need set IVA_REG */
-- 
2.14.3

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 0/3] intel-iommu: fix mapping PSI missing for iommu_map()

2018-04-18 Thread Peter Xu
v2:
- cc correct people and iommu list

(PSI stands for: Page Selective Invalidations)

Intel IOMMU has the caching mode to ease emulation of the device.
When that bit is set, we need to send PSIs even for newly mapped
pages.  However the current driver does not fully obey the rule.  E.g., the
iommu_map() API will only do the mapping but never sends the PSIs.
That can be problematic for emulated IOMMU devices since they'll never
be able to build up the shadow page tables without such information.
This patchset tries to fix the problem.

Patch 1 is a tracing enhancement that helped me to triage the problem.
It might even be useful in the future.

Patch 2 generalized a helper to notify the MAP PSIs.

Patch 3 fixes the real problem by making sure every domain mapping
will trigger the MAP PSI notifications.

Without the patchset, nested device assignment (assign one device
firstly to L1 guest, then to L2 guest) won't work for QEMU.  After
applying the patchset, it works.

Please review.  Thanks.

Peter Xu (3):
  intel-iommu: add some traces for PSIs
  intel-iommu: generalize __mapping_notify_one()
  intel-iommu: fix iotlb psi missing for mappings

 drivers/iommu/dmar.c|  3 ++
 drivers/iommu/intel-iommu.c | 68 -
 2 files changed, 52 insertions(+), 19 deletions(-)

-- 
2.14.3

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu/vt-d: use domain instead of cache fetching

2018-01-09 Thread Peter Xu
After commit a1ddcbe93010 ("iommu/vt-d: Pass dmar_domain directly into
iommu_flush_iotlb_psi", 2015-08-12), we have the domain pointer as a
parameter to iommu_flush_iotlb_psi(), so there is no need to fetch it
from the cache again.

More importantly, a NULL pointer dereference bug is reported on RHEL7 (and
it can be reproduced on some old upstream kernels too, e.g., v4.13) by
unplugging a 40g NIC from a VM (it's hard to test the unplug on a real
host, but it should be the same):

https://bugzilla.redhat.com/show_bug.cgi?id=1531367

[   24.391863] pciehp :00:03.0:pcie004: Slot(0): Attention button pressed
[   24.393442] pciehp :00:03.0:pcie004: Slot(0): Powering off due to button 
press
[   29.721068] i40evf :01:00.0: Unable to send opcode 2 to PF, err 
I40E_ERR_QUEUE_EMPTY, aq_err OK
[   29.783557] iommu: Removing device :01:00.0 from group 3
[   29.784662] BUG: unable to handle kernel NULL pointer dereference at 
0304
[   29.785817] IP: iommu_flush_iotlb_psi+0xcf/0x120
[   29.786486] PGD 0
[   29.786487] P4D 0
[   29.786812]
[   29.787390] Oops:  [#1] SMP
[   29.787876] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc 
ip6table_ng
[   29.795371] CPU: 0 PID: 156 Comm: kworker/0:2 Not tainted 4.13.0 #14
[   29.796366] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.11.0-1.el7 04/01/2014
[   29.797593] Workqueue: pciehp-0 pciehp_power_thread
[   29.798328] task: 94f5745b4a00 task.stack: b326805ac000
[   29.799178] RIP: 0010:iommu_flush_iotlb_psi+0xcf/0x120
[   29.799919] RSP: 0018:b326805afbd0 EFLAGS: 00010086
[   29.800666] RAX: 94f5bc56e800 RBX:  RCX: 00020025
[   29.801667] RDX: 94f5bc56e000 RSI: 0082 RDI: 
[   29.802755] RBP: b326805afbf8 R08:  R09: 94f5bc86bbf0
[   29.803772] R10: b326805afba8 R11: 000ffdc4 R12: 94f5bc86a400
[   29.804789] R13:  R14: ffdc4000 R15: 
[   29.805792] FS:  () GS:94f5bfc0() 
knlGS:
[   29.806923] CS:  0010 DS:  ES:  CR0: 80050033
[   29.807736] CR2: 0304 CR3: 3499d000 CR4: 06f0
[   29.808747] Call Trace:
[   29.809156]  flush_unmaps_timeout+0x126/0x1c0
[   29.809800]  domain_exit+0xd6/0x100
[   29.810322]  device_notifier+0x6b/0x70
[   29.810902]  notifier_call_chain+0x4a/0x70
[   29.812822]  __blocking_notifier_call_chain+0x47/0x60
[   29.814499]  blocking_notifier_call_chain+0x16/0x20
[   29.816137]  device_del+0x233/0x320
[   29.817588]  pci_remove_bus_device+0x6f/0x110
[   29.819133]  pci_stop_and_remove_bus_device+0x1a/0x20
[   29.820817]  pciehp_unconfigure_device+0x7a/0x1d0
[   29.822434]  pciehp_disable_slot+0x52/0xe0
[   29.823931]  pciehp_power_thread+0x8a/0xa0
[   29.825411]  process_one_work+0x18c/0x3a0
[   29.826875]  worker_thread+0x4e/0x3b0
[   29.828263]  kthread+0x109/0x140
[   29.829564]  ? process_one_work+0x3a0/0x3a0
[   29.831081]  ? kthread_park+0x60/0x60
[   29.832464]  ret_from_fork+0x25/0x30
[   29.833794] Code: 85 ed 74 0b 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 8b 54 24 
60 44 89 f8 0f b6 c4 48 8b 04 c2 48 85 c0 74 49 45 0f b6 ff 4a 8b 3c f8 <80> bf
[   29.838514] RIP: iommu_flush_iotlb_psi+0xcf/0x120 RSP: b326805afbd0
[   29.840362] CR2: 0304
[   29.841716] ---[ end trace b10ec0d6900868d3 ]---

This patch fixes that problem if applied to v4.13 kernel.

The bug does not exist in the latest upstream kernel since it's fixed as a
side effect of commit 13cf01744608 ("iommu/vt-d: Make use of iova
deferred flushing", 2017-08-15).  But IMHO it's still good to have this
patch upstream.

CC: Alex Williamson <alex.william...@redhat.com>
Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4a2de34895ec..869f37d1f3b7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1601,8 +1601,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu 
*iommu,
 * flush. However, device IOTLB doesn't need to be flushed in this case.
 */
if (!cap_caching_mode(iommu->cap) || !map)
-   iommu_flush_dev_iotlb(get_iommu_domain(iommu, did),
- addr, mask);
+   iommu_flush_dev_iotlb(domain, addr, mask);
 }
 
 static void iommu_flush_iova(struct iova_domain *iovad)
-- 
2.14.3

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/vt-d: Fix shift overflow in qi_flush_dev_iotlb

2017-12-14 Thread Peter Xu
On Wed, Dec 13, 2017 at 11:31:02AM -0600, Hook, Gary wrote:
> On 12/13/2017 11:15 AM, Alex Williamson wrote:
> > On Wed, 13 Dec 2017 10:41:47 -0600
> > "Hook, Gary" <gh...@amd.com> wrote:
> > 
> > > On 12/13/2017 9:58 AM, Alex Williamson wrote:
> > > > On Wed, 13 Dec 2017 15:13:55 +0800
> > > > Peter Xu <pet...@redhat.com> wrote:
> > > > > On Tue, Dec 12, 2017 at 03:43:08PM -0700, Alex Williamson wrote:
> > > > > 
> > > > > [...]
> > > > > > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > > > > > index 9a7ffd13c7f0..87888b102057 100644
> > > > > > --- a/drivers/iommu/dmar.c
> > > > > > +++ b/drivers/iommu/dmar.c
> > > > > > @@ -1345,7 +1345,9 @@ void qi_flush_dev_iotlb(struct intel_iommu 
> > > > > > *iommu, u16 sid, u16 qdep,
> > > > > > struct qi_desc desc;
> > > > > > if (mask) {
> > > > > > -   BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
> > > > > > +   BUG_ON((mask > MAX_AGAW_PFN_WIDTH) ||
> > > > > > +  ((mask == MAX_AGAW_PFN_WIDTH) && addr) ||
> > > > > > +  (addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1)));
> > > > > 
> > > > > Could it work if we just use 1ULL instead of 1 here?  Thanks,
> > > > 
> > > > In either case we're talking about shifting off the end of the
> > > > variable, which I understand to be undefined.  Right?  Thanks,
> > > 
> > > How so? Bits fall off the left (MSB) end, zeroes fill in the right (LSB)
> > > end. I believe that behavior is pretty set.
> > 
> > Maybe I'm relying too much on stackoverflow, but:
> > 
> > https://stackoverflow.com/questions/11270492/what-does-the-c-standard-say-about-bitshifting-more-bits-than-the-width-of-type
> 
> No, probably not. I don't have my copy of c99 handy, so can't check it. But
> it is beyond me why any compiler implementation would choose to use a rotate
> instead of a shift... probably a performance issue.
> 
> So, yeah, when you have silly parameters, you get what you get.
> 
> I'll stick to my suggestion. Which seems unambiguous... but I could be
> wrong.

Hi, Alex, Hook,

I did a quick test:

xz-mi:tmp $ cat a.c
#include <stdio.h>

void main(void)
{
    unsigned long long val = 0x8000000000000001ULL;
int shift; 

printf("origin: 0x%llx\n", val);
shift = 1; printf("shl 1: 0x%llx\n", val << shift);
shift = 62; printf("shl 62: 0x%llx\n", val << shift);
shift = 63; printf("shl 63: 0x%llx\n", val << shift);
shift = 64; printf("shl 64: 0x%llx\n", val << shift);
shift = 65; printf("shl 65: 0x%llx\n", val << shift);
}
xz-mi:tmp $ gcc -std=c99 a.c
xz-mi:tmp $ ./a.out 
origin: 0x8000000000000001
shl 1: 0x2
shl 62: 0x4000000000000000
shl 63: 0x8000000000000000
shl 64: 0x8000000000000001
shl 65: 0x2
xz-mi:tmp $ objdump -d a.out | grep -A20 "<main>"
004004d7 <main>:
  4004d7:   55  push   %rbp
  4004d8:   48 89 e5mov%rsp,%rbp
  4004db:   48 83 ec 10 sub$0x10,%rsp
  4004df:   48 b8 01 00 00 00 00    movabs $0x8000000000000001,%rax
  4004e6:   00 00 80 
  4004e9:   48 89 45 f8 mov%rax,-0x8(%rbp)
  4004ed:   48 8b 45 f8 mov-0x8(%rbp),%rax
  4004f1:   48 89 c6mov%rax,%rsi
  4004f4:   bf 60 06 40 00  mov$0x400660,%edi
  4004f9:   b8 00 00 00 00  mov$0x0,%eax
  4004fe:   e8 ed fe ff ff  callq  4003f0 <printf@plt>
  400503:   c7 45 f4 01 00 00 00movl   $0x1,-0xc(%rbp)
  40050a:   8b 45 f4    mov-0xc(%rbp),%eax
  40050d:   48 8b 55 f8 mov-0x8(%rbp),%rdx
  400511:   89 c1   mov%eax,%ecx
  400513:   48 d3 e2shl%cl,%rdx
  400516:   48 89 d0mov%rdx,%rax
  400519:   48 89 c6mov%rax,%rsi
  40051c:   bf 70 06 40 00  mov$0x400670,%edi
  400521:   b8 00 00 00 00  mov$0x0,%eax

So it seems it's not really a rotation operation, but it looks more like
converting a "shifting N" into "shifting N%64" when N>=64.  So now I
agree with Alex's change.  Thanks all.
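
In other words (my own reading of it): with mask == MAX_AGAW_PFN_WIDTH
the shift count is VTD_PAGE_SHIFT + mask == 64, so

    (1ULL << (VTD_PAGE_SHIFT + mask)) - 1   /* count 64: UB; on x86 it acts like << 0, i.e. the mask is 0 */

and the alignment check would silently pass, which is why the explicit
mask > / == MAX_AGAW_PFN_WIDTH checks are needed rather than just
switching the literal to 1ULL.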

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/vt-d: Fix shift overflow in qi_flush_dev_iotlb

2017-12-12 Thread Peter Xu
On Tue, Dec 12, 2017 at 03:43:08PM -0700, Alex Williamson wrote:

[...]

> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 9a7ffd13c7f0..87888b102057 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1345,7 +1345,9 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 
> sid, u16 qdep,
>   struct qi_desc desc;
>  
>   if (mask) {
> - BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
> + BUG_ON((mask > MAX_AGAW_PFN_WIDTH) ||
> +((mask == MAX_AGAW_PFN_WIDTH) && addr) ||
> +(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1)));

Could it work if we just use 1ULL instead of 1 here?  Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 03/20] intel_iommu: add "svm" option

2017-05-08 Thread Peter Xu
On Mon, May 08, 2017 at 10:38:09AM +, Liu, Yi L wrote:
> On Thu, 27 Apr 2017 18:53:17 +0800
> Peter Xu <pet...@redhat.com> wrote:
> 
> > On Wed, Apr 26, 2017 at 06:06:33PM +0800, Liu, Yi L wrote:
> > > Expose "Shared Virtual Memory" to guest by using "svm" option.
> > > Also use "svm" to expose SVM related capabilities to guest.
> > > e.g. "-device intel-iommu, svm=on"
> > >
> > > Signed-off-by: Liu, Yi L <yi.l@linux.intel.com>
> > > ---
> > >  hw/i386/intel_iommu.c  | 10 ++
> > >  hw/i386/intel_iommu_internal.h |  5 +
> > > include/hw/i386/intel_iommu.h  |  1 +
> > >  3 files changed, 16 insertions(+)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > > bf98fa5..ba1e7eb 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -2453,6 +2453,7 @@ static Property vtd_properties[] = {
> > >  DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
> > >  DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode,
> > FALSE),
> > >  DEFINE_PROP_BOOL("ecs", IntelIOMMUState, ecs, FALSE),
> > > +DEFINE_PROP_BOOL("svm", IntelIOMMUState, svm, FALSE),
> > >  DEFINE_PROP_END_OF_LIST(),
> > >  };
> > >
> > > @@ -2973,6 +2974,15 @@ static void vtd_init(IntelIOMMUState *s)
> > >  s->ecap |= VTD_ECAP_ECS;
> > >  }
> > >
> > > +if (s->svm) {
> > > +if (!s->ecs || !x86_iommu->pt_supported || !s->caching_mode) {
> > > +error_report("Need to set ecs, pt, caching-mode for svm");
> > > +exit(1);
> > > +}
> > > +s->cap |= VTD_CAP_DWD | VTD_CAP_DRD;
> > > +s->ecap |= VTD_ECAP_PRS | VTD_ECAP_PTS | VTD_ECAP_PASID28;
> > > +}
> > > +
> > >  if (s->caching_mode) {
> > >  s->cap |= VTD_CAP_CM;
> > >  }
> > > diff --git a/hw/i386/intel_iommu_internal.h
> > > b/hw/i386/intel_iommu_internal.h index 71a1c1e..f2a7d12 100644
> > > --- a/hw/i386/intel_iommu_internal.h
> > > +++ b/hw/i386/intel_iommu_internal.h
> > > @@ -191,6 +191,9 @@
> > >  #define VTD_ECAP_PT (1ULL << 6)
> > >  #define VTD_ECAP_MHMV   (15ULL << 20)
> > >  #define VTD_ECAP_ECS(1ULL << 24)
> > > +#define VTD_ECAP_PASID28(1ULL << 28)
> > 
> > Could I ask what's this bit? On my spec, it says this bit is reserved and 
> > defunct (spec
> > version: June 2016).
> 
> As Ashok confirmed, yes it should be bit 40. would update it.

Ok.

> 
> > > +#define VTD_ECAP_PRS(1ULL << 29)
> > > +#define VTD_ECAP_PTS(0xeULL << 35)
> > 
> > Would it be better if we avoid using 0xe here, or at least add some comment?
> 
> For this value, it must be no more than the bits host supports. So it may be
> better to have a default value and meanwhile expose an option to let user
> set it. how about your opinion?

I think a more important point is that we need to make sure this value
is no larger than what the hardware supports. Since you are also working
on the vfio interface for virt-svm... would it be possible that we can
talk to the kernel in some way so that we can know the supported pasid
size of the host IOMMU? So that when the guest specifies something
bigger, we can stop the user.
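
Something like the below is what I mean (all the names here are made
up, e.g. s->pasid_bits and vtd_host_pasid_bits() do not exist; the real
interface to fetch the host capability is still to be defined):

    /* sketch only: reject a guest pasid width larger than the host's */
    if (s->svm) {
        unsigned host_pasid_bits = vtd_host_pasid_bits(s);   /* hypothetical */

        if (s->pasid_bits > host_pasid_bits) {
            error_report("guest PASID width %u exceeds host support %u",
                         s->pasid_bits, host_pasid_bits);
            exit(1);
        }
    }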

I don't know the practical value for this field; if it's static
enough, I think it's also okay to make it static here as well.  But
again, I would prefer at least some comment, like:

  /* Value N indicates PASID field of N+1 bits, here 0xe stands for.. */
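
(Following that encoding, 0xe would advertise a PASID field of 0xe + 1 =
15 bits, i.e. 1 << 15 = 32768 PASIDs; that's my arithmetic, worth double
checking against the spec.)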

> 
> > 
> > >
> > >  /* CAP_REG */
> > >  /* (offset >> 4) << 24 */
> > > @@ -207,6 +210,8 @@
> > >  #define VTD_CAP_PSI (1ULL << 39)
> > >  #define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
> > >  #define VTD_CAP_CM  (1ULL << 7)
> > > +#define VTD_CAP_DWD (1ULL << 54)
> > > +#define VTD_CAP_DRD (1ULL << 55)
> > 
> > Just to confirm: after this series, we should support drain read/write 
> > then, right?
> 
> I haven't done any special processing for it in the IOMMU emulator. It's set
> to keep consistent with the VT-d spec, since DWD and DRD are required
> capabilities when PASID is reported as Set. However, I think it should be
> fine if the guest issues QI with drain read/write set in the descriptor.
> The host should be able to process it.

I see. IIUC the point here is we need to deliver these requests to the
host IOMMU, and I guess we need to be able to do this in a synchronous
way as well.

Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH 03/20] intel_iommu: add "svm" option

2017-04-27 Thread Peter Xu
On Wed, Apr 26, 2017 at 06:06:33PM +0800, Liu, Yi L wrote:
> Expose "Shared Virtual Memory" to guest by using "svm" option.
> Also use "svm" to expose SVM related capabilities to guest.
> e.g. "-device intel-iommu, svm=on"
> 
> Signed-off-by: Liu, Yi L <yi.l@linux.intel.com>
> ---
>  hw/i386/intel_iommu.c  | 10 ++
>  hw/i386/intel_iommu_internal.h |  5 +
>  include/hw/i386/intel_iommu.h  |  1 +
>  3 files changed, 16 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index bf98fa5..ba1e7eb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2453,6 +2453,7 @@ static Property vtd_properties[] = {
>  DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
>  DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
>  DEFINE_PROP_BOOL("ecs", IntelIOMMUState, ecs, FALSE),
> +DEFINE_PROP_BOOL("svm", IntelIOMMUState, svm, FALSE),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -2973,6 +2974,15 @@ static void vtd_init(IntelIOMMUState *s)
>  s->ecap |= VTD_ECAP_ECS;
>  }
>  
> +if (s->svm) {
> +if (!s->ecs || !x86_iommu->pt_supported || !s->caching_mode) {
> +error_report("Need to set ecs, pt, caching-mode for svm");
> +exit(1);
> +}
> +s->cap |= VTD_CAP_DWD | VTD_CAP_DRD;
> +s->ecap |= VTD_ECAP_PRS | VTD_ECAP_PTS | VTD_ECAP_PASID28;
> +}
> +
>  if (s->caching_mode) {
>  s->cap |= VTD_CAP_CM;
>  }
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 71a1c1e..f2a7d12 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -191,6 +191,9 @@
>  #define VTD_ECAP_PT (1ULL << 6)
>  #define VTD_ECAP_MHMV   (15ULL << 20)
>  #define VTD_ECAP_ECS(1ULL << 24)
> +#define VTD_ECAP_PASID28(1ULL << 28)

Could I ask what's this bit? On my spec, it says this bit is reserved
and defunct (spec version: June 2016).

> +#define VTD_ECAP_PRS(1ULL << 29)
> +#define VTD_ECAP_PTS(0xeULL << 35)

Would it be better if we avoid using 0xe here, or at least add some comment?

>  
>  /* CAP_REG */
>  /* (offset >> 4) << 24 */
> @@ -207,6 +210,8 @@
>  #define VTD_CAP_PSI (1ULL << 39)
>  #define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
>  #define VTD_CAP_CM  (1ULL << 7)
> +#define VTD_CAP_DWD (1ULL << 54)
> +#define VTD_CAP_DRD (1ULL << 55)

Just to confirm: after this series, we should support drain read/write
then, right?

Thanks,

>  
>  /* Supported Adjusted Guest Address Widths */
>  #define VTD_CAP_SAGAW_SHIFT 8
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index ae21fe5..8981615 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -267,6 +267,7 @@ struct IntelIOMMUState {
>  
>  bool caching_mode;  /* RO - is cap CM enabled? */
>  bool ecs;   /* Extended Context Support */
> +bool svm;   /* Shared Virtual Memory */
>  
>  dma_addr_t root;/* Current root table pointer */
>  bool root_extended; /* Type of root table (extended or not) 
> */
> -- 
> 1.9.1
> 

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [Qemu-devel] [RFC PATCH 12/20] Memory: Add func to fire pasidt_bind notifier

2017-04-27 Thread Peter Xu
On Thu, Apr 27, 2017 at 06:25:37PM +0800, Liu, Yi L wrote:
> On Thu, Apr 27, 2017 at 02:14:27PM +0800, Peter Xu wrote:
> > On Thu, Apr 27, 2017 at 10:37:19AM +0800, Liu, Yi L wrote:
> > > On Wed, Apr 26, 2017 at 03:50:16PM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 26/04/2017 12:06, Liu, Yi L wrote:
> > > > > +void memory_region_notify_iommu_svm_bind(MemoryRegion *mr,
> > > > > + void *data)
> > > > > +{
> > > > > +IOMMUNotifier *iommu_notifier;
> > > > > +IOMMUNotifierFlag request_flags;
> > > > > +
> > > > > +assert(memory_region_is_iommu(mr));
> > > > > +
> > > > > +/*TODO: support other bind requests with smaller gran,
> > > > > + * e.g. bind single pasid entry
> > > > > + */
> > > > > +request_flags = IOMMU_NOTIFIER_SVM_PASIDT_BIND;
> > > > > +
> > > > > +QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > > > > +if (iommu_notifier->notifier_flags & request_flags) {
> > > > > +iommu_notifier->notify(iommu_notifier, data);
> > > > > +break;
> > > > > +}
> > > > > +}
> > > > 
> > > > Peter,
> > > > 
> > > > should this reuse ->notify, or should it be different function pointer
> > > > in IOMMUNotifier?
> > > 
> > > Hi Paolo,
> > > 
> > > Thx for your review.
> > > 
> > > I think it should be “->notify” here. In this patchset, the new notifier
> > > is registered with the existing notifier registration API. So all the
> > > notifiers are in the mr->iommu_notify list. And notifiers are labeled
> > > by notify flag, so it is able to differentiate the IOMMUNotifier nodes.
> > > When the flag meets, trigger it by “->notify”. The diagram below shows
> > > my understanding , wish it helps to make me understood.
> > > 
> > > VFIOContainer
> > >|
> > >giommu_list(VFIOGuestIOMMU)
> > > \
> > >  VFIOGuestIOMMU1 ->   VFIOGuestIOMMU2 -> VFIOGuestIOMMU3 
> > > ...
> > > | | |
> > > mr->iommu_notify: IOMMUNotifier   ->IOMMUNotifier  ->  IOMMUNotifier
> > >   (Flag:MAP/UNMAP) (Flag:SVM bind)  (Flag:tlb 
> > > invalidate)
> > > 
> > > 
> > > Actually, compared with the MAP/UNMAP notifier, the newly added notifier 
> > > has
> > > no start/end check, and there may be other types of bind notifier flag in
> > > future, so I added a separate fire func for SVM bind notifier.
> > 
> > I agree with Paolo that this interface might not be the suitable place
> > for the SVM notifiers (just like what I worried about in previous
> > discussions).
> > 
> > The biggest problem is that, if you see current notifier mechanism,
> > it's per-memory-region. However iiuc your messages should be
> > per-iommu, or say, per translation unit.
> 
> Hi Peter,
> 
> yes, you're right. the newly added notifier is per-iommu.
> 
> > Meanwhile, for each iommu, there
> > can be more than one memory region (ppc can be an example). When
> > there is more than one MR bound to the same iommu unit, which
> > memory region should you register to? Any one of them, or all?
> 
> Honestly, I'm not an expert on ppc. According to the current code,
> I can only find one MR initialized with memory_region_init_iommu()
> in spapr_tce_table_realize(). So to better get your point, let me
> check. Do you mean there may be multiple iommu MRs behind an iommu?

I am not either. :)

But yes, that's what I mean. At least that's how I understand it.

> 
> I admit it must be considered if there are multiple iommu MRs. I may
> choose to register for one of them since the notifier is per-iommu as
> you've pointed out. Then the vIOMMU emulator needs to trigger the notifier with
> the correct MR. Not sure if ppc vIOMMU is fine with it.
> 
> > So my conclusion is, it just has nothing to do with memory regions...
> >
> > Instead of a different function pointer in IOMMUNotifier, IMHO we can
> > even move a step further, to isolate IOTLB notifications (targeted at
> > memory regions and with start/end ranges) out of SVM/other
> > notifications, since they are diffe

Re: [RFC PATCH 02/20] intel_iommu: exposed extended-context mode to guest

2017-04-27 Thread Peter Xu
 bus_num,
>  return ret_fr;
>  }
>  
> -if (!vtd_root_entry_present(&re)) {
> +if (!vtd_root_entry_present(&re) ||
> +(s->ecs && (devfn > 0x7f) && (!vtd_root_entry_upper_present(&re)))) {
>  /* Not error - it's okay we don't have root entry. */
>  trace_vtd_re_not_present(bus_num);
>  return -VTD_FR_ROOT_ENTRY_P;
> -} else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
> -trace_vtd_re_invalid(re.rsvd, re.val);
> -return -VTD_FR_ROOT_ENTRY_RSVD;
> +}
> +if ((s->ecs && (devfn > 0x7f) && (re.rsvd & VTD_ROOT_ENTRY_RSVD)) ||
> +(s->ecs && (devfn < 0x80) && (re.val & VTD_ROOT_ENTRY_RSVD)) ||
> +((!s->ecs) && (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)))) {
> +trace_vtd_re_invalid(re.rsvd, re.val);
> +return -VTD_FR_ROOT_ENTRY_RSVD;

Nit: I feel like we could better wrap these 0x7f and 0x80 checks into helper
functions, especially with the above structure change...
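
For example (purely illustrative, the name is invented):

    static inline bool vtd_devfn_in_upper_half(uint8_t devfn)
    {
        /* extended root entries split each bus at devfn 0x80 */
        return devfn > 0x7f;
    }

so the conditions above could read vtd_devfn_in_upper_half(devfn) and
!vtd_devfn_in_upper_half(devfn) instead of the raw constants.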

(will hold here...)

Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [Qemu-devel] [RFC PATCH 12/20] Memory: Add func to fire pasidt_bind notifier

2017-04-27 Thread Peter Xu
On Thu, Apr 27, 2017 at 02:14:27PM +0800, Peter Xu wrote:
> On Thu, Apr 27, 2017 at 10:37:19AM +0800, Liu, Yi L wrote:
> > On Wed, Apr 26, 2017 at 03:50:16PM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 26/04/2017 12:06, Liu, Yi L wrote:
> > > > +void memory_region_notify_iommu_svm_bind(MemoryRegion *mr,
> > > > + void *data)
> > > > +{
> > > > +IOMMUNotifier *iommu_notifier;
> > > > +IOMMUNotifierFlag request_flags;
> > > > +
> > > > +assert(memory_region_is_iommu(mr));
> > > > +
> > > > +/*TODO: support other bind requests with smaller gran,
> > > > + * e.g. bind single pasid entry
> > > > + */
> > > > +request_flags = IOMMU_NOTIFIER_SVM_PASIDT_BIND;
> > > > +
> > > > +QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > > > +if (iommu_notifier->notifier_flags & request_flags) {
> > > > +iommu_notifier->notify(iommu_notifier, data);
> > > > +break;
> > > > +}
> > > > +}
> > > 
> > > Peter,
> > > 
> > > should this reuse ->notify, or should it be different function pointer
> > > in IOMMUNotifier?
> > 
> > Hi Paolo,
> > 
> > Thx for your review.
> > 
> > I think it should be “->notify” here. In this patchset, the new notifier
> > is registered with the existing notifier registration API. So all the
> > notifiers are in the mr->iommu_notify list. And notifiers are labeled
> > by notify flag, so it is able to differentiate the IOMMUNotifier nodes.
> > When the flag meets, trigger it by “->notify”. The diagram below shows
> > my understanding , wish it helps to make me understood.
> > 
> > VFIOContainer
> >|
> >giommu_list(VFIOGuestIOMMU)
> > \
> >  VFIOGuestIOMMU1 ->   VFIOGuestIOMMU2 -> VFIOGuestIOMMU3 ...
> > | | |
> > mr->iommu_notify: IOMMUNotifier   ->IOMMUNotifier  ->  IOMMUNotifier
> >   (Flag:MAP/UNMAP) (Flag:SVM bind)  (Flag:tlb 
> > invalidate)
> > 
> > 
> > Actually, compared with the MAP/UNMAP notifier, the newly added notifier has
> > no start/end check, and there may be other types of bind notifier flag in
> > future, so I added a separate fire func for SVM bind notifier.
> 
> I agree with Paolo that this interface might not be the suitable place
> for the SVM notifiers (just like what I worried about in previous
> discussions).
> 
> The biggest problem is that, if you see current notifier mechanism,
> it's per-memory-region. However iiuc your messages should be
> per-iommu, or say, per translation unit. Meanwhile, for each iommu, there
> can be more than one memory region (ppc can be an example). When
> there is more than one MR bound to the same iommu unit, which
> memory region should you register to? Any one of them, or all?
> 
> So my conclusion is, it just has nothing to do with memory regions...
> 
> Instead of a different function pointer in IOMMUNotifier, IMHO we can
> even move a step further, to isolate IOTLB notifications (targeted at
> memory regions and with start/end ranges) out of SVM/other
> notifications, since they are different in general. So we basically
> need two notification mechanisms:
> 
> - one for memory regions, currently what I can see is IOTLB
>   notifications
> 
> - one for translation units, currently I see all the rest of
>   notifications needed in virt-svm in this category
> 
> Maybe some RFC patches would be good to show what I mean... I'll see
> whether I can prepare some.

Here it is (on qemu-devel):

[RFC PATCH 0/8] IOMMU: introduce common IOMMUObject

Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [Qemu-devel] [RFC PATCH 12/20] Memory: Add func to fire pasidt_bind notifier

2017-04-27 Thread Peter Xu
On Thu, Apr 27, 2017 at 10:37:19AM +0800, Liu, Yi L wrote:
> On Wed, Apr 26, 2017 at 03:50:16PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 26/04/2017 12:06, Liu, Yi L wrote:
> > > +void memory_region_notify_iommu_svm_bind(MemoryRegion *mr,
> > > + void *data)
> > > +{
> > > +IOMMUNotifier *iommu_notifier;
> > > +IOMMUNotifierFlag request_flags;
> > > +
> > > +assert(memory_region_is_iommu(mr));
> > > +
> > > +/*TODO: support other bind requests with smaller gran,
> > > + * e.g. bind single pasid entry
> > > + */
> > > +request_flags = IOMMU_NOTIFIER_SVM_PASIDT_BIND;
> > > +
> > > +QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > > +if (iommu_notifier->notifier_flags & request_flags) {
> > > +iommu_notifier->notify(iommu_notifier, data);
> > > +break;
> > > +}
> > > +}
> > 
> > Peter,
> > 
> > should this reuse ->notify, or should it be different function pointer
> > in IOMMUNotifier?
> 
> Hi Paolo,
> 
> Thx for your review.
> 
> I think it should be “->notify” here. In this patchset, the new notifier
> is registered with the existing notifier registration API. So all the
> notifiers are in the mr->iommu_notify list. And notifiers are labeled
> by notify flag, so it is able to differentiate the IOMMUNotifier nodes.
> When the flag meets, trigger it by “->notify”. The diagram below shows
> my understanding , wish it helps to make me understood.
> 
> VFIOContainer
>|
>giommu_list(VFIOGuestIOMMU)
> \
>  VFIOGuestIOMMU1 ->   VFIOGuestIOMMU2 -> VFIOGuestIOMMU3 ...
> | | |
> mr->iommu_notify: IOMMUNotifier   ->IOMMUNotifier  ->  IOMMUNotifier
>   (Flag:MAP/UNMAP) (Flag:SVM bind)  (Flag:tlb invalidate)
> 
> 
> Actually, compared with the MAP/UNMAP notifier, the newly added notifier has
> no start/end check, and there may be other types of bind notifier flag in
> future, so I added a separate fire func for SVM bind notifier.

I agree with Paolo that this interface might not be the suitable place
for the SVM notifiers (just like what I worried about in previous
discussions).

The biggest problem is that, if you see current notifier mechanism,
it's per-memory-region. However iiuc your messages should be
per-iommu, or say, per translation unit. Meanwhile, for each iommu, there
can be more than one memory region (ppc can be an example). When
there is more than one MR bound to the same iommu unit, which
memory region should you register to? Any one of them, or all?

So my conclusion is, it just has nothing to do with memory regions...

Instead of a different function pointer in IOMMUNotifier, IMHO we can
even move a step further, to isolate IOTLB notifications (targeted at
memory regions and with start/end ranges) out of SVM/other
notifications, since they are different in general. So we basically
need two notification mechanisms:

- one for memory regions, currently what I can see is IOTLB
  notifications

- one for translation units, currently I see all the rest of
  notifications needed in virt-svm in this category

Maybe some RFC patches would be good to show what I mean... I'll see
whether I can prepare some.
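
Roughly something like this (a sketch only; the second struct and its
names are invented for illustration, the first is roughly what we
already have today):

    /* per memory region: address-range based, as today */
    struct IOMMUNotifier {
        IOMMUNotify notify;
        IOMMUNotifierFlag notifier_flags;
        hwaddr start, end;
        QLIST_ENTRY(IOMMUNotifier) node;
    };

    /* per translation unit: PASID table bind, extended invalidations, ... */
    struct IOMMUSVMNotifier {
        void (*notify)(struct IOMMUSVMNotifier *n, void *data);
        QLIST_ENTRY(IOMMUSVMNotifier) node;
    };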

Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu