Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs
On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote: > Isolation between applications is preserved but there is no isolation > between the device and the application itself. The application needs to > trust the device. > > Examples: > > 1. The device can snoop secret data from readable pages in the > application's virtual memory space. > > 2. The device can gain arbitrary execution on the CPU by overwriting > control flow addresses (e.g. function pointers, stack return > addresses) in writable pages. To me, SVA seems to be a "middle layer" of security: it's not as safe as VFIO_IOMMU_MAP_DMA, which has buffer-level granularity of control (but of course we pay overhead on buffer setups and on-the-fly translations); however, it's far better than DMA with no IOMMU, which can ruin the whole host/guest, because after all much of our isolation is process-based. IMHO it's the same as when we view a VM (or the QEMU process) as a whole, together with the guest code. In some cases we don't care if the guest did some bad things to mess up its own QEMU process. It would still be ideal if we could stop the guest from doing so, but when that's not easy to do, we lower the requirement to merely not spreading the damage to the host and other VMs. Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
On Sat, Sep 28, 2019 at 04:23:16PM +0800, Lu Baolu wrote:
> Hi Peter,
>
> On 9/27/19 1:34 PM, Peter Xu wrote:
> > Hi, Baolu,
> >
> > On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
> > > > > > > +	spin_lock(&(domain)->page_table_lock);		\
> > > > > >
> > > > > > Is this intended to lock here instead of taking the lock during the
> > > > > > whole page table walk? Is it safe?
> > > > > >
> > > > > > Taking the example where nm==PTE: when we reach here how do we
> > > > > > guarantee that the PMD page that has this PTE is still valid?
> > > > >
> > > > > We will always keep the non-leaf pages in the table,
> > > >
> > > > I see. Though, could I ask why? It seems to me that the existing 2nd
> > > > level page table does not keep these when unmap, and it's not even use
> > > > locking at all by leveraging cmpxchg()?
> > >
> > > I still need some time to understand how cmpxchg() solves the race issue
> > > when reclaims pages. For example.
> > >
> > > Thread A                    Thread B
> > > -A1: check all PTE's empty  -B1: up-level PDE valid
> > > -A2: clear the up-level PDE
> > > -A3: reclaim the page       -B2: populate the PTEs
> > >
> > > Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen.
> >
> > I'm not sure of this, but IMHO it is similarly because we need to
> > allocate the iova ranges from iova allocator first, so thread A (who's
> > going to unmap pages) and thread B (who's going to map new pages)
> > should never have collapsed regions if happening concurrently. I'm
>
> Although they don't collapse, they might share a same pmd entry. If A
> cleared the pmd entry and B goes ahead with populating the pte's. It
> will crash.

My understanding is that if A does not own all the pages on that PMD entry then it will never free the page that was backing that PMD entry. Please refer to the code in dma_pte_clear_level(), where it has:

	/* If range covers entire pagetable, free it */
	if (start_pfn <= level_pfn &&
	    last_pfn >= level_pfn + level_size(level) - 1) {
		...
	} else {
		...
	}

Note that when going into the else block, the PMD won't be freed; only the PTEs under the PMD will be cleared. In the case you mentioned above, IMHO it should go into that else block. Say, thread A must not cover the whole range of that PMD, otherwise thread B wouldn't have been allocated pages within the range covered by the same PMD.

Thanks,

-- 
Peter Xu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
Hi, Baolu, On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote: > > > > > + spin_lock(&(domain)->page_table_lock); > > > > > \ > > > > > > > > Is this intended to lock here instead of taking the lock during the > > > > whole page table walk? Is it safe? > > > > > > > > Taking the example where nm==PTE: when we reach here how do we > > > > guarantee that the PMD page that has this PTE is still valid? > > > > > > We will always keep the non-leaf pages in the table, > > > > I see. Though, could I ask why? It seems to me that the existing 2nd > > level page table does not keep these when unmap, and it's not even use > > locking at all by leveraging cmpxchg()? > > I still need some time to understand how cmpxchg() solves the race issue > when reclaims pages. For example. > > Thread A Thread B > -A1: check all PTE's empty-B1: up-level PDE valid > -A2: clear the up-level PDE > -A3: reclaim the page -B2: populate the PTEs > > Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen. I'm not sure of this, but IMHO it is similarly because we need to allocate the iova ranges from iova allocator first, so thread A (who's going to unmap pages) and thread B (who's going to map new pages) should never have collapsed regions if happening concurrently. I'm referring to intel_unmap() in which we won't free the iova region before domain_unmap() completes (which should cover the whole process of A1-A3) so the same iova range to be unmapped won't be allocated to any new pages in some other thread. There's also a hint in domain_unmap(): /* we don't need lock here; nobody else touches the iova range */ > > Actually, the iova allocator always packs IOVA ranges close to the top > of the address space. This results in requiring a minimal number of > pages to map the allocated IOVA ranges, which makes memory onsumption > by IOMMU page tables tolerable. Hence, we don't need to reclaim the > pages until the whole page table is about to tear down. 
The real data > on my test machine also improves this. Do you mean you have run the code with a 1st-level-supported IOMMU hardware? IMHO any data point would be good to be in the cover letter as reference. [...] > > > > > +static struct page * > > > > > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd, > > > > > + unsigned long addr, unsigned long end, > > > > > + struct page *freelist, bool reclaim) > > > > > +{ > > > > > + int i; > > > > > + unsigned long start; > > > > > + pte_t *pte, *first_pte; > > > > > + > > > > > + start = addr; > > > > > + pte = pte_offset_kernel(pmd, addr); > > > > > + first_pte = pte; > > > > > + do { > > > > > + set_pte(pte, __pte(0)); > > > > > + } while (pte++, addr += PAGE_SIZE, addr != end); > > > > > + > > > > > + domain_flush_cache(domain, first_pte, (void *)pte - (void > > > > > *)first_pte); > > > > > + > > > > > + /* Add page to free list if all entries are empty. */ > > > > > + if (reclaim) { > > > > > > > > Shouldn't we know whether to reclaim if with (addr, end) specified as > > > > long as they cover the whole range of this PMD? > > > > > > Current policy is that we don't reclaim any pages until the whole page > > > table will be torn down. > > > > Ah OK. But I saw that you're passing in relaim==!start_addr. > > Shouldn't that errornously trigger if one wants to unmap the 1st page > > as well even if not the whole address space? > > IOVA 0 is assumed to be reserved by the allocator. Otherwise, we have no > means to check whether a IOVA is valid. Is this an assumption of the allocator? Could that change in the future? IMHO that's not necessary if so, after all it's as simple as replacing (!start_addr) with (start == 0 && end == END). I see that in domain_unmap() it has a similar check when freeing pgd: if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) Thanks, -- Peter Xu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
LE));\
> > > > It seems to me that PV could trap calls to set_pte(). Then these
> > could also be trapped by e.g. Xen? Are these traps needed? Is there
> > side effect? I'm totally not familiar with this, but just ask aloud...
>
> Good catch. But I don't think a vIOMMU could get a chance to run in a PV
> environment. I might miss something?

I don't know... Is there a reason not to allow a Xen guest to use 1st level mapping? While on the other side... if the PV interface will never be used, then could native_set_##nm() be used directly?

[...]

> > > +static struct page *
> > > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> > > +		  unsigned long addr, unsigned long end,
> > > +		  struct page *freelist, bool reclaim)
> > > +{
> > > +	int i;
> > > +	unsigned long start;
> > > +	pte_t *pte, *first_pte;
> > > +
> > > +	start = addr;
> > > +	pte = pte_offset_kernel(pmd, addr);
> > > +	first_pte = pte;
> > > +	do {
> > > +		set_pte(pte, __pte(0));
> > > +	} while (pte++, addr += PAGE_SIZE, addr != end);
> > > +
> > > +	domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
> > > +
> > > +	/* Add page to free list if all entries are empty. */
> > > +	if (reclaim) {
> >
> > Shouldn't we know whether to reclaim if with (addr, end) specified as
> > long as they cover the whole range of this PMD?
>
> Current policy is that we don't reclaim any pages until the whole page
> table will be torn down.

Ah OK. But I saw that you're passing in reclaim == !start_addr. Shouldn't that erroneously trigger if one wants to unmap the 1st page as well, even if not the whole address space?

> The gain is that we don't have to use a
> spinlock when map/unmap a pmd entry anymore.

So this question is also related to the locking above - have you thought about using the same way (IIUC) as the 2nd level page table, i.e. cmpxchg()? AFAIU that does not need any lock?
For me it's perfectly fine to use a lock at least for the initial version; I just want to know the considerations behind it in case I missed anything important. Thanks, -- Peter Xu
Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
On Wed, Sep 25, 2019 at 08:02:23AM +, Tian, Kevin wrote: > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Wednesday, September 25, 2019 3:45 PM > > > > On Wed, Sep 25, 2019 at 07:21:51AM +, Tian, Kevin wrote: > > > > From: Peter Xu [mailto:pet...@redhat.com] > > > > Sent: Wednesday, September 25, 2019 2:57 PM > > > > > > > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote: > > > > > Hi Kevin, > > > > > > > > > > On 9/24/19 3:00 PM, Tian, Kevin wrote: > > > > > > > > > '---' > > > > > > > > > '---' > > > > > > > > > > > > > > > > > > This patch series only aims to achieve the first goal, a.k.a > > > > > > > > > using > > > > > > first goal? then what are other goals? I didn't spot such > > > > > > information. > > > > > > > > > > > > > > > > The overall goal is to use IOMMU nested mode to avoid shadow page > > > > table > > > > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe > > not > > > > > accurate, but you could get the point.) > > > > > > > > > > 1) GIOVA mappings over 1st-level page table; > > > > > 2) binding vIOMMU 1st level page table to the pIOMMU; > > > > > 3) using pIOMMU second level for GPA->HPA translation; > > > > > 4) enable nested (a.k.a. dual stage) translation in host. > > > > > > > > > > This patch set aims to achieve 1). > > > > > > > > Would it make sense to use 1st level even for bare-metal to replace > > > > the 2nd level? > > > > > > > > What I'm thinking is the DPDK apps - they have MMU page table already > > > > there for the huge pages, then if they can use 1st level as the > > > > default device page table then it even does not need to map, because > > > > it can simply bind the process root page table pointer to the 1st > > > > level page root pointer of the device contexts that it uses. > > > > > > > > > > Then you need bear with possible page faults from using CPU page > > > table, while most devices don't support it today. > > > > Right, I was just thinking aloud. 
After all neither do we have IOMMU > > hardware to support 1st level (or am I wrong?)... It's just that when > > You are right. Current VT-d supports only 2nd level. > > > the 1st level is ready it should sound doable because IIUC PRI should > > be always with the 1st level support no matter on IOMMU side or the > > device side? > > No. PRI is not tied to 1st or 2nd level. Actually from device p.o.v, it's > just a protocol to trigger page fault, but the device doesn't care whether > the page fault is on 1st or 2nd level in the IOMMU side. The only > relevant part is that a PRI request can have PASID tagged or cleared. > When it's tagged with PASID, the IOMMU will locate the translation > table under the given PASID (either 1st or 2nd level is fine, according > to PASID entry setting). When no PASID is included, the IOMMU locates > the translation from default entry (e.g. PASID#0 or any PASID contained > in RID2PASID in VT-d). > > Your knowledge happened to be correct in deprecated ECS mode. At > that time, there is only one 2nd level per context entry which doesn't > support page fault, and there is only one 1st level per PASID entry which > supports page fault. Then PRI could be indirectly connected to 1st level, > but this just changed with new scalable mode. > > Another note is that the PRI capability only indicates that a device is > capable of handling page faults, but not that a device can tolerate > page fault for any of its DMA access. If the latter is fasle, using CPU > page table for DPDK usage is still risky (and specific to device behavior) > > > > > I'm actually not sure about whether my understanding here is > > correct... I thought the pasid binding previously was only for some > > vendor kernel drivers but not a general thing to userspace. I feel > > like that should be doable in the future once we've got some new > > syscall interface ready to deliver 1st level page table (e.g., via > > vfio?) 
then applications like DPDK seems to be able to use that too > > even directly via bare metal. > > > > using 1st level for userspace is different from supporting DMA page > fault in userspace. The former is purely about which structure to > keep the mapping. I think we may do the same thing for both bare > metal and guest (using 2nd level only for GPA when nested is enabled > on the IOMMU). But reusing CPU page table for userspace is more > tricky. :-) Yes I should have mixed up the 1st level page table and PRI a bit, and after all my initial question should be irrelevant to this series as well so it's already a bit out of topic (sorry for that). And, thanks for explaining these. :) -- Peter Xu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
On Wed, Sep 25, 2019 at 07:32:48AM +, Tian, Kevin wrote: > > From: Lu Baolu [mailto:baolu...@linux.intel.com] > > Sent: Wednesday, September 25, 2019 2:52 PM > > > > Hi Peter and Kevin, > > > > On 9/25/19 1:24 PM, Peter Xu wrote: > > > On Wed, Sep 25, 2019 at 04:38:31AM +0000, Tian, Kevin wrote: > > >>> From: Peter Xu [mailto:pet...@redhat.com] > > >>> Sent: Wednesday, September 25, 2019 12:31 PM > > >>> > > >>> On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote: > > >>>>>> intel_mmmap_range(domain, addr, end, phys_addr, prot) > > >>>>> > > >>>>> Maybe think of a different name..? mmmap seems a bit weird :-) > > >>>> > > >>>> Yes. I don't like it either. I've thought about it and haven't > > >>>> figured out a satisfied one. Do you have any suggestions? > > >>> > > >>> How about at least split the word using "_"? Like "mm_map", then > > >>> apply it to all the "mmm*" prefixes. Otherwise it'll be easily > > >>> misread as mmap() which is totally irrelevant to this... > > >>> > > >> > > >> what is the point of keeping 'mm' here? replace it with 'iommu'? > > > > > > I'm not sure of what Baolu thought, but to me "mm" makes sense itself > > > to identify this from real IOMMU page tables (because IIUC these will > > > be MMU page tables). We can come up with better names, but IMHO > > > "iommu" can be a bit misleading to let people refer to the 2nd level > > > page table. > > > > "mm" represents a CPU (first level) page table; > > > > vs. > > > > "io" represents an IOMMU (second level) page table. > > > > IOMMU first level is not equivalent to CPU page table, though you can > use the latter as the first level (e.g. in SVA). Especially here you are > making IOVA->GPA as the first level, which is not CPU page table. > > btw both levels are for "io" i.e. DMA purposes from VT-d p.o.v. They > are just hierarchical structures implemented by VT-d, with slightly > different format. Regarding to the "slightly different format", do you mean the extended-accessed bit? 
Even if there are differences, they do look very similar. If you see the same chap 9.7 table, the elements are exactly called PTE, PDE, PDPE, and so on - they're named exactly the same as MMU page tables. With that, IMHO it still sounds reasonable if we want to relate this "1st level iommu page table" with the existing MMU page table using the "mm" prefix... Regards, -- Peter Xu
Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
On Wed, Sep 25, 2019 at 07:21:51AM +, Tian, Kevin wrote: > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Wednesday, September 25, 2019 2:57 PM > > > > On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote: > > > Hi Kevin, > > > > > > On 9/24/19 3:00 PM, Tian, Kevin wrote: > > > > > > > '---' > > > > > > > '---' > > > > > > > > > > > > > > This patch series only aims to achieve the first goal, a.k.a using > > > > first goal? then what are other goals? I didn't spot such information. > > > > > > > > > > The overall goal is to use IOMMU nested mode to avoid shadow page > > table > > > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not > > > accurate, but you could get the point.) > > > > > > 1) GIOVA mappings over 1st-level page table; > > > 2) binding vIOMMU 1st level page table to the pIOMMU; > > > 3) using pIOMMU second level for GPA->HPA translation; > > > 4) enable nested (a.k.a. dual stage) translation in host. > > > > > > This patch set aims to achieve 1). > > > > Would it make sense to use 1st level even for bare-metal to replace > > the 2nd level? > > > > What I'm thinking is the DPDK apps - they have MMU page table already > > there for the huge pages, then if they can use 1st level as the > > default device page table then it even does not need to map, because > > it can simply bind the process root page table pointer to the 1st > > level page root pointer of the device contexts that it uses. > > > > Then you need bear with possible page faults from using CPU page > table, while most devices don't support it today. Right, I was just thinking aloud. After all neither do we have IOMMU hardware to support 1st level (or am I wrong?)... It's just that when the 1st level is ready it should sound doable because IIUC PRI should be always with the 1st level support no matter on IOMMU side or the device side? I'm actually not sure about whether my understanding here is correct... 
I thought the pasid binding previously was only for some vendor kernel drivers but not a general thing to userspace. I feel like that should be doable in the future once we've got some new syscall interface ready to deliver the 1st level page table (e.g., via vfio?); then applications like DPDK seem able to use that too, even directly on bare metal. Regards, -- Peter Xu
Re: [RFC PATCH 0/4] Use 1st-level for DMA remapping in guest
On Wed, Sep 25, 2019 at 10:48:32AM +0800, Lu Baolu wrote: > Hi Kevin, > > On 9/24/19 3:00 PM, Tian, Kevin wrote: > > > > > '---' > > > > > '---' > > > > > > > > > > This patch series only aims to achieve the first goal, a.k.a using > > first goal? then what are other goals? I didn't spot such information. > > > > The overall goal is to use IOMMU nested mode to avoid shadow page table > and VMEXIT when map an gIOVA. This includes below 4 steps (maybe not > accurate, but you could get the point.) > > 1) GIOVA mappings over 1st-level page table; > 2) binding vIOMMU 1st level page table to the pIOMMU; > 3) using pIOMMU second level for GPA->HPA translation; > 4) enable nested (a.k.a. dual stage) translation in host. > > This patch set aims to achieve 1). Would it make sense to use 1st level even for bare-metal to replace the 2nd level? What I'm thinking is the DPDK apps - they have MMU page table already there for the huge pages, then if they can use 1st level as the default device page table then it even does not need to map, because it can simply bind the process root page table pointer to the 1st level page root pointer of the device contexts that it uses. Regards, -- Peter Xu
Re: [RFC PATCH 4/4] iommu/vt-d: Identify domains using first level page table
On Mon, Sep 23, 2019 at 08:24:54PM +0800, Lu Baolu wrote: > +/* > + * Check and return whether first level is used by default for > + * DMA translation. > + */ > +static bool first_level_by_default(void) > +{ > + struct dmar_drhd_unit *drhd; > + struct intel_iommu *iommu; > + > + rcu_read_lock(); > + for_each_active_iommu(iommu, drhd) > + if (!sm_supported(iommu) || > + !ecap_flts(iommu->ecap) || > + !cap_caching_mode(iommu->cap)) > + return false; > + rcu_read_unlock(); > + > + return true; > +} "If no caching mode, then we will not use 1st level." Hmm, does the vIOMMU need to support caching-mode with the solution you proposed here? Caching mode is only necessary for shadowing AFAICT, and after all you're going to use full nesting, so... I would think it's not needed. And if so, with this patch the 1st level will be disabled. Sounds like a paradox... I'm thinking about what the big picture would be for this to work: for the vIOMMU, instead of exposing caching-mode, maybe we should expose ecap.FLTS=1 while keeping ecap.SLTS=0; then it's natural that we can only use 1st level translation in the guest for all the domains (and I assume such an ecap value should never happen on real hardware, am I right?). Regards, -- Peter Xu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
On Wed, Sep 25, 2019 at 04:38:31AM +, Tian, Kevin wrote: > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Wednesday, September 25, 2019 12:31 PM > > > > On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote: > > > > > intel_mmmap_range(domain, addr, end, phys_addr, prot) > > > > > > > > Maybe think of a different name..? mmmap seems a bit weird :-) > > > > > > Yes. I don't like it either. I've thought about it and haven't > > > figured out a satisfied one. Do you have any suggestions? > > > > How about at least split the word using "_"? Like "mm_map", then > > apply it to all the "mmm*" prefixes. Otherwise it'll be easily > > misread as mmap() which is totally irrelevant to this... > > > > what is the point of keeping 'mm' here? replace it with 'iommu'? I'm not sure of what Baolu thought, but to me "mm" makes sense itself to identify this from real IOMMU page tables (because IIUC these will be MMU page tables). We can come up with better names, but IMHO "iommu" can be a bit misleading to let people refer to the 2nd level page table. Regards, -- Peter Xu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
set_pte(pte, pfn_pte(pfn, prot)); > + pfn++; > + } while (pte++, addr += PAGE_SIZE, addr != end); > + > + domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte); > + > + return 0; > +} > + > +static int > +mmmap_pmd_range(struct dmar_domain *domain, pud_t *pud, unsigned long addr, > + unsigned long end, phys_addr_t phys_addr, pgprot_t prot) > +{ > + unsigned long next; > + pmd_t *pmd; > + > + if (unlikely(pud_none(*pud))) > + pgtable_populate(domain, pud); > + pmd = pmd_offset(pud, addr); > + > + phys_addr -= addr; > + do { > + next = pmd_addr_end(addr, end); > + if (mmmap_pte_range(domain, pmd, addr, next, > + phys_addr + addr, prot)) > + return -ENOMEM; How about return the errcode of mmmap_pte_range() directly? > + } while (pmd++, addr = next, addr != end); > + > + return 0; > +} > + > +static int > +mmmap_pud_range(struct dmar_domain *domain, p4d_t *p4d, unsigned long addr, > + unsigned long end, phys_addr_t phys_addr, pgprot_t prot) > +{ > + unsigned long next; > + pud_t *pud; > + > + if (unlikely(p4d_none(*p4d))) > + pgtable_populate(domain, p4d); > + > + pud = pud_offset(p4d, addr); > + > + phys_addr -= addr; > + do { > + next = pud_addr_end(addr, end); > + if (mmmap_pmd_range(domain, pud, addr, next, > + phys_addr + addr, prot)) > + return -ENOMEM; Same. > + } while (pud++, addr = next, addr != end); > + > + return 0; > +} > + > +static int > +mmmap_p4d_range(struct dmar_domain *domain, pgd_t *pgd, unsigned long addr, > + unsigned long end, phys_addr_t phys_addr, pgprot_t prot) > +{ > + unsigned long next; > + p4d_t *p4d; > + > + if (cpu_feature_enabled(X86_FEATURE_LA57) && unlikely(pgd_none(*pgd))) > + pgtable_populate(domain, pgd); > + > + p4d = p4d_offset(pgd, addr); > + > + phys_addr -= addr; > + do { > + next = p4d_addr_end(addr, end); > + if (mmmap_pud_range(domain, p4d, addr, next, > + phys_addr + addr, prot)) > + return -ENOMEM; Same. 
> + } while (p4d++, addr = next, addr != end); > + > + return 0; > +} > + > +int intel_mmmap_range(struct dmar_domain *domain, unsigned long addr, > + unsigned long end, phys_addr_t phys_addr, int dma_prot) > +{ > + unsigned long next; > + pgprot_t prot; > + pgd_t *pgd; > + > + trace_domain_mm_map(domain, addr, end, phys_addr); > + > + /* > + * There is no PAGE_KERNEL_WO for a pte entry, so let's use RW > + * for a pte that requires write operation. > + */ > + prot = dma_prot & DMA_PTE_WRITE ? PAGE_KERNEL : PAGE_KERNEL_RO; > + BUG_ON(addr >= end); > + > + phys_addr -= addr; > + pgd = pgd_offset_pgd(domain->pgd, addr); > + do { > + next = pgd_addr_end(addr, end); > + if (mmmap_p4d_range(domain, pgd, addr, next, > + phys_addr + addr, prot)) > + return -ENOMEM; Same. > + } while (pgd++, addr = next, addr != end); > + > + return 0; > +} > + > +/* > + * mmunmap: Unmap an existing mapping between a range of IO vitual address > + * and physical addresses. > + */ > +static struct page * > +mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd, > + unsigned long addr, unsigned long end, > + struct page *freelist, bool reclaim) > +{ > + int i; > + unsigned long start; > + pte_t *pte, *first_pte; > + > + start = addr; > + pte = pte_offset_kernel(pmd, addr); > + first_pte = pte; > + do { > + set_pte(pte, __pte(0)); > + } while (pte++, addr += PAGE_SIZE, addr != end); > + > + domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte); > + > + /* Add page to free list if all entries are empty. */ > + if (reclaim) { Shouldn't we know whether to reclaim if with (addr, end) specified as long as they cover the whole range of this PMD? Also I noticed that this value right now is passed in from the very top of the unmap() call. I didn't really catch the point of that... I'll have similar question to below a few places but I'll skip to comment after I understand this one. 
> + struct page *pte_page; > + > + pte = (pte_t *)pmd_page_vaddr(*pmd); > + for (i = 0; i < PTRS_PER_PTE; i++) > + if (!pte || !pte_none(pte[i])) > + goto pte_out; > + > + pte_page = pmd_page(*pmd); > + pte_page->freelist = freelist; > + freelist = pte_page; > + pmd_clear(pmd); > + domain_flush_cache(domain, pmd, sizeof(pmd_t)); > + } > + > +pte_out: > + return freelist; > +} Regards, -- Peter Xu
Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces
On Tue, Sep 24, 2019 at 09:38:53AM +0800, Lu Baolu wrote: > > > intel_mmmap_range(domain, addr, end, phys_addr, prot) > > > > Maybe think of a different name..? mmmap seems a bit weird :-) > > Yes. I don't like it either. I've thought about it and haven't > figured out a satisfied one. Do you have any suggestions? How about at least splitting the word with "_"? Like "mm_map", then applying that to all the "mmm*" prefixes. Otherwise it'll easily be misread as mmap(), which is totally irrelevant to this... Regards, -- Peter Xu
[PATCH] Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
This reverts commit 7560cc3ca7d9d11555f80c830544e463fcdb28b8. With 5.2.0-rc5 I can easily trigger this with lockdep and iommu=pt: == WARNING: possible circular locking dependency detected 5.2.0-rc5 #78 Not tainted -- swapper/0/1 is trying to acquire lock: ea2b3beb (&(>lock)->rlock){+.+.}, at: domain_context_mapping_one+0xa5/0x4e0 but task is already holding lock: a681907b (device_domain_lock){}, at: domain_context_mapping_one+0x8d/0x4e0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (device_domain_lock){}: _raw_spin_lock_irqsave+0x3c/0x50 dmar_insert_one_dev_info+0xbb/0x510 domain_add_dev_info+0x50/0x90 dev_prepare_static_identity_mapping+0x30/0x68 intel_iommu_init+0xddd/0x1422 pci_iommu_init+0x16/0x3f do_one_initcall+0x5d/0x2b4 kernel_init_freeable+0x218/0x2c1 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 -> #0 (&(>lock)->rlock){+.+.}: lock_acquire+0x9e/0x170 _raw_spin_lock+0x25/0x30 domain_context_mapping_one+0xa5/0x4e0 pci_for_each_dma_alias+0x30/0x140 dmar_insert_one_dev_info+0x3b2/0x510 domain_add_dev_info+0x50/0x90 dev_prepare_static_identity_mapping+0x30/0x68 intel_iommu_init+0xddd/0x1422 pci_iommu_init+0x16/0x3f do_one_initcall+0x5d/0x2b4 kernel_init_freeable+0x218/0x2c1 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0CPU1 lock(device_domain_lock); lock(&(>lock)->rlock); lock(device_domain_lock); lock(&(>lock)->rlock); *** DEADLOCK *** 2 locks held by swapper/0/1: #0: 033eb13d (dmar_global_lock){}, at: intel_iommu_init+0x1e0/0x1422 #1: a681907b (device_domain_lock){}, at: domain_context_mapping_one+0x8d/0x4e0 stack backtrace: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5 #78 Hardware name: LENOVO 20KGS35G01/20KGS35G01, BIOS N23ET50W (1.25 ) 06/25/2018 Call Trace: dump_stack+0x85/0xc0 print_circular_bug.cold.57+0x15c/0x195 __lock_acquire+0x152a/0x1710 lock_acquire+0x9e/0x170 ? 
domain_context_mapping_one+0xa5/0x4e0 _raw_spin_lock+0x25/0x30 ? domain_context_mapping_one+0xa5/0x4e0 domain_context_mapping_one+0xa5/0x4e0 ? domain_context_mapping_one+0x4e0/0x4e0 pci_for_each_dma_alias+0x30/0x140 dmar_insert_one_dev_info+0x3b2/0x510 domain_add_dev_info+0x50/0x90 dev_prepare_static_identity_mapping+0x30/0x68 intel_iommu_init+0xddd/0x1422 ? printk+0x58/0x6f ? lockdep_hardirqs_on+0xf0/0x180 ? do_early_param+0x8e/0x8e ? e820__memblock_setup+0x63/0x63 pci_iommu_init+0x16/0x3f do_one_initcall+0x5d/0x2b4 ? do_early_param+0x8e/0x8e ? rcu_read_lock_sched_held+0x55/0x60 ? do_early_param+0x8e/0x8e kernel_init_freeable+0x218/0x2c1 ? rest_init+0x230/0x230 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 domain_context_mapping_one() is taking device_domain_lock first then iommu lock, while dmar_insert_one_dev_info() is doing the reverse. That should be introduced by commit: 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock", 2019-05-27) So far I still cannot figure out how the previous deadlock was triggered (I cannot find iommu lock taken before calling of iommu_flush_dev_iotlb()), however I'm pretty sure that that change should be incomplete at least because it does not fix all the places so we're still taking the locks in different orders, while reverting that commit is very clean to me so far that we should always take device_domain_lock first then the iommu lock. We can continue to try to find the real culprit mentioned in 7560cc3ca7d9, but for now I think we should revert it to fix current breakage. 
CC: Joerg Roedel
CC: Lu Baolu
CC: dave.ji...@intel.com
Signed-off-by: Peter Xu
---
 drivers/iommu/intel-iommu.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 56297298d6ee..162b3236e72c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2504,7 +2504,6 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
 		}
 	}
 
-	spin_lock(&iommu->lock);
 	spin_lock_irqsave(&device_domain_lock, flags);
 	if (dev)
Re: VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)
On Fri, Jun 21, 2019 at 09:58:28AM +0800, Lu Baolu wrote: > Hi Peter, > > I agree with you that 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between > iommu->lock and device_domain_lock") isn't a good fix. There > is also another thread, https://lkml.org/lkml/2019/6/18/996, which > reported this. > > I think we can revert this patch now. I will try to reproduce the > original issue and try to find a new fix. > > Can you please submit the revert patch? Sure, I'll post a formal patch soon. Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)
Jun 20 14:37:37 xz-x1 kernel:  ? printk+0x58/0x6f
Jun 20 14:37:37 xz-x1 kernel:  ? lockdep_hardirqs_on+0xf0/0x180
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  ? e820__memblock_setup+0x63/0x63
Jun 20 14:37:37 xz-x1 kernel:  pci_iommu_init+0x16/0x3f
Jun 20 14:37:37 xz-x1 kernel:  do_one_initcall+0x5d/0x2b4
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  ? rcu_read_lock_sched_held+0x55/0x60
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  kernel_init_freeable+0x218/0x2c1
Jun 20 14:37:37 xz-x1 kernel:  ? rest_init+0x230/0x230
Jun 20 14:37:37 xz-x1 kernel:  kernel_init+0xa/0x100
Jun 20 14:37:37 xz-x1 kernel:  ret_from_fork+0x3a/0x50

domain_context_mapping_one() takes device_domain_lock first and then the iommu lock, while dmar_insert_one_dev_info() does the reverse. I dug a bit and found this commit, which seems suspicious:

7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock", 2019-05-27)

More interestingly, it was trying to fix the inverted deadlock... The thing is that even if the above commit is correct on the ordering (I still find it strange that we need to take a per-iommu lock before another global lock), it seems to be an incomplete fix, because there are still other places taking the locks the other way round. Reading deeper into that commit message, it seems to say that before reaching iommu_flush_dev_iotlb() we have taken the iommu lock somewhere, but I cannot find a path where the iommu lock is held when reaching iommu_flush_dev_iotlb(), so I cannot understand how that lockdep warning could trigger. I reverted that commit and now everything is good here (no long runs yet, but at least the previous deadlock issue is fixed).
And with that, IMHO we'll actually have the correct ordering in the whole repository that we'll take device_domain_lock before per iommu lock always. Is there anything I've missed on why we have had 7560cc3ca7d9? Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: What is the meaning of PASID_MIN?
On Tue, Feb 12, 2019 at 10:44:23AM +0800, Lu Baolu wrote: > Hi, > > On 2/12/19 3:28 AM, Matthew Wilcox wrote: > > > > I'm looking at commit 562831747f6299abd481b5b00bd4fa19d5c8a259 > > which fails to adequately explain why we can't use PASID 0. Commit > > af39507305fb83a5d3c475c2851f4d59545d8a18 also doesn't explain why PASID > > 0 is no longer usable for the intel-svm driver. > > Sorry that we didn't make it clear. > > > > > There are a load of simplifications that could be made to this, but I > > don't know which ones to suggest without a clear understanding of the > > problem you're actually trying to solve. > > > > PASID 0 has been reserved by Intel IOMMU driver for RID_PASID purpose. > > VT-d scalable mode treats all address translation as PASID granularity. > For DMA requests-with-PASID, the PASID value in the DMA request will be > used. For DMA requests-without-PASID, VT-d will use a static PASID value > specified in the RID_PASID field of the context entry. PASID 0 has been > reserved for this usage for all devices. > > (Please refer to 9.4 of the spec 3.0 for more details.) Hi, Baolu, I have a similar confusion. If PASID==0 is reserved for requests-without-PASID, then does it mean that for each scalable mode context entry the RID_PASID field will always be zero? Or say, since we already have the per-context-entry RID_PASID field which seems to be configurable, why we still need to reserve the PASID==0? Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Fri, Feb 01, 2019 at 11:46:00AM +, Jean-Philippe Brucker wrote: > On 01/02/2019 03:51, Peter Xu wrote: > > On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote: > >> On 31/01/2019 07:59, Peter Xu wrote: > >>> On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: > >>>> Hi Peter, > >>> > >>> Hi, Jean, > >>> > >>>> > >>>> On 30/01/2019 05:57, Peter Xu wrote: > >>>>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing > >>>>> but that's actually already covered by invalidate_range(). Remove the > >>>>> extra notifier and the chunks. > >>>> > >>>> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If > >>>> I understood correctly, when doing reclaim the kernel clears the young > >>>> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that > >>>> new accesses from devices will go through the IOMMU, set the young bit > >>>> again and the kernel can accurately track page use. I didn't see > >>>> invalidate_range() being called by rmap or vmscan in this case, but > >>>> might just be missing it. > >>>> > >>>> Two MMU notifiers are used for the young bit, clear_flush_young() and > >>>> clear_young(). I believe the former should invalidate ATCs so that DMA > >>>> accesses participate in PTE aging. Otherwise the kernel can't know that > >>>> the device is still using entries marked 'old'. The latter, > >>>> clear_young() is exempted from invalidating the secondary TLBs by > >>>> mmu_notifier.h and the IOMMU driver doesn't need to implement it. > >>> > >>> Yes I agree most of your analysis, but IMHO the problem is that the > >>> AMD IOMMU is not really participating in the PTE aging after all. > >>> Please have a look at mn_clear_flush_young() below at [1] - it's > >>> always returning zero, no matter whether the page has been accessed by > >>> device or not. 
A real user of it could be someone like KVM (please > >>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the > >>> shadow PTEs and do test-and-clear on that, then return the result to > >>> the core mm. That's why I think currently the AMD driver was only > >>> trying to use that as a way to do an extra flush however IMHO it's > >>> redundant. > >> > >> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the > >> flush, since the level-1 page table is shared with the CPU. But removing > >> the flush gets you in the following situation: > >> > >> (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE > >> young and the entry is cached in the ATC. > >> > >> (2) The CPU does ptep_clear_flush_young_notify(), clears young, > >> notices that the page is being used. > >> > >> (3) Device accesses $addr again. Given that we didn't invalidate the > >> ATC in (2) it accesses the page directly, without going through the IOMMU. > >> > >> (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't > >> have the young bit, which means the page isn't being used and can be > >> reclaimed. > >> > >> That's not accurate since the page is being used by the device. At step > >> (2) we should invalidate the ATC, so that (3) fetches the PTE again and > >> marks it young. > >> > >> I can agree that the clear_flush_young() notifier is too brutal for this > >> purpose, since we send spurious ATC invalidation even when the PTE > >> wasn't young (and ATC inv is expensive). There should be another MMU > >> notifier "flush_young()" that is called by > >> ptep_clear_flush_young_notify() only when the page was actually young. > >> But for the moment it's the best we have to avoid the situation above. > >> > >> I don't know enough about mm to understand exactly how the > >> page_referenced() information is used, but I believe the problem is only > >> about accuracy and not correctness. 
So applying this patch might not > >> break anything (after all, intel-svm.c never implemented the notifier) > >> but I think I'll keep the notifier in my SVA rework [1]. > > > > I see your point. I'm not an expert of mm either, but I'd say AFAIU > > this happens in CPU TLB too.
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote: > On 31/01/2019 07:59, Peter Xu wrote: > > On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: > >> Hi Peter, > > > > Hi, Jean, > > > >> > >> On 30/01/2019 05:57, Peter Xu wrote: > >>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing > >>> but that's actually already covered by invalidate_range(). Remove the > >>> extra notifier and the chunks. > >> > >> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If > >> I understood correctly, when doing reclaim the kernel clears the young > >> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that > >> new accesses from devices will go through the IOMMU, set the young bit > >> again and the kernel can accurately track page use. I didn't see > >> invalidate_range() being called by rmap or vmscan in this case, but > >> might just be missing it. > >> > >> Two MMU notifiers are used for the young bit, clear_flush_young() and > >> clear_young(). I believe the former should invalidate ATCs so that DMA > >> accesses participate in PTE aging. Otherwise the kernel can't know that > >> the device is still using entries marked 'old'. The latter, > >> clear_young() is exempted from invalidating the secondary TLBs by > >> mmu_notifier.h and the IOMMU driver doesn't need to implement it. > > > > Yes I agree most of your analysis, but IMHO the problem is that the > > AMD IOMMU is not really participating in the PTE aging after all. > > Please have a look at mn_clear_flush_young() below at [1] - it's > > always returning zero, no matter whether the page has been accessed by > > device or not. A real user of it could be someone like KVM (please > > see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the > > shadow PTEs and do test-and-clear on that, then return the result to > > the core mm. 
That's why I think currently the AMD driver was only > > trying to use that as a way to do an extra flush however IMHO it's > > redundant. > > Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the > flush, since the level-1 page table is shared with the CPU. But removing > the flush gets you in the following situation: > > (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE > young and the entry is cached in the ATC. > > (2) The CPU does ptep_clear_flush_young_notify(), clears young, > notices that the page is being used. > > (3) Device accesses $addr again. Given that we didn't invalidate the > ATC in (2) it accesses the page directly, without going through the IOMMU. > > (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't > have the young bit, which means the page isn't being used and can be > reclaimed. > > That's not accurate since the page is being used by the device. At step > (2) we should invalidate the ATC, so that (3) fetches the PTE again and > marks it young. > > I can agree that the clear_flush_young() notifier is too brutal for this > purpose, since we send spurious ATC invalidation even when the PTE > wasn't young (and ATC inv is expensive). There should be another MMU > notifier "flush_young()" that is called by > ptep_clear_flush_young_notify() only when the page was actually young. > But for the moment it's the best we have to avoid the situation above. > > I don't know enough about mm to understand exactly how the > page_referenced() information is used, but I believe the problem is only > about accuracy and not correctness. So applying this patch might not > break anything (after all, intel-svm.c never implemented the notifier) > but I think I'll keep the notifier in my SVA rework [1]. I see your point. I'm not an expert of mm either, but I'd say AFAIU this happens in CPU TLB too. 
Please have a look at ptep_clear_flush_young(): int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { /* * On x86 CPUs, clearing the accessed bit without a TLB flush * doesn't cause data corruption. [ It could cause incorrect * page aging and the (mistaken) reclaim of hot pages, but the * chance of that should be relatively low. ] * * So as a performance optimization don't flush the TLB when * clearing the accessed bit, it will eventually be flushed by * a context switch or a VM operation anyway. [ In the rare * event of it not getting flushed for a long time the delay * shouldn't really matter bec
Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote: > Hi Peter, Hi, Jean, > > On 30/01/2019 05:57, Peter Xu wrote: > > AMD IOMMU driver is using the clear_flush_young() to do cache flushing > > but that's actually already covered by invalidate_range(). Remove the > > extra notifier and the chunks. > > Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If > I understood correctly, when doing reclaim the kernel clears the young > bit from the PTE. This requires flushing secondary TLBs (ATCs), so that > new accesses from devices will go through the IOMMU, set the young bit > again and the kernel can accurately track page use. I didn't see > invalidate_range() being called by rmap or vmscan in this case, but > might just be missing it. > > Two MMU notifiers are used for the young bit, clear_flush_young() and > clear_young(). I believe the former should invalidate ATCs so that DMA > accesses participate in PTE aging. Otherwise the kernel can't know that > the device is still using entries marked 'old'. The latter, > clear_young() is exempted from invalidating the secondary TLBs by > mmu_notifier.h and the IOMMU driver doesn't need to implement it. Yes I agree most of your analysis, but IMHO the problem is that the AMD IOMMU is not really participating in the PTE aging after all. Please have a look at mn_clear_flush_young() below at [1] - it's always returning zero, no matter whether the page has been accessed by device or not. A real user of it could be someone like KVM (please see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the shadow PTEs and do test-and-clear on that, then return the result to the core mm. That's why I think currently the AMD driver was only trying to use that as a way to do an extra flush however IMHO it's redundant. 
Thanks, > > -static int mn_clear_flush_young(struct mmu_notifier *mn, > > - struct mm_struct *mm, > > - unsigned long start, > > - unsigned long end) > > -{ > > - for (; start < end; start += PAGE_SIZE) > > - __mn_flush_page(mn, start); > > - > > - return 0; [1] -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/2] iommu/vt-d: Remove change_pte notifier
The change_pte() interface is tailored for PFN updates, while the other notifier invalidate_range() should be enough for Intel IOMMU cache flushing. Actually we've done a similar thing for AMD IOMMU already in 8301da53fbc1 ("iommu/amd: Remove change_pte mmu_notifier call-back", 2014-07-30) but the Intel IOMMU driver still has it.

Signed-off-by: Peter Xu
---
 drivers/iommu/intel-svm.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index a2a2aa4439aa..e9fd3ca057ac 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -180,14 +180,6 @@ static void intel_flush_svm_range(struct intel_svm *svm, unsigned long address,
 	rcu_read_unlock();
 }
 
-static void intel_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
-			     unsigned long address, pte_t pte)
-{
-	struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
-
-	intel_flush_svm_range(svm, address, 1, 1, 0);
-}
-
 /* Pages have been freed at this point */
 static void intel_invalidate_range(struct mmu_notifier *mn,
 				   struct mm_struct *mm,
@@ -227,7 +219,6 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 static const struct mmu_notifier_ops intel_mmuops = {
 	.release = intel_mm_release,
-	.change_pte = intel_change_pte,
 	.invalidate_range = intel_invalidate_range,
 };
-- 
2.17.1
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
AMD IOMMU driver is using the clear_flush_young() to do cache flushing but that's actually already covered by invalidate_range(). Remove the extra notifier and the chunks.

Signed-off-by: Peter Xu
---
 drivers/iommu/amd_iommu_v2.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 23dae9348ace..5d7ef750e4a0 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -370,29 +370,6 @@ static struct pasid_state *mn_to_state(struct mmu_notifier *mn)
 	return container_of(mn, struct pasid_state, mn);
 }
 
-static void __mn_flush_page(struct mmu_notifier *mn,
-			    unsigned long address)
-{
-	struct pasid_state *pasid_state;
-	struct device_state *dev_state;
-
-	pasid_state = mn_to_state(mn);
-	dev_state   = pasid_state->device_state;
-
-	amd_iommu_flush_page(dev_state->domain, pasid_state->pasid, address);
-}
-
-static int mn_clear_flush_young(struct mmu_notifier *mn,
-				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end)
-{
-	for (; start < end; start += PAGE_SIZE)
-		__mn_flush_page(mn, start);
-
-	return 0;
-}
-
 static void mn_invalidate_range(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end)
@@ -430,7 +407,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 static const struct mmu_notifier_ops iommu_mn = {
 	.release		= mn_release,
-	.clear_flush_young	= mn_clear_flush_young,
 	.invalidate_range	= mn_invalidate_range,
 };
-- 
2.17.1
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 0/2] Some MMU notifier cleanups for Intel/AMD IOMMU
Recently while reading the mmu notifiers I noticed that both the Intel and AMD IOMMU drivers seem to have redundancies in using the MMU notifiers. It can also be seen as a follow-up of commit 8301da53fbc1 ("iommu/amd: Remove change_pte mmu_notifier call-back", 2014-07-30).

I don't have the hardware to test them, but they compile well. Please have a look, thanks.

Peter Xu (2):
  iommu/vt-d: Remove change_pte notifier
  iommu/amd: Remove clear_flush_young notifier

 drivers/iommu/amd_iommu_v2.c | 24 ------------------------
 drivers/iommu/intel-svm.c    |  9 ---------
 2 files changed, 33 deletions(-)
-- 
2.17.1
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 3/9] iommu/vt-d: Apply global PASID in SVA
On Sat, Jul 14, 2018 at 03:46:56PM +0800, Lu Baolu wrote: > This patch applies the global pasid name space in the shared > virtual address (SVA) implementation. > > Cc: Ashok Raj > Cc: Jacob Pan > Cc: Kevin Tian > Cc: Liu Yi L > Signed-off-by: Lu Baolu > Reviewed-by: Kevin Tian > Reviewed-by: Liu Yi L Reviewed-by: Peter Xu -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 2/9] iommu/vt-d: Avoid using idr_for_each_entry()
On Sat, Jul 14, 2018 at 03:46:55PM +0800, Lu Baolu wrote:
> idr_for_each_entry() is used to iterate over idr elements
> of a given type. It isn't suitable for the global pasid idr
> since the pasid idr consumer could specify different types
> of pointers to bind with a pasid.
>
> Cc: Ashok Raj
> Cc: Jacob Pan
> Cc: Kevin Tian
> Cc: Liu Yi L
> Signed-off-by: Lu Baolu
> Reviewed-by: Kevin Tian
> Reviewed-by: Liu Yi L

Reviewed-by: Peter Xu

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 1/9] iommu/vt-d: Global PASID name space
On Sat, Jul 14, 2018 at 03:46:54PM +0800, Lu Baolu wrote:
> This adds the system-wide PASID name space for the PASID
> allocation. Currently we are using per-IOMMU PASID name
> spaces, which are not suitable for some use cases. For
> example, one application (associated with a PASID) might
> talk to two physical devices simultaneously while the two
> devices could reside behind two different IOMMU units.
>
> Cc: Ashok Raj
> Cc: Jacob Pan
> Cc: Kevin Tian
> Cc: Liu Yi L
> Suggested-by: Ashok Raj
> Signed-off-by: Lu Baolu
> Reviewed-by: Kevin Tian
> Reviewed-by: Liu Yi L

Reviewed-by: Peter Xu

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 9/9] iommu/vt-d: Remove the obsolete per iommu pasid tables
On Fri, Jul 13, 2018 at 09:34:30AM +0800, Lu Baolu wrote: > Hi Peter, > > Do you have further comments for this series? I will work out a > new version if there are no ones. Not for now. Though since you mentioned about the work to remove the pasid state table, not sure whether it'll be nice to put that into this series, or even put more patches that are directly related (e.g. including the two-level pasid table work and anything related) so the series could be more complete. > > Thank you for reviewing my patches. Do you allow me to add your > reviewed-by to the patches which you've reviewed? I'm glad to read your work. I didn't leave r-bs because I'm not that familiar with the codes so I'm not confident to leave any, just to raise the questions. Please feel free to post a new version as you wish, I'll try to leave r-b there if possible. Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 6/9] iommu/vt-d: Per PCI device pasid table interfaces
On Wed, Jul 11, 2018 at 03:26:21PM +0800, Lu Baolu wrote:

[...]

> >> +int intel_pasid_alloc_table(struct device *dev)
> >> +{
> >> +	struct device_domain_info *info;
> >> +	struct pasid_table *pasid_table;
> >> +	struct pasid_table_opaque data;
> >> +	struct page *pages;
> >> +	size_t size, count;
> >> +	int ret, order;
> >> +
> >> +	info = dev->archdata.iommu;
> >> +	if (WARN_ON(!info || !dev_is_pci(dev) ||
> >> +		    !info->pasid_supported || info->pasid_table))
> >> +		return -EINVAL;
> >> +
> >> +	/* DMA alias device already has a pasid table, use it: */
> >> +	data.pasid_table = &pasid_table;
> >> +	ret = pci_for_each_dma_alias(to_pci_dev(dev),
> >> +				     &get_alias_pasid_table, &data);
> >> +	if (ret)
> >> +		goto attach_out;
> >> +
> >> +	pasid_table = kzalloc(sizeof(*pasid_table), GFP_ATOMIC);
> >
> > Do we need to take some lock here (e.g., the pasid lock)? Otherwise
> > what if two devices (that are sharing the same DMA alias) call the
> > function intel_pasid_alloc_table() concurrently, then could it be
> > possible that we create one table for each of the devices while AFAIU
> > we should let them share a single pasid table?
>
> The only place where this function is called is in a single-thread context
> (protected by a spinlock of device_domain_lock with local interrupt disabled).
>
> So we don't need an extra lock here. But anyway, I should put a comment
> here.

Yeah, that would be nice too! Or add a comment for both of the functions:

	/* Must be with device_domain_lock held */

Regards,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/9] iommu/vt-d: Global PASID name space
On Mon, Jul 09, 2018 at 01:22:50PM +0800, Lu Baolu wrote:

[...]

> +#ifndef __INTEL_PASID_H
> +#define __INTEL_PASID_H
> +
> +#define PASID_MIN	0x1
> +#define PASID_MAX	0x20000

Could I ask whether there's a reason to explicitly use 0x20000 for the max value? Asked since I saw that the example in the spec gave 20 bits for PASID (please refer to spec ver 3.0 section 3.4.3 figure 3-8). Also I believe that's what I was told by Kevin.

I saw that the old per-iommu max value is set to 0x20000 as well, though I'm not sure whether that's still needed, since if we're going to have a two-level pasid table then AFAIU we don't need physically contiguous memory any more (though I saw that we don't yet have the two-level pasid table implemented):

	/* Eventually I'm promised we will get a multi-level PASID table
	 * and it won't have to be physically contiguous. Until then,
	 * limit the size because 8MiB contiguous allocations can be hard
	 * to come by. The limit of 0x20000, which is 1MiB for each of
	 * the PASID and PASID-state tables, is somewhat arbitrary. */
	if (iommu->pasid_max > 0x20000)
		iommu->pasid_max = 0x20000;

Thanks,

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 9/9] iommu/vt-d: Remove the obsolete per iommu pasid tables
ach_device_domain(int (*fn)(struct > device_domain_info *info, >void *data), void *data); > > #ifdef CONFIG_INTEL_IOMMU_SVM > -extern int intel_svm_alloc_pasid_tables(struct intel_iommu *iommu); > -extern int intel_svm_free_pasid_tables(struct intel_iommu *iommu); > +int intel_svm_init(struct intel_iommu *iommu); > +int intel_svm_exit(struct intel_iommu *iommu); > extern int intel_svm_enable_prq(struct intel_iommu *iommu); > extern int intel_svm_finish_prq(struct intel_iommu *iommu); > > -- > 2.7.4 > > ___ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 6/9] iommu/vt-d: Per PCI device pasid table interfaces
val; > -}; > - > struct pasid_state_entry { > u64 val; > }; > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h > index 4fd4c6f..e7901d4 100644 > --- a/include/linux/intel-iommu.h > +++ b/include/linux/intel-iommu.h > @@ -476,6 +476,7 @@ struct intel_iommu { > struct device_domain_info { > struct list_head link; /* link to domain siblings */ > struct list_head global; /* link to global list */ > + struct list_head table; /* link to pasid table */ > u8 bus; /* PCI bus number */ > u8 devfn; /* PCI devfn number */ > u16 pfsid; /* SRIOV physical function source ID */ > @@ -489,6 +490,7 @@ struct device_domain_info { > struct device *dev; /* it's NULL for PCIe-to-PCI bridge */ > struct intel_iommu *iommu; /* IOMMU used by this device */ > struct dmar_domain *domain; /* pointer to domain */ > + struct pasid_table *pasid_table; /* pasid table */ > }; > > static inline void __iommu_flush_cache( > -- > 2.7.4 > > ___ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 2/2] iommu/vt-d: Fix iotlb psi missing for mappings
When caching mode is enabled for IOMMU, we should send explicit IOTLB PSIs even for newly created mappings. However these events are missing for all intel_iommu_map() callers, e.g., iommu_map(). One direct user is the vfio-pci driver.

To make sure we always send the PSIs when necessary, this patch first introduces a domain_mapping() helper for page mappings, then fixes the problem by generalizing the explicit map IOTLB PSI logic into that new helper. With that, we let iommu_domain_identity_map() use the simplified version to avoid sending the notifications, while for all the rest of the cases we always send the notifications.

For the VM case, we send the PSIs to all the backend IOMMUs for the domain.

This patch allows nested device assignment to work with QEMU (assign a device first to an L1 guest, then assign it again to an L2 guest).

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 43 ++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 13190a54aba2..601d3789211f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2352,18 +2352,47 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 	return 0;
 }
 
+static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
+			  struct scatterlist *sg, unsigned long phys_pfn,
+			  unsigned long nr_pages, int prot)
+{
+	int ret;
+	struct intel_iommu *iommu;
+
+	/* Do the real mapping first */
+	ret = __domain_mapping(domain, iov_pfn, sg, phys_pfn, nr_pages, prot);
+	if (ret)
+		return ret;
+
+	/* Notify about the new mapping */
+	if (domain_type_is_vm(domain)) {
+		/* VM typed domains can have more than one IOMMUs */
+		int iommu_id;
+		for_each_domain_iommu(iommu_id, domain) {
+			iommu = g_iommus[iommu_id];
+			__mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
+		}
+	} else {
+		/* General domains only have one IOMMU */
+		iommu = domain_get_iommu(domain);
+		__mapping_notify_one(iommu, domain, iov_pfn, nr_pages);
+	}
+
+	return 0;
+}
+
 static inline int domain_sg_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 				    struct scatterlist *sg, unsigned long nr_pages,
 				    int prot)
 {
-	return __domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot);
+	return domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot);
 }
 
 static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 				     unsigned long phys_pfn, unsigned long nr_pages,
 				     int prot)
 {
-	return __domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, prot);
+	return domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, prot);
 }
 
 static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
@@ -2668,9 +2697,9 @@ static int iommu_domain_identity_map(struct dmar_domain *domain,
 	 */
 	dma_pte_clear_range(domain, first_vpfn, last_vpfn);
 
-	return domain_pfn_mapping(domain, first_vpfn, first_vpfn,
-				  last_vpfn - first_vpfn + 1,
-				  DMA_PTE_READ|DMA_PTE_WRITE);
+	return __domain_mapping(domain, first_vpfn, NULL,
+				first_vpfn, last_vpfn - first_vpfn + 1,
+				DMA_PTE_READ|DMA_PTE_WRITE);
 }
 
 static int domain_prepare_identity_map(struct device *dev,
@@ -3637,8 +3666,6 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 	if (ret)
 		goto error;
 
-	__mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size);
-
 	start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT;
 	start_paddr += paddr & ~PAGE_MASK;
 	return start_paddr;
@@ -3825,8 +3852,6 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele
 		return 0;
 	}
 
-	__mapping_notify_one(iommu, domain, start_vpfn, size);
-
 	return nelems;
 }
-- 
2.17.0
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 0/2] iommu/vt-d: Fix mapping PSI missing for iommu_map()
v3:
- drop the pr_debug patch [Joerg]
- rename all the subjects as suggested [Joerg]
- rebase

v2:
- cc correct people and iommu list

(PSI stands for: Page Selective Invalidation)

Intel IOMMU has a caching mode to ease emulation of the device. When that bit is set, we need to send PSIs even for newly mapped pages. However the current driver does not fully obey the rule. E.g., the iommu_map() API will only do the mapping but it never sends the PSIs. That can be problematic for emulated IOMMU devices, since they'll never be able to build up their shadow page tables without such information. This patchset tries to fix the problem.

Patch 1 introduces a helper to notify the MAP PSIs.

Patch 2 fixes the real problem by making sure every domain mapping will trigger the MAP PSI notifications.

Without the patchset, nested device assignment (assign one device first to an L1 guest, then to an L2 guest) won't work for QEMU. After applying the patchset, it works.

Please review. Thanks.

Peter Xu (2):
  iommu/vt-d: Introduce __mapping_notify_one()
  iommu/vt-d: Fix iotlb psi missing for mappings

 drivers/iommu/intel-iommu.c | 65 ++++++++++++++++++++++++++-----------
 1 file changed, 46 insertions(+), 19 deletions(-)
-- 
2.17.0
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v3 1/2] iommu/vt-d: Introduce __mapping_notify_one()
Introduce this new helper to notify one newly created mapping on one single IOMMU. We can further leverage this helper in the next patch. Signed-off-by: Peter Xu <pet...@redhat.com> --- drivers/iommu/intel-iommu.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 749d8f235346..13190a54aba2 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1606,6 +1606,18 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, iommu_flush_dev_iotlb(domain, addr, mask); } +/* Notification for newly created mappings */ +static inline void __mapping_notify_one(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages) +{ + /* It's a non-present to present mapping. Only flush if caching mode */ + if (cap_caching_mode(iommu->cap)) + iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1); + else + iommu_flush_write_buffer(iommu); +} + static void iommu_flush_iova(struct iova_domain *iovad) { struct dmar_domain *domain; @@ -3625,13 +3637,7 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, if (ret) goto error; - /* it's a non-present to present mapping. Only flush if caching mode */ - if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain, - mm_to_dma_pfn(iova_pfn), - size, 0, 1); - else - iommu_flush_write_buffer(iommu); + __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size); start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT; start_paddr += paddr & ~PAGE_MASK; @@ -3819,11 +3825,7 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele return 0; } - /* it's a non-present to present mapping. 
Only flush if caching mode */ - if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain, start_vpfn, size, 0, 1); - else - iommu_flush_write_buffer(iommu); + __mapping_notify_one(iommu, domain, start_vpfn, size); return nelems; } -- 2.17.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 2/3] intel-iommu: generalize __mapping_notify_one()
On Thu, May 03, 2018 at 03:14:24PM +0200, Joerg Roedel wrote:
> On Wed, Apr 18, 2018 at 04:39:52PM +0800, Peter Xu wrote:
> > Generalize this new helper to notify one newly created mapping on one
> > single IOMMU. We can further leverage this helper in the next patch.
> 
> You introduce the function, you do not generalize it. Please fix that in
> the subject and description.
> 
> Also, please drop patch 1, I don't like this pr_debug stuff in the
> code. And please also fix the subject line in general to match the form:
> 
> 	iommu/vt-d: _I_ntroduce __mapping_notify_one()
> 
> When that is done, please re-send.

Will do. Thanks, Joerg.

-- 
Peter Xu
[PATCH v2 2/3] intel-iommu: generalize __mapping_notify_one()
Generalize this new helper to notify one newly created mapping on one single IOMMU. We can further leverage this helper in the next patch. Signed-off-by: Peter Xu <pet...@redhat.com> --- drivers/iommu/intel-iommu.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index a64da83e867c..bf111e60857c 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1607,6 +1607,18 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, iommu_flush_dev_iotlb(domain, addr, mask); } +/* Notification for newly created mappings */ +static inline void __mapping_notify_one(struct intel_iommu *iommu, + struct dmar_domain *domain, + unsigned long pfn, unsigned int pages) +{ + /* It's a non-present to present mapping. Only flush if caching mode */ + if (cap_caching_mode(iommu->cap)) + iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1); + else + iommu_flush_write_buffer(iommu); +} + static void iommu_flush_iova(struct iova_domain *iovad) { struct dmar_domain *domain; @@ -3626,13 +3638,7 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, if (ret) goto error; - /* it's a non-present to present mapping. Only flush if caching mode */ - if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain, - mm_to_dma_pfn(iova_pfn), - size, 0, 1); - else - iommu_flush_write_buffer(iommu); + __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size); start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT; start_paddr += paddr & ~PAGE_MASK; @@ -3851,11 +3857,7 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele return 0; } - /* it's a non-present to present mapping. 
Only flush if caching mode */ - if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain, start_vpfn, size, 0, 1); - else - iommu_flush_write_buffer(iommu); + __mapping_notify_one(iommu, domain, start_vpfn, size); return nelems; } -- 2.14.3 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 3/3] intel-iommu: fix iotlb psi missing for mappings
When caching mode is enabled for IOMMU, we should send explicit IOTLB PSIs even for newly created mappings. However these events are missing for all intel_iommu_map() callers, e.g., iommu_map(). One direct user is the vfio-pci driver. To make sure we'll send the PSIs always when necessary, this patch firstly introduced domain_mapping() helper for page mappings, then fixed the problem by generalizing the explicit map IOTLB PSI logic into that new helper. With that, we let iommu_domain_identity_map() to use the simplified version to avoid sending the notifications, while for all the rest of cases we send the notifications always. For VM case, we send the PSIs to all the backend IOMMUs for the domain. This patch allows the nested device assignment to work with QEMU (assign device firstly to L1 guest, then assign it again to L2 guest). Signed-off-by: Peter Xu <pet...@redhat.com> --- drivers/iommu/intel-iommu.c | 43 ++- 1 file changed, 34 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index bf111e60857c..eb0f0911342f 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -2353,18 +2353,47 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, return 0; } +static int domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn, + struct scatterlist *sg, unsigned long phys_pfn, + unsigned long nr_pages, int prot) +{ + int ret; + struct intel_iommu *iommu; + + /* Do the real mapping first */ + ret = __domain_mapping(domain, iov_pfn, sg, phys_pfn, nr_pages, prot); + if (ret) + return ret; + + /* Notify about the new mapping */ + if (domain_type_is_vm(domain)) { + /* VM typed domains can have more than one IOMMUs */ + int iommu_id; + for_each_domain_iommu(iommu_id, domain) { + iommu = g_iommus[iommu_id]; + __mapping_notify_one(iommu, domain, iov_pfn, nr_pages); + } + } else { + /* General domains only have one IOMMU */ + iommu = domain_get_iommu(domain); + 
__mapping_notify_one(iommu, domain, iov_pfn, nr_pages); + } + + return 0; +} + static inline int domain_sg_mapping(struct dmar_domain *domain, unsigned long iov_pfn, struct scatterlist *sg, unsigned long nr_pages, int prot) { - return __domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot); + return domain_mapping(domain, iov_pfn, sg, 0, nr_pages, prot); } static inline int domain_pfn_mapping(struct dmar_domain *domain, unsigned long iov_pfn, unsigned long phys_pfn, unsigned long nr_pages, int prot) { - return __domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, prot); + return domain_mapping(domain, iov_pfn, NULL, phys_pfn, nr_pages, prot); } static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn) @@ -2669,9 +2698,9 @@ static int iommu_domain_identity_map(struct dmar_domain *domain, */ dma_pte_clear_range(domain, first_vpfn, last_vpfn); - return domain_pfn_mapping(domain, first_vpfn, first_vpfn, - last_vpfn - first_vpfn + 1, - DMA_PTE_READ|DMA_PTE_WRITE); + return __domain_mapping(domain, first_vpfn, NULL, + first_vpfn, last_vpfn - first_vpfn + 1, + DMA_PTE_READ|DMA_PTE_WRITE); } static int domain_prepare_identity_map(struct device *dev, @@ -3638,8 +3667,6 @@ static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr, if (ret) goto error; - __mapping_notify_one(iommu, domain, mm_to_dma_pfn(iova_pfn), size); - start_paddr = (phys_addr_t)iova_pfn << PAGE_SHIFT; start_paddr += paddr & ~PAGE_MASK; return start_paddr; @@ -3857,8 +3884,6 @@ static int intel_map_sg(struct device *dev, struct scatterlist *sglist, int nele return 0; } - __mapping_notify_one(iommu, domain, start_vpfn, size); - return nelems; } -- 2.14.3 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 1/3] intel-iommu: add some traces for PSIs
It is helpful to debug and triage PSI notification missings. Signed-off-by: Peter Xu <pet...@redhat.com> --- drivers/iommu/dmar.c| 3 +++ drivers/iommu/intel-iommu.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c index 9a7ffd13c7f0..62ae26c3f7b7 100644 --- a/drivers/iommu/dmar.c +++ b/drivers/iommu/dmar.c @@ -1325,6 +1325,9 @@ void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, struct qi_desc desc; int ih = 0; + pr_debug("%s: iommu %d did %u addr 0x%llx order %u type %llx\n", +__func__, iommu->seq_id, did, addr, size_order, type); + if (cap_write_drain(iommu->cap)) dw = 1; diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 582fd01cb7d1..a64da83e867c 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -1396,6 +1396,9 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 val = 0, val_iva = 0; unsigned long flag; + pr_debug("%s: iommu %d did %u addr 0x%llx order %u type %llx\n", +__func__, iommu->seq_id, did, addr, size_order, type); + switch (type) { case DMA_TLB_GLOBAL_FLUSH: /* global flush doesn't need set IVA_REG */ -- 2.14.3 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 0/3] intel-iommu: fix mapping PSI missing for iommu_map()
v2:
- cc correct people and iommu list

(PSI stands for: Page Selective Invalidations)

Intel IOMMU has a caching mode to ease emulation of the device. When
that bit is set, we need to send PSIs even for newly mapped pages.
However, the current driver does not fully obey that rule: e.g., the
iommu_map() API only does the mapping but never sends the PSIs. That
can be problematic for emulated IOMMU devices, since they will never be
able to build up their shadow page tables without such information.

This patchset fixes the problem. Patch 1 is a tracing enhancement that
helped me triage the problem; it might even be useful in the future.
Patch 2 generalizes a helper to notify the MAP PSIs. Patch 3 fixes the
real problem by making sure every domain mapping triggers the MAP PSI
notifications.

Without the patchset, nested device assignment (assigning a device
first to an L1 guest, then to an L2 guest) does not work with QEMU.
After applying the patchset, it works.

Please review. Thanks.

Peter Xu (3):
  intel-iommu: add some traces for PSIs
  intel-iommu: generalize __mapping_notify_one()
  intel-iommu: fix iotlb psi missing for mappings

 drivers/iommu/dmar.c        |  3 ++
 drivers/iommu/intel-iommu.c | 68 -
 2 files changed, 52 insertions(+), 19 deletions(-)

-- 
2.14.3
[PATCH] iommu/vt-d: use domain instead of cache fetching
After commit a1ddcbe93010 ("iommu/vt-d: Pass dmar_domain directly into
iommu_flush_iotlb_psi", 2015-08-12), we have the domain pointer as a
parameter to iommu_flush_iotlb_psi(), so there is no need to fetch it
from the cache again.

More importantly, a NULL pointer dereference bug was reported on RHEL7
(and it can be reproduced on some old upstream kernels too, e.g., v4.13)
by unplugging a 40G NIC from a VM (unplug is hard to test on a real
host, but it should be the same):

https://bugzilla.redhat.com/show_bug.cgi?id=1531367

[   24.391863] pciehp :00:03.0:pcie004: Slot(0): Attention button pressed
[   24.393442] pciehp :00:03.0:pcie004: Slot(0): Powering off due to button press
[   29.721068] i40evf :01:00.0: Unable to send opcode 2 to PF, err I40E_ERR_QUEUE_EMPTY, aq_err OK
[   29.783557] iommu: Removing device :01:00.0 from group 3
[   29.784662] BUG: unable to handle kernel NULL pointer dereference at 0304
[   29.785817] IP: iommu_flush_iotlb_psi+0xcf/0x120
[   29.786486] PGD 0
[   29.786487] P4D 0
[   29.786812]
[   29.787390] Oops: [#1] SMP
[   29.787876] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_ng
[   29.795371] CPU: 0 PID: 156 Comm: kworker/0:2 Not tainted 4.13.0 #14
[   29.796366] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-1.el7 04/01/2014
[   29.797593] Workqueue: pciehp-0 pciehp_power_thread
[   29.798328] task: 94f5745b4a00 task.stack: b326805ac000
[   29.799178] RIP: 0010:iommu_flush_iotlb_psi+0xcf/0x120
[   29.799919] RSP: 0018:b326805afbd0 EFLAGS: 00010086
[   29.800666] RAX: 94f5bc56e800 RBX:  RCX: 00020025
[   29.801667] RDX: 94f5bc56e000 RSI: 0082 RDI:
[   29.802755] RBP: b326805afbf8 R08:  R09: 94f5bc86bbf0
[   29.803772] R10: b326805afba8 R11: 000ffdc4 R12: 94f5bc86a400
[   29.804789] R13:  R14: ffdc4000 R15:
[   29.805792] FS: () GS:94f5bfc0() knlGS:
[   29.806923] CS: 0010 DS: ES: CR0: 80050033
[   29.807736] CR2: 0304 CR3: 3499d000 CR4: 06f0
[   29.808747] Call Trace:
[   29.809156]  flush_unmaps_timeout+0x126/0x1c0
[   29.809800]  domain_exit+0xd6/0x100
[   29.810322]  device_notifier+0x6b/0x70
[   29.810902]  notifier_call_chain+0x4a/0x70
[   29.812822]  __blocking_notifier_call_chain+0x47/0x60
[   29.814499]  blocking_notifier_call_chain+0x16/0x20
[   29.816137]  device_del+0x233/0x320
[   29.817588]  pci_remove_bus_device+0x6f/0x110
[   29.819133]  pci_stop_and_remove_bus_device+0x1a/0x20
[   29.820817]  pciehp_unconfigure_device+0x7a/0x1d0
[   29.822434]  pciehp_disable_slot+0x52/0xe0
[   29.823931]  pciehp_power_thread+0x8a/0xa0
[   29.825411]  process_one_work+0x18c/0x3a0
[   29.826875]  worker_thread+0x4e/0x3b0
[   29.828263]  kthread+0x109/0x140
[   29.829564]  ? process_one_work+0x3a0/0x3a0
[   29.831081]  ? kthread_park+0x60/0x60
[   29.832464]  ret_from_fork+0x25/0x30
[   29.833794] Code: 85 ed 74 0b 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 8b 54 24 60 44 89 f8 0f b6 c4 48 8b 04 c2 48 85 c0 74 49 45 0f b6 ff 4a 8b 3c f8 <80> bf
[   29.838514] RIP: iommu_flush_iotlb_psi+0xcf/0x120 RSP: b326805afbd0
[   29.840362] CR2: 0304
[   29.841716] ---[ end trace b10ec0d6900868d3 ]---

This patch fixes that problem when applied to the v4.13 kernel.

The bug does not exist in the latest upstream kernel, since it was fixed
as a side effect of commit 13cf01744608 ("iommu/vt-d: Make use of iova
deferred flushing", 2017-08-15). But IMHO it is still good to have this
patch upstream.

CC: Alex Williamson <alex.william...@redhat.com>
Signed-off-by: Peter Xu <pet...@redhat.com>
---
 drivers/iommu/intel-iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4a2de34895ec..869f37d1f3b7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1601,8 +1601,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
 	 * flush. However, device IOTLB doesn't need to be flushed in this case.
 	 */
 	if (!cap_caching_mode(iommu->cap) || !map)
-		iommu_flush_dev_iotlb(get_iommu_domain(iommu, did),
-				      addr, mask);
+		iommu_flush_dev_iotlb(domain, addr, mask);
 }
 
 static void iommu_flush_iova(struct iova_domain *iovad)
-- 
2.14.3
Re: [PATCH] iommu/vt-d: Fix shift overflow in qi_flush_dev_iotlb
On Wed, Dec 13, 2017 at 11:31:02AM -0600, Hook, Gary wrote: > On 12/13/2017 11:15 AM, Alex Williamson wrote: > > On Wed, 13 Dec 2017 10:41:47 -0600 > > "Hook, Gary" <gh...@amd.com> wrote: > > > > > On 12/13/2017 9:58 AM, Alex Williamson wrote: > > > > On Wed, 13 Dec 2017 15:13:55 +0800 > > > > Peter Xu <pet...@redhat.com> wrote: > > > > > On Tue, Dec 12, 2017 at 03:43:08PM -0700, Alex Williamson wrote: > > > > > > > > > > [...] > > > > > > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c > > > > > > index 9a7ffd13c7f0..87888b102057 100644 > > > > > > --- a/drivers/iommu/dmar.c > > > > > > +++ b/drivers/iommu/dmar.c > > > > > > @@ -1345,7 +1345,9 @@ void qi_flush_dev_iotlb(struct intel_iommu > > > > > > *iommu, u16 sid, u16 qdep, > > > > > > struct qi_desc desc; > > > > > > if (mask) { > > > > > > - BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1)); > > > > > > + BUG_ON((mask > MAX_AGAW_PFN_WIDTH) || > > > > > > + ((mask == MAX_AGAW_PFN_WIDTH) && addr) || > > > > > > + (addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1))); > > > > > > > > > > Could it work if we just use 1ULL instead of 1 here? Thanks, > > > > > > > > In either case we're talking about shifting off the end of the > > > > variable, which I understand to be undefined. Right? Thanks, > > > > > > How so? Bits fall off the left (MSB) end, zeroes fill in the right (LSB) > > > end. I believe that behavior is pretty set. > > > > Maybe I'm relying too much on stackoverflow, but: > > > > https://stackoverflow.com/questions/11270492/what-does-the-c-standard-say-about-bitshifting-more-bits-than-the-width-of-type > > No, probably not. I don't have my copy of c99 handy, so can't check it. But > it is beyond me why any compiler implementation would choose to use a rotate > instead of a shift... probably a performance issue. > > So, yeah, when you have silly parameters, you get what you get. > > I'll stick to my suggestion. Which seems unambiguous... but I could be > wrong. 
Hi, Alex, Hook,

I did a quick test:

xz-mi:tmp $ cat a.c
#include <stdio.h>

void main(void)
{
    unsigned long long val = 0x8000000000000001ULL;
    int shift;

    printf("origin: 0x%llx\n", val);
    shift = 1;
    printf("shl 1: 0x%llx\n", val << shift);
    shift = 62;
    printf("shl 62: 0x%llx\n", val << shift);
    shift = 63;
    printf("shl 63: 0x%llx\n", val << shift);
    shift = 64;
    printf("shl 64: 0x%llx\n", val << shift);
    shift = 65;
    printf("shl 65: 0x%llx\n", val << shift);
}
xz-mi:tmp $ gcc -std=c99 a.c
xz-mi:tmp $ ./a.out
origin: 0x8000000000000001
shl 1: 0x2
shl 62: 0x4000000000000000
shl 63: 0x8000000000000000
shl 64: 0x8000000000000001
shl 65: 0x2
xz-mi:tmp $ objdump -d a.out | grep -A20 "<main>"
00000000004004d7 <main>:
  4004d7: 55                      push   %rbp
  4004d8: 48 89 e5                mov    %rsp,%rbp
  4004db: 48 83 ec 10             sub    $0x10,%rsp
  4004df: 48 b8 01 00 00 00 00    movabs $0x8000000000000001,%rax
  4004e6: 00 00 80
  4004e9: 48 89 45 f8             mov    %rax,-0x8(%rbp)
  4004ed: 48 8b 45 f8             mov    -0x8(%rbp),%rax
  4004f1: 48 89 c6                mov    %rax,%rsi
  4004f4: bf 60 06 40 00          mov    $0x400660,%edi
  4004f9: b8 00 00 00 00          mov    $0x0,%eax
  4004fe: e8 ed fe ff ff          callq  4003f0 <printf@plt>
  400503: c7 45 f4 01 00 00 00    movl   $0x1,-0xc(%rbp)
  40050a: 8b 45 f4                mov    -0xc(%rbp),%eax
  40050d: 48 8b 55 f8             mov    -0x8(%rbp),%rdx
  400511: 89 c1                   mov    %eax,%ecx
  400513: 48 d3 e2                shl    %cl,%rdx
  400516: 48 89 d0                mov    %rdx,%rax
  400519: 48 89 c6                mov    %rax,%rsi
  40051c: bf 70 06 40 00          mov    $0x400670,%edi
  400521: b8 00 00 00 00          mov    $0x0,%eax

So it is not really a rotation operation; it looks more like the
hardware converts "shift by N" into "shift by N%64" when N>=64.

So now I agree with Alex's change. Thanks all.

-- 
Peter Xu
Re: [PATCH] iommu/vt-d: Fix shift overflow in qi_flush_dev_iotlb
On Tue, Dec 12, 2017 at 03:43:08PM -0700, Alex Williamson wrote:

[...]

> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 9a7ffd13c7f0..87888b102057 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1345,7 +1345,9 @@ void qi_flush_dev_iotlb(struct intel_iommu *iommu, u16 sid, u16 qdep,
>  	struct qi_desc desc;
>  
>  	if (mask) {
> -		BUG_ON(addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1));
> +		BUG_ON((mask > MAX_AGAW_PFN_WIDTH) ||
> +		       ((mask == MAX_AGAW_PFN_WIDTH) && addr) ||
> +		       (addr & ((1 << (VTD_PAGE_SHIFT + mask)) - 1)));

Could it work if we just use 1ULL instead of 1 here? Thanks,

-- 
Peter Xu
Re: [RFC PATCH 03/20] intel_iommu: add "svm" option
On Mon, May 08, 2017 at 10:38:09AM +, Liu, Yi L wrote: > On Thu, 27 Apr 2017 18:53:17 +0800 > Peter Xu <pet...@redhat.com> wrote: > > > On Wed, Apr 26, 2017 at 06:06:33PM +0800, Liu, Yi L wrote: > > > Expose "Shared Virtual Memory" to guest by using "svm" option. > > > Also use "svm" to expose SVM related capabilities to guest. > > > e.g. "-device intel-iommu, svm=on" > > > > > > Signed-off-by: Liu, Yi L <yi.l@linux.intel.com> > > > --- > > > hw/i386/intel_iommu.c | 10 ++ > > > hw/i386/intel_iommu_internal.h | 5 + > > > include/hw/i386/intel_iommu.h | 1 + > > > 3 files changed, 16 insertions(+) > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index > > > bf98fa5..ba1e7eb 100644 > > > --- a/hw/i386/intel_iommu.c > > > +++ b/hw/i386/intel_iommu.c > > > @@ -2453,6 +2453,7 @@ static Property vtd_properties[] = { > > > DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false), > > > DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, > > FALSE), > > > DEFINE_PROP_BOOL("ecs", IntelIOMMUState, ecs, FALSE), > > > +DEFINE_PROP_BOOL("svm", IntelIOMMUState, svm, FALSE), > > > DEFINE_PROP_END_OF_LIST(), > > > }; > > > > > > @@ -2973,6 +2974,15 @@ static void vtd_init(IntelIOMMUState *s) > > > s->ecap |= VTD_ECAP_ECS; > > > } > > > > > > +if (s->svm) { > > > +if (!s->ecs || !x86_iommu->pt_supported || !s->caching_mode) { > > > +error_report("Need to set ecs, pt, caching-mode for svm"); > > > +exit(1); > > > +} > > > +s->cap |= VTD_CAP_DWD | VTD_CAP_DRD; > > > +s->ecap |= VTD_ECAP_PRS | VTD_ECAP_PTS | VTD_ECAP_PASID28; > > > +} > > > + > > > if (s->caching_mode) { > > > s->cap |= VTD_CAP_CM; > > > } > > > diff --git a/hw/i386/intel_iommu_internal.h > > > b/hw/i386/intel_iommu_internal.h index 71a1c1e..f2a7d12 100644 > > > --- a/hw/i386/intel_iommu_internal.h > > > +++ b/hw/i386/intel_iommu_internal.h > > > @@ -191,6 +191,9 @@ > > > #define VTD_ECAP_PT (1ULL << 6) > > > #define VTD_ECAP_MHMV (15ULL << 20) > > > #define VTD_ECAP_ECS(1ULL 
<< 24) > > > +#define VTD_ECAP_PASID28(1ULL << 28) > > > > Could I ask what's this bit? On my spec, it says this bit is reserved and > > defunct (spec > > version: June 2016). > > As Ashok confirmed, yes it should be bit 40. would update it. Ok. > > > > +#define VTD_ECAP_PRS(1ULL << 29) > > > +#define VTD_ECAP_PTS(0xeULL << 35) > > > > Would it better we avoid using 0xe here, or at least add some comment? > > For this value, it must be no more than the bits host supports. So it may be > better to have a default value and meanwhile expose an option to let user > set it. how about your opinion? I think a more important point is that we need to make sure this value is no larger than hardware support? Since you are also working on the vfio interface for virt-svm... would it be possible that we can talk to kernel in some way so that we can know the supported pasid size in host IOMMU? So that when guest specifies something bigger, we can stop the user. I don't know the practical value for this field, if it's static enough, I think it's also okay we make it static here as well. But again, I would prefer at least some comment, like: /* Value N indicates PASID field of N+1 bits, here 0xe stands for.. */ > > > > > > > > > /* CAP_REG */ > > > /* (offset >> 4) << 24 */ > > > @@ -207,6 +210,8 @@ > > > #define VTD_CAP_PSI (1ULL << 39) > > > #define VTD_CAP_SLLPS ((1ULL << 34) | (1ULL << 35)) > > > #define VTD_CAP_CM (1ULL << 7) > > > +#define VTD_CAP_DWD (1ULL << 54) > > > +#define VTD_CAP_DRD (1ULL << 55) > > > > Just to confirm: after this series, we should support drain read/write > > then, right? > > I haven’t done special process against it in IOMMU emulator. It's set to keep > consistence with VT-d spec since DWD and DRW is required capability when > PASID it reported as Set. However, I think it should be fine if guest issue QI > with drain read/write set in the descriptor. Host should be able to process > it. I see. 
IIUC the point here is we need to deliver these requests to host IOMMU, and I guess we need to be able to do this in a synchronous way as well. Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 03/20] intel_iommu: add "svm" option
On Wed, Apr 26, 2017 at 06:06:33PM +0800, Liu, Yi L wrote: > Expose "Shared Virtual Memory" to guest by using "svm" option. > Also use "svm" to expose SVM related capabilities to guest. > e.g. "-device intel-iommu, svm=on" > > Signed-off-by: Liu, Yi L <yi.l@linux.intel.com> > --- > hw/i386/intel_iommu.c | 10 ++ > hw/i386/intel_iommu_internal.h | 5 + > include/hw/i386/intel_iommu.h | 1 + > 3 files changed, 16 insertions(+) > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index bf98fa5..ba1e7eb 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -2453,6 +2453,7 @@ static Property vtd_properties[] = { > DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false), > DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE), > DEFINE_PROP_BOOL("ecs", IntelIOMMUState, ecs, FALSE), > +DEFINE_PROP_BOOL("svm", IntelIOMMUState, svm, FALSE), > DEFINE_PROP_END_OF_LIST(), > }; > > @@ -2973,6 +2974,15 @@ static void vtd_init(IntelIOMMUState *s) > s->ecap |= VTD_ECAP_ECS; > } > > +if (s->svm) { > +if (!s->ecs || !x86_iommu->pt_supported || !s->caching_mode) { > +error_report("Need to set ecs, pt, caching-mode for svm"); > +exit(1); > +} > +s->cap |= VTD_CAP_DWD | VTD_CAP_DRD; > +s->ecap |= VTD_ECAP_PRS | VTD_ECAP_PTS | VTD_ECAP_PASID28; > +} > + > if (s->caching_mode) { > s->cap |= VTD_CAP_CM; > } > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h > index 71a1c1e..f2a7d12 100644 > --- a/hw/i386/intel_iommu_internal.h > +++ b/hw/i386/intel_iommu_internal.h > @@ -191,6 +191,9 @@ > #define VTD_ECAP_PT (1ULL << 6) > #define VTD_ECAP_MHMV (15ULL << 20) > #define VTD_ECAP_ECS(1ULL << 24) > +#define VTD_ECAP_PASID28(1ULL << 28) Could I ask what's this bit? On my spec, it says this bit is reserved and defunct (spec version: June 2016). > +#define VTD_ECAP_PRS(1ULL << 29) > +#define VTD_ECAP_PTS(0xeULL << 35) Would it better we avoid using 0xe here, or at least add some comment? 
> > /* CAP_REG */ > /* (offset >> 4) << 24 */ > @@ -207,6 +210,8 @@ > #define VTD_CAP_PSI (1ULL << 39) > #define VTD_CAP_SLLPS ((1ULL << 34) | (1ULL << 35)) > #define VTD_CAP_CM (1ULL << 7) > +#define VTD_CAP_DWD (1ULL << 54) > +#define VTD_CAP_DRD (1ULL << 55) Just to confirm: after this series, we should support drain read/write then, right? Thanks, > > /* Supported Adjusted Guest Address Widths */ > #define VTD_CAP_SAGAW_SHIFT 8 > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h > index ae21fe5..8981615 100644 > --- a/include/hw/i386/intel_iommu.h > +++ b/include/hw/i386/intel_iommu.h > @@ -267,6 +267,7 @@ struct IntelIOMMUState { > > bool caching_mode; /* RO - is cap CM enabled? */ > bool ecs; /* Extended Context Support */ > +bool svm; /* Shared Virtual Memory */ > > dma_addr_t root;/* Current root table pointer */ > bool root_extended; /* Type of root table (extended or not) > */ > -- > 1.9.1 > -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [Qemu-devel] [RFC PATCH 12/20] Memory: Add func to fire pasidt_bind notifier
On Thu, Apr 27, 2017 at 06:25:37PM +0800, Liu, Yi L wrote: > On Thu, Apr 27, 2017 at 02:14:27PM +0800, Peter Xu wrote: > > On Thu, Apr 27, 2017 at 10:37:19AM +0800, Liu, Yi L wrote: > > > On Wed, Apr 26, 2017 at 03:50:16PM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > On 26/04/2017 12:06, Liu, Yi L wrote: > > > > > +void memory_region_notify_iommu_svm_bind(MemoryRegion *mr, > > > > > + void *data) > > > > > +{ > > > > > +IOMMUNotifier *iommu_notifier; > > > > > +IOMMUNotifierFlag request_flags; > > > > > + > > > > > +assert(memory_region_is_iommu(mr)); > > > > > + > > > > > +/*TODO: support other bind requests with smaller gran, > > > > > + * e.g. bind signle pasid entry > > > > > + */ > > > > > +request_flags = IOMMU_NOTIFIER_SVM_PASIDT_BIND; > > > > > + > > > > > +QLIST_FOREACH(iommu_notifier, >iommu_notify, node) { > > > > > +if (iommu_notifier->notifier_flags & request_flags) { > > > > > +iommu_notifier->notify(iommu_notifier, data); > > > > > +break; > > > > > +} > > > > > +} > > > > > > > > Peter, > > > > > > > > should this reuse ->notify, or should it be different function pointer > > > > in IOMMUNotifier? > > > > > > Hi Paolo, > > > > > > Thx for your review. > > > > > > I think it should be “->notify” here. In this patchset, the new notifier > > > is registered with the existing notifier registration API. So the all the > > > notifiers are in the mr->iommu_notify list. And notifiers are labeled > > > by notify flag, so it is able to differentiate the IOMMUNotifier nodes. > > > When the flag meets, trigger it by “->notify”. The diagram below shows > > > my understanding , wish it helps to make me understood. > > > > > > VFIOContainer > > >| > > >giommu_list(VFIOGuestIOMMU) > > > \ > > > VFIOGuestIOMMU1 -> VFIOGuestIOMMU2 -> VFIOGuestIOMMU3 > > > ... 
> > > | | | > > > mr->iommu_notify: IOMMUNotifier ->IOMMUNotifier -> IOMMUNotifier > > > (Flag:MAP/UNMAP) (Flag:SVM bind) (Flag:tlb > > > invalidate) > > > > > > > > > Actually, compared with the MAP/UNMAP notifier, the newly added notifier > > > has > > > no start/end check, and there may be other types of bind notfier flag in > > > future, so I added a separate fire func for SVM bind notifier. > > > > I agree with Paolo that this interface might not be the suitable place > > for the SVM notifiers (just like what I worried about in previous > > discussions). > > > > The biggest problem is that, if you see current notifier mechanism, > > it's per-memory-region. However iiuc your messages should be > > per-iommu, or say, per translation unit. > > Hi Peter, > > yes, you're right. the newly added notifier is per-iommu. > > > While, for each iommu, there > > can be more than one memory regions (ppc can be an example). When > > there are more than one MRs binded to the same iommu unit, which > > memory region should you register to? Any one of them, or all? > > Honestly, I'm not expert on ppc. According to the current code, > I can only find one MR initialized with memory_region_init_iommu() > in spapr_tce_table_realize(). So to better get your point, let me > check. Do you mean there may be multiple of iommu MRs behind a iommu? I am not either. :) But yes, that's what I mean. At least that's how I understand it. > > I admit it must be considered if there are multiple iommu MRs. I may > choose to register for one of them since the notifier is per-iommu as > you've pointed. Then vIOMMU emulator need to trigger the notifier with > the correct MR. Not sure if ppc vIOMMU is fine with it. > > > So my conclusion is, it just has nothing to do with memory regions... 
> > Instead of a different function pointer in IOMMUNotifer, IMHO we can
> > even move a step further, to isolate IOTLB notifications (targeted at
> > memory regions and with start/end ranges) out of SVM/other
> > notifications, since they are different in general.
Re: [RFC PATCH 02/20] intel_iommu: exposed extended-context mode to guest
bus_num, > return ret_fr; > } > > -if (!vtd_root_entry_present()) { > +if (!vtd_root_entry_present() || > +(s->ecs && (devfn > 0x7f) && (!vtd_root_entry_upper_present( { > /* Not error - it's okay we don't have root entry. */ > trace_vtd_re_not_present(bus_num); > return -VTD_FR_ROOT_ENTRY_P; > -} else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) { > -trace_vtd_re_invalid(re.rsvd, re.val); > -return -VTD_FR_ROOT_ENTRY_RSVD; > +} > +if ((s->ecs && (devfn > 0x7f) && (re.rsvd & VTD_ROOT_ENTRY_RSVD)) || > +(s->ecs && (devfn < 0x80) && (re.val & VTD_ROOT_ENTRY_RSVD)) || > +((!s->ecs) && (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD { > +trace_vtd_re_invalid(re.rsvd, re.val); > +return -VTD_FR_ROOT_ENTRY_RSVD; Nit: I feel like we can better wrap these 0x7f and 0x80 into helper functions, especially if with above structure change... (will hold here...) Thanks, -- Peter Xu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [Qemu-devel] [RFC PATCH 12/20] Memory: Add func to fire pasidt_bind notifier
On Thu, Apr 27, 2017 at 02:14:27PM +0800, Peter Xu wrote:
> On Thu, Apr 27, 2017 at 10:37:19AM +0800, Liu, Yi L wrote:
> > On Wed, Apr 26, 2017 at 03:50:16PM +0200, Paolo Bonzini wrote:
> > >
> > > On 26/04/2017 12:06, Liu, Yi L wrote:
> > > > +void memory_region_notify_iommu_svm_bind(MemoryRegion *mr,
> > > > +                                         void *data)
> > > > +{
> > > > +    IOMMUNotifier *iommu_notifier;
> > > > +    IOMMUNotifierFlag request_flags;
> > > > +
> > > > +    assert(memory_region_is_iommu(mr));
> > > > +
> > > > +    /* TODO: support other bind requests with smaller gran,
> > > > +     * e.g. bind single pasid entry
> > > > +     */
> > > > +    request_flags = IOMMU_NOTIFIER_SVM_PASIDT_BIND;
> > > > +
> > > > +    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > > > +        if (iommu_notifier->notifier_flags & request_flags) {
> > > > +            iommu_notifier->notify(iommu_notifier, data);
> > > > +            break;
> > > > +        }
> > > > +    }
> > >
> > > Peter,
> > >
> > > should this reuse ->notify, or should it be a different function
> > > pointer in IOMMUNotifier?
> >
> > Hi Paolo,
> >
> > Thanks for your review.
> >
> > I think it should be "->notify" here. In this patchset, the new notifier
> > is registered with the existing notifier registration API, so all the
> > notifiers are in the mr->iommu_notify list, and notifiers are labeled
> > by notify flag, so it is possible to differentiate the IOMMUNotifier
> > nodes. When the flag matches, trigger it by "->notify". The diagram
> > below shows my understanding; I hope it helps to make myself understood.
> >
> > VFIOContainer
> >    |
> >    giommu_list(VFIOGuestIOMMU)
> >     \
> >      VFIOGuestIOMMU1 -> VFIOGuestIOMMU2 -> VFIOGuestIOMMU3 ...
> >           |                  |                  |
> > mr->iommu_notify: IOMMUNotifier -> IOMMUNotifier -> IOMMUNotifier
> >                   (Flag:MAP/UNMAP)  (Flag:SVM bind)  (Flag:tlb invalidate)
> >
> > Actually, compared with the MAP/UNMAP notifier, the newly added notifier
> > has no start/end check, and there may be other types of bind notifier
> > flags in future, so I added a separate fire func for the SVM bind
> > notifier.
>
> I agree with Paolo that this interface might not be the suitable place
> for the SVM notifiers (just like what I worried about in previous
> discussions).
>
> The biggest problem is that, if you look at the current notifier
> mechanism, it's per-memory-region. However, IIUC your messages should
> be per-iommu, or say, per translation unit. While, for each iommu,
> there can be more than one memory region (ppc can be an example). When
> there is more than one MR bound to the same iommu unit, which memory
> region should you register to? Any one of them, or all?
>
> So my conclusion is, it just has nothing to do with memory regions...
>
> Instead of a different function pointer in IOMMUNotifier, IMHO we can
> even move a step further, to isolate IOTLB notifications (targeted at
> memory regions and with start/end ranges) out of SVM/other
> notifications, since they are different in general. So we basically
> need two notification mechanisms:
>
> - one for memory regions; currently what I can see is IOTLB
>   notifications
>
> - one for translation units; currently I see all the rest of the
>   notifications needed in virt-svm in this category
>
> Maybe some RFC patches would be good to show what I mean... I'll see
> whether I can prepare some.

Here it is (on qemu-devel):

[RFC PATCH 0/8] IOMMU: introduce common IOMMUObject

Thanks,

-- 
Peter Xu
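The split Peter describes can be sketched roughly as follows. This is a simplified, non-QEMU stand-in (hypothetical type and function names; the actual RFC introduced an IOMMUObject): range-based IOTLB notifiers stay on the memory region, while translation-unit-wide events hang off one per-IOMMU object, so registration is unambiguous even when several MRs sit behind one iommu.

```c
#include <stddef.h>

/* Events scoped to the whole translation unit, not to an MR range. */
typedef enum {
    IOMMU_UNIT_PASIDT_BIND = 1 << 0,
    IOMMU_UNIT_TLB_INV     = 1 << 1,
} IOMMUUnitEvent;

typedef struct IOMMUUnitNotifier IOMMUUnitNotifier;
struct IOMMUUnitNotifier {
    unsigned events;                                /* which events to receive */
    void (*notify)(IOMMUUnitNotifier *n, void *data);
    IOMMUUnitNotifier *next;                        /* QLIST_ENTRY(...) in QEMU */
};

typedef struct {
    /* Exactly one of these per translation unit, however many MRs it has. */
    IOMMUUnitNotifier *notifiers;
} IOMMUObject;

static void iommu_object_register(IOMMUObject *iommu, IOMMUUnitNotifier *n)
{
    n->next = iommu->notifiers;
    iommu->notifiers = n;
}

/* Fire every matching notifier; note there is no start/end range check,
 * unlike the per-MR memory_region_notify_iommu() path. */
static void iommu_object_notify(IOMMUObject *iommu, unsigned event, void *data)
{
    for (IOMMUUnitNotifier *n = iommu->notifiers; n; n = n->next) {
        if (n->events & event) {
            n->notify(n, data);
        }
    }
}

/* Tiny self-check: two notifiers registered, only the matching one fires. */
static int fired;
static void count_cb(IOMMUUnitNotifier *n, void *data)
{
    (void)n; (void)data;
    fired++;
}

static int iommu_unit_demo(void)
{
    IOMMUObject iommu = { NULL };
    IOMMUUnitNotifier bind = { IOMMU_UNIT_PASIDT_BIND, count_cb, NULL };
    IOMMUUnitNotifier inv  = { IOMMU_UNIT_TLB_INV, count_cb, NULL };

    iommu_object_register(&iommu, &bind);
    iommu_object_register(&iommu, &inv);
    iommu_object_notify(&iommu, IOMMU_UNIT_PASIDT_BIND, NULL);
    return fired;
}
```

The point of the design is visible in the types: an IOMMUUnitNotifier carries no address range at all, so questions like "which MR do I register against?" simply do not arise for PASID-table bind or extended TLB invalidate.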
Re: [Qemu-devel] [RFC PATCH 12/20] Memory: Add func to fire pasidt_bind notifier
On Thu, Apr 27, 2017 at 10:37:19AM +0800, Liu, Yi L wrote:
> On Wed, Apr 26, 2017 at 03:50:16PM +0200, Paolo Bonzini wrote:
> >
> > On 26/04/2017 12:06, Liu, Yi L wrote:
> > > +void memory_region_notify_iommu_svm_bind(MemoryRegion *mr,
> > > +                                         void *data)
> > > +{
> > > +    IOMMUNotifier *iommu_notifier;
> > > +    IOMMUNotifierFlag request_flags;
> > > +
> > > +    assert(memory_region_is_iommu(mr));
> > > +
> > > +    /* TODO: support other bind requests with smaller gran,
> > > +     * e.g. bind single pasid entry
> > > +     */
> > > +    request_flags = IOMMU_NOTIFIER_SVM_PASIDT_BIND;
> > > +
> > > +    QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
> > > +        if (iommu_notifier->notifier_flags & request_flags) {
> > > +            iommu_notifier->notify(iommu_notifier, data);
> > > +            break;
> > > +        }
> > > +    }
> >
> > Peter,
> >
> > should this reuse ->notify, or should it be a different function
> > pointer in IOMMUNotifier?
>
> Hi Paolo,
>
> Thanks for your review.
>
> I think it should be "->notify" here. In this patchset, the new notifier
> is registered with the existing notifier registration API, so all the
> notifiers are in the mr->iommu_notify list, and notifiers are labeled
> by notify flag, so it is possible to differentiate the IOMMUNotifier
> nodes. When the flag matches, trigger it by "->notify". The diagram
> below shows my understanding; I hope it helps to make myself understood.
>
> VFIOContainer
>    |
>    giommu_list(VFIOGuestIOMMU)
>     \
>      VFIOGuestIOMMU1 -> VFIOGuestIOMMU2 -> VFIOGuestIOMMU3 ...
>           |                  |                  |
> mr->iommu_notify: IOMMUNotifier -> IOMMUNotifier -> IOMMUNotifier
>                   (Flag:MAP/UNMAP)  (Flag:SVM bind)  (Flag:tlb invalidate)
>
> Actually, compared with the MAP/UNMAP notifier, the newly added notifier
> has no start/end check, and there may be other types of bind notifier
> flags in future, so I added a separate fire func for the SVM bind
> notifier.

I agree with Paolo that this interface might not be the suitable place
for the SVM notifiers (just like what I worried about in previous
discussions).
The biggest problem is that, if you look at the current notifier
mechanism, it's per-memory-region. However, IIUC your messages should
be per-iommu, or say, per translation unit. While, for each iommu,
there can be more than one memory region (ppc can be an example). When
there is more than one MR bound to the same iommu unit, which memory
region should you register to? Any one of them, or all?

So my conclusion is, it just has nothing to do with memory regions...

Instead of a different function pointer in IOMMUNotifier, IMHO we can
even move a step further, to isolate IOTLB notifications (targeted at
memory regions and with start/end ranges) out of SVM/other
notifications, since they are different in general. So we basically
need two notification mechanisms:

- one for memory regions; currently what I can see is IOTLB
  notifications

- one for translation units; currently I see all the rest of the
  notifications needed in virt-svm in this category

Maybe some RFC patches would be good to show what I mean... I'll see
whether I can prepare some.

Thanks,

-- 
Peter Xu
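For contrast, the approach Peter is pushing back on, one shared per-MR notifier list with nodes distinguished only by flag, as in `memory_region_notify_iommu_svm_bind()` quoted above, boils down to this pattern (simplified stand-ins for the QEMU types; the `break` mirrors the patch, which expects at most one SVM-bind consumer):

```c
#include <stddef.h>

typedef enum {
    IOMMU_NOTIFIER_MAP_UNMAP       = 1 << 0,
    IOMMU_NOTIFIER_SVM_PASIDT_BIND = 1 << 1,
} IOMMUNotifierFlag;

typedef struct IOMMUNotifier IOMMUNotifier;
struct IOMMUNotifier {
    IOMMUNotifierFlag notifier_flags;
    void (*notify)(IOMMUNotifier *n, void *data);
    IOMMUNotifier *next;     /* QLIST_ENTRY(...) node in QEMU */
};

/* Mirrors the patch: walk the single per-MR list and fire the first
 * notifier whose flag matches, then stop. */
static void notify_svm_bind(IOMMUNotifier *list, void *data)
{
    for (IOMMUNotifier *n = list; n; n = n->next) {
        if (n->notifier_flags & IOMMU_NOTIFIER_SVM_PASIDT_BIND) {
            n->notify(n, data);
            break;
        }
    }
}

/* Self-check: a MAP/UNMAP node and an SVM-bind node share one list,
 * as in the diagram above; only the SVM-bind node should fire. */
static int bind_calls;
static void bind_cb(IOMMUNotifier *n, void *data)
{
    (void)n; (void)data;
    bind_calls++;
}
static void map_cb(IOMMUNotifier *n, void *data)
{
    (void)n; (void)data;
}

static int svm_bind_demo(void)
{
    IOMMUNotifier bind = { IOMMU_NOTIFIER_SVM_PASIDT_BIND, bind_cb, NULL };
    IOMMUNotifier map  = { IOMMU_NOTIFIER_MAP_UNMAP, map_cb, &bind };

    notify_svm_bind(&map, NULL);
    return bind_calls;
}
```

Flag dispatch does keep the existing registration API, which is Liu's argument; Peter's objection is orthogonal: however the list is filtered, hanging it off a MemoryRegion ties a translation-unit-scoped event to one arbitrary MR.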