Re: [PATCH 0/5] iommu/vt-d: Consolidate various cache flush ops
Hi Jacob,

On 12/3/19 4:02 AM, Jacob Pan wrote:
> On Fri, 22 Nov 2019 11:04:44 +0800 Lu Baolu wrote:
>> Intel VT-d 3.0 introduces more caches and interfaces for software to
>> flush when it runs in the scalable mode. Currently various cache
>> flush helpers are scattered around. This consolidates them by
>> putting them in the existing iommu_flush structure.
>>
>> /* struct iommu_flush - Intel IOMMU cache invalidation ops
>>  *
>>  * @cc_inv: invalidate context cache
>>  * @iotlb_inv: invalidate IOTLB and paging structure caches when
>>  *             software has changed second-level tables
>>  * @p_iotlb_inv: invalidate IOTLB and paging structure caches when
>>  *             software has changed first-level tables
>>  * @pc_inv: invalidate pasid cache
>>  * @dev_tlb_inv: invalidate cached mappings used by
>>  *             requests-without-PASID from the Device-TLB on an
>>  *             endpoint device
>>  * @p_dev_tlb_inv: invalidate cached mappings used by
>>  *             requests-with-PASID from the Device-TLB on an
>>  *             endpoint device
>>  */
>> struct iommu_flush {
>> 	void (*cc_inv)(struct intel_iommu *iommu, u16 did,
>> 		       u16 sid, u8 fm, u64 type);
>> 	void (*iotlb_inv)(struct intel_iommu *iommu, u16 did, u64 addr,
>> 			  unsigned int size_order, u64 type);
>> 	void (*p_iotlb_inv)(struct intel_iommu *iommu, u16 did, u32 pasid,
>> 			    u64 addr, unsigned long npages, bool ih);
>> 	void (*pc_inv)(struct intel_iommu *iommu, u16 did, u32 pasid,
>> 		       u64 granu);
>> 	void (*dev_tlb_inv)(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>> 			    u16 qdep, u64 addr, unsigned int mask);
>> 	void (*p_dev_tlb_inv)(struct intel_iommu *iommu, u16 sid, u16 pfsid,
>> 			      u32 pasid, u16 qdep, u64 addr,
>> 			      unsigned long npages);
>> };
>>
>> The name of each cache flush op is defined according to spec section
>> 6.5, so that people can easily look them up in the spec.
>
> Nice consolidation. For nested SVM, I also introduced cache flush
> helpers as needed. https://lkml.org/lkml/2019/10/24/857
> Should I wait for yours to be merged, or do you want to extend this
> consolidation after SVA/SVM cache flush? I expect to send my v8
> shortly.

Please base your v8 patch on this series, so it could get more chances
for test.

I will queue this patch series for internal test after 5.5-rc1 and, if
everything goes well, I will forward it to Joerg around rc4 for
linux-next.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces
Hi,

On 12/3/19 7:27 AM, Jacob Pan wrote:
> On Thu, 28 Nov 2019 10:25:47 +0800 Lu Baolu wrote:
>> This adds functions to manipulate first level page tables
>> which could be used by a scalable mode capable IOMMU unit.
>
> FL and SL page tables are very similar, and I presume we are not
> using all the flag bits in FL paging structures for DMA mapping. Are
> there enough relevant differences to warrant a new set of helper
> functions for FL? Or we can merge into one.

I have thought about this before, and I am still open to this
suggestion. We did a quick comparison of the two page tables. The only
concern is the read/write/present encoding: the present bit in a first
level entry implies read permission, while a second level entry has an
explicit READ bit. (Recalled from memory; correct me if it's wrong. :-))

Anyway, let's listen to more opinions.

Best regards,
baolu
Re: [PATCH v2 0/8] Use 1st-level for DMA remapping
Hi Jacob,

Thanks for reviewing it.

On 12/3/19 4:19 AM, Jacob Pan wrote:
> On Thu, 28 Nov 2019 10:25:42 +0800 Lu Baolu wrote:
>> Intel VT-d in scalable mode supports two types of page talbes
>
> tables

Got it, thanks!

>> for DMA translation: the first level page table and the second
>> level page table. The first level page table uses the same format
>> as the CPU page table, while the second level page table keeps
>> compatible with previous formats. The software is able to choose
>> any one of them for DMA remapping according to the use case.
>>
>> This patchset aims to move IOVA (I/O Virtual Address) translation
>
> move guest IOVA only, right?

No. In v1, only for guest IOVA. This has been changed since v2
according to comments during the v1 review period. v2 will use first
level for both host and guest unless in nested mode.

>> to 1st-level page table in scalable mode. This will simplify vIOMMU
>> (IOMMU simulated by VM hypervisor) design by using the two-stage
>> translation, a.k.a. nested mode translation.
>>
>> As Intel VT-d architecture offers caching mode, guest IOVA (GIOVA)
>> support is now implemented in a shadow page manner. The device
>> simulation software, like QEMU, has to figure out GIOVA->GPA
>> mappings and write them to a shadowed page table, which will be
>> used by the physical IOMMU. Each time when mappings are created or
>> destroyed in vIOMMU, the simulation software has to intervene.
>> Hence, the changes on GIOVA->GPA could be shadowed to host.
>>
>> [diagram: on an IOTLB flush trap (map/unmap), QEMU computes shadow
>> GIOVA->HPA mappings from the vIOMMU's GIOVA->GPA table and programs
>> the pIOMMU through the VFIO/IOMMU API]
>>
>> In VT-d 3.0, scalable mode is introduced, which offers two-level
>> translation page tables and nested translation mode.
>> Regards to GIOVA support, it can be simplified by 1) moving the
>> GIOVA support over 1st-level page table to store GIOVA->GPA mapping
>> in vIOMMU, 2) binding vIOMMU 1st level page table to the pIOMMU,
>> 3) using pIOMMU second level for GPA->HPA translation, and
>> 4) enabling nested (a.k.a. dual-stage) translation in host.
>> Compared with current shadow GIOVA support, the new approach makes
>> the vIOMMU design simpler and more efficient as we only need to
>> flush the pIOMMU IOTLB and possible device-IOTLB when an IOVA
>> mapping in vIOMMU is torn down.
>>
>> [diagram: on an IOTLB flush trap (unmap), QEMU forwards cache
>> invalidation and guest page table bind requests through the
>> VFIO/IOMMU interfaces; the pIOMMU uses the bound GIOVA->GPA table
>> as the first level and GPA->HPA as the second level]
>>
>> This patch set includes two parts. The former part implements the
>> per-domain page table abstraction, which makes the page table
>> difference transparent to various map/unmap APIs. The later part
>
> s/later/latter/
>
>> applies the first level page table for IOVA translation unless the
>> DOMAIN_ATTR_NESTING domain attribution has been set, which indicates
>> nested mode in use.
>
> Maybe I am reading this wrong, but shouldn't it be the opposite?
> i.e. Use FL page table for IOVA if it is a nesting domain?

My description seems to be a bit confusing. If DOMAIN_ATTR_NESTING is
set for a domain, the second level will be used to map gPA (guest
physical address) to hPA (host physical address), and the mappings
between gVA (guest virtual address) and gPA will be maintained by the
guest, with the page table address bound to the host's first level.
Otherwise, the first level will be used for mapping between gPA and
hPA, or IOVA and DMA address.
Best regards,
baolu

>> Based-on-idea-by: Ashok Raj
>> Based-on-idea-by: Kevin Tian
>> Based-on-idea-by: Liu Yi L
>> Based-on-idea-by: Jacob Pan
>> Based-on-idea-by: Sanjay Kumar
>> Based-on-idea-by: Lu Baolu
>>
>> Change log:
>> v1->v2
>> - The first series was posted here
>>   https://lkml.org/lkml/2019/9/23/297
>> - Use p
Re: dmar pte read access not set error messages on hp dl388 gen8 systems
Hi,

On 12/3/19 12:13 AM, Jerry Snitselaar wrote:
> On Mon Dec 02 19, Jerry Snitselaar wrote:
>> On Mon Dec 02 19, Lu Baolu wrote:
>>> On 12/2/19 2:34 PM, Jerry Snitselaar wrote:
>>>> We are seeing DMAR PTE read access not set errors when booting a
>>>> kernel with default passthrough, both with a test kernel and with
>>>> a 5.4.0 kernel. Previously we would see a number of identity
>>>> mappings being set related to the rmrrs, and now they aren't seen
>>>> and we get the dmar pte errors as devices touch those regions.
>>>> From what I can tell, currently df4f3c603aeb ("iommu/vt-d: Remove
>>>> static identity map code") removed the bit of code in init_dmars
>>>> that used to set up those mappings:
>>>>
>>>> -	/*
>>>> -	 * For each rmrr
>>>> -	 *   for each dev attached to rmrr
>>>> -	 *   do
>>>> -	 *     locate drhd for dev, alloc domain for dev
>>>> -	 *     allocate free domain
>>>> -	 *     allocate page table entries for rmrr
>>>> -	 *     if context not allocated for bus
>>>> -	 *       allocate and init context
>>>> -	 *       set present in root table for this bus
>>>> -	 *     init context with domain, translation etc
>>>> -	 *   endfor
>>>> -	 * endfor
>>>> -	 */
>>>> -	pr_info("Setting RMRR:\n");
>>>> -	for_each_rmrr_units(rmrr) {
>>>> -		/* some BIOS lists non-exist devices in DMAR table. */
>>>> -		for_each_active_dev_scope(rmrr->devices, rmrr->devices_cnt,
>>>> -					  i, dev) {
>>>> -			ret = iommu_prepare_rmrr_dev(rmrr, dev);
>>>> -			if (ret)
>>>> -				pr_err("Mapping reserved region failed\n");
>>>> -		}
>>>> -	}
>>>>
>>>> si_domain_init now has code that sets identity maps for devices in
>>>> rmrrs, but only for certain devices.
>>>
>>> On which device are you seeing this error? Is it a rmrr locked
>>> device?
>>>
>>> Best regards,
>>> baolu
>>
>> Almost all of the messages are for the ilo, but there also is a
>> message for the smart array raid bus controller.
>
> Also seeing it with a dl380 gen9 system, where the raid bus
> controller is getting the error.

Does it help if you remove

	if (device_is_rmrr_locked(dev))
		continue;

in si_domain_init()?

Best regards,
baolu
Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
On Mon, 25 Nov 2019 07:45:18 + "Liu, Yi L" wrote:
> Hi Alex,
>
> Thanks for the review. Here I'd like to conclude the major opens in
> this thread and see if we can get some agreements to prepare a new
> version.
>
> a) IOCTLs for BIND_GPASID and BIND_PROCESS, share a single IOCTL or
>    two separate IOCTLs?
>    Yi: It may be helpful to have separate IOCTLs. The bind data
>    conveyed for BIND_GPASID and BIND_PROCESS are totally different,
>    and the struct iommu_gpasid_bind_data has vendor specific data and
>    may even have more versions in future. To better maintain it, I
>    guess separate IOCTLs for the two bind types would be better. The
>    structure for BIND_GPASID is as below:
>
>    struct vfio_iommu_type1_bind {
>            __u32 argsz;
>            struct iommu_gpasid_bind_data bind_data;
>    };

We've been rather successful at extending ioctls in vfio and I'm
generally opposed to rampant ioctl proliferation. If we added @flags to
the above struct (as is pretty much the standard for vfio ioctls), then
we could use it to describe the type of binding to perform and
therefore the type of data provided. I think my major complaint here
was that we were defining PROCESS but not implementing it. We can
design the ioctl to enable it, but not define it until it's
implemented.

> b) how kernel-space learns the number of bytes to be copied (a.k.a.
>    the usage of the @version field and @format field of struct
>    iommu_gpasid_bind_data)
>    Yi: Jean has an excellent recap in a prior reply on the plan of
>    future extensions with regard to the @version field and @format
>    field. Based on the plan, kernel space needs to parse the @version
>    field and @format field to get the length of the current
>    BIND_GPASID request. Also, the kernel needs to maintain the new
>    and old structure versions and follow a specific deprecation
>    policy in future.

Yes, it seems reasonable, so from the struct above (plus @flags) we
could determine we have struct iommu_gpasid_bind_data as the payload
and read that using @version and @format as outlined.
> c) how can the vIOMMU emulator know that the vfio interface supports
>    configuring dual stage translation for vIOMMU?
>    Yi: may do it via VFIO_IOMMU_GET_INFO.

Yes please.

> d) how can the vIOMMU emulator know what @version and @format should
>    be set in struct iommu_gpasid_bind_data?
>    Yi: currently, we have two ways. First one, may do it via
>    VFIO_IOMMU_GET_INFO. This is a natural idea as here @version and
>    @format are used in vfio APIs. It makes sense to let vfio provide
>    related info to the vIOMMU emulator after checking with the vendor
>    specific iommu driver. Also, there is an idea to do it via sysfs
>    (/sys/class/iommu/dmar#) as we have a plan to do IOMMU capability
>    sync between vIOMMU and pIOMMU via sysfs. I have concerns with
>    this option. Current iommu sysfs only provides vendor specific
>    hardware infos. I'm not sure if it is good to expose infos defined
>    in the IOMMU generic layer via iommu sysfs. If this concern is not
>    a big thing, I'm fine with both options.

This seems like the same issue we had with IOMMU reserved regions; I'd
prefer that a user can figure out how to interact with the vfio
interface through the vfio interface. Forcing the user to poke around
in sysfs requires the user to have read permissions to sysfs in places
they otherwise wouldn't need.
Thanks,
Alex

> > From: Jean-Philippe Brucker [mailto:jean-phili...@linaro.org]
> > Sent: Wednesday, November 13, 2019 6:29 PM
> > To: Liu, Yi L
> > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page
> > tables) to host
> >
> > On Wed, Nov 13, 2019 at 07:43:43AM +, Liu, Yi L wrote:
> > > > From: Alex Williamson
> > > > Sent: Wednesday, November 13, 2019 1:26 AM
> > > > To: Liu, Yi L
> > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest
> > > > page tables) to host
> > > >
> > > > On Tue, 12 Nov 2019 11:21:40 + "Liu, Yi L" wrote:
> > > > > > From: Alex Williamson <alex.william...@redhat.com>
> > > > > > Sent: Friday, November 8, 2019 7:21 AM
> > > > > > To: Liu, Yi L
> > > > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid
> > > > > > (guest page tables) to host
> > > > > >
> > > > > > On Thu, 24 Oct 2019 08:26:23 -0400 Liu Yi L wrote:
> > > > > > > This patch adds vfio support to bind guest translation
> > > > > > > structure to host iommu. VFIO exposes iommu programming
> > > > > > > capability to user-space. Guest is a user-space
> > > > > > > application in host under KVM solution. For SVA usage in
> > > > > > > Virtual Machine, guest owns GVA->GPA translation
> > > > > > > structure. And this part should be passdown to host to
> > > > > > > enable nested translation (or say two stage translation).
> > > > > > > This patch reuses the VFIO_IOMMU_BIND proposal from
> > > > > > > Jean-Philippe Brucker, and a
Re: [PATCH v2 5/8] iommu/vt-d: Add first level page table interfaces
On Thu, 28 Nov 2019 10:25:47 +0800 Lu Baolu wrote:
> This adds functions to manipulate first level page tables
> which could be used by a scalable mode capable IOMMU unit.

FL and SL page tables are very similar, and I presume we are not using
all the flag bits in FL paging structures for DMA mapping. Are there
enough relevant differences to warrant a new set of helper functions
for FL? Or we can merge into one.

> Cc: Ashok Raj
> Cc: Jacob Pan
> Cc: Kevin Tian
> Cc: Liu Yi L
> Cc: Yi Sun
> Signed-off-by: Lu Baolu
> ---
>  drivers/iommu/Makefile             |   2 +-
>  drivers/iommu/intel-iommu.c        |  33 +++
>  drivers/iommu/intel-pgtable.c      | 376 +
>  include/linux/intel-iommu.h        |  33 ++-
>  include/trace/events/intel_iommu.h |  60 +
>  5 files changed, 502 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/iommu/intel-pgtable.c
>
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 35d17094fe3b..aa04f4c3ae26 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -18,7 +18,7 @@ obj-$(CONFIG_ARM_SMMU) += arm-smmu.o arm-smmu-impl.o
>  obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o
>  obj-$(CONFIG_DMAR_TABLE) += dmar.o
>  obj-$(CONFIG_INTEL_IOMMU) += intel-iommu.o intel-pasid.o
> -obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o
> +obj-$(CONFIG_INTEL_IOMMU) += intel-trace.o intel-pgtable.o
>  obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += intel-iommu-debugfs.o
>  obj-$(CONFIG_INTEL_IOMMU_SVM) += intel-svm.o
>  obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 66f76f6df2c2..a314892ee72b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1670,6 +1670,37 @@ static void free_dmar_iommu(struct intel_iommu *iommu)
>  #endif
>  }
>
> +/* First level 5-level paging support */
> +static bool first_lvl_5lp_support(void)
> +{
> +	struct dmar_drhd_unit *drhd;
> +	struct intel_iommu *iommu;
> +	static int first_level_5lp_supported = -1;
> +
> +	if (likely(first_level_5lp_supported != -1))
> +		return first_level_5lp_supported;
> +
> +	first_level_5lp_supported = 1;
> +#ifdef CONFIG_X86
> +	/* Match IOMMU first level and CPU paging mode */
> +	if (!cpu_feature_enabled(X86_FEATURE_LA57)) {
> +		first_level_5lp_supported = 0;
> +		return first_level_5lp_supported;
> +	}
> +#endif /* #ifdef CONFIG_X86 */
> +
> +	rcu_read_lock();
> +	for_each_active_iommu(iommu, drhd) {
> +		if (!cap_5lp_support(iommu->cap)) {
> +			first_level_5lp_supported = 0;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return first_level_5lp_supported;
> +}
> +
>  static struct dmar_domain *alloc_domain(int flags)
>  {
>  	struct dmar_domain *domain;
> @@ -1683,6 +1714,8 @@ static struct dmar_domain *alloc_domain(int flags)
>  	domain->flags = flags;
>  	domain->has_iotlb_device = false;
>  	domain->ops = &second_lvl_pgtable_ops;
> +	domain->first_lvl_5lp = first_lvl_5lp_support();
> +	spin_lock_init(&domain->page_table_lock);
>  	INIT_LIST_HEAD(&domain->devices);
>
>  	return domain;
> diff --git a/drivers/iommu/intel-pgtable.c b/drivers/iommu/intel-pgtable.c
> new file mode 100644
> index ..4a26d08a7570
> --- /dev/null
> +++ b/drivers/iommu/intel-pgtable.c
> @@ -0,0 +1,376 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/**
> + * intel-pgtable.c - Intel IOMMU page table manipulation library
> + *
> + * Copyright (C) 2019 Intel Corporation
> + *
> + * Author: Lu Baolu
> + */
> +
> +#define pr_fmt(fmt)	"DMAR: " fmt
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +/*
> + * first_lvl_map: Map a range of IO virtual address to physical
> + * addresses.
> + */
> +#ifdef CONFIG_X86
> +#define pgtable_populate(domain, nm)					\
> +do {									\
> +	void *__new = alloc_pgtable_page(domain->nid);			\
> +	if (!__new)							\
> +		return -ENOMEM;						\
> +	smp_wmb();							\
> +	spin_lock(&(domain)->page_table_lock);				\
> +	if (nm ## _present(*nm)) {					\
> +		free_pgtable_page(__new);				\
> +	} else {							\
> +		set_##nm(nm, __##nm(__pa(__new) | _PAGE_TABLE));	\
> +		domain_flush_cache(domain, nm, sizeof(nm##_t));		\
> +	}								\
> +	spin_unlock(&(domain)->page_table_lock);			\
> +} while (0)
> +
> +static int
> +first_lvl_map_pte_range(struct dmar_domain *domain, pmd_t *pmd,
> +			unsigned long addr
Re: [git pull] IOMMU Updates for Linux v5.5
The pull request you sent on Fri, 29 Nov 2019 13:02:51 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git tags/iommu-updates-v5.5

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1daa56bcfd8b329447e0c1b1e91c3925d08489b7

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH v5 8/8] iommu/vt-d: Misc macro clean up for SVM
On Mon, 2019-12-02 at 11:58 -0800, Jacob Pan wrote:
> Use combined macros for_each_svm_dev() to simplify SVM device
> iteration and error checking.
[]
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
[]
> @@ -427,40 +430,36 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
[]
> +	for_each_svm_dev(sdev, svm, dev) {
> +		ret = 0;
> +		sdev->users--;
> +		if (!sdev->users) {

This might be better by reducing indentation here too

		if (sdev->users)
			break;

> +			list_del_rcu(&sdev->list);

to reduce indentation 1 level below this

> +			/* Flush the PASID cache and IOTLB for this device.
> +			 * Note that we do depend on the hardware *not* using
> +			 * the PASID any more. Just as we depend on other
> +			 * devices never using PASIDs that they have no right
> +			 * to use. We have a *shared* PASID table, because it's
> +			 * large and has to be physically contiguous. So it's
> +			 * hard to be as defensive as we might like. */
> +			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
> +			intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
> +			kfree_rcu(sdev, rcu);
> +
> +			if (list_empty(&svm->devs)) {
> +				ioasid_free(svm->pasid);
> +				if (svm->mm)
> +					mmu_notifier_unregister(&svm->notifier, svm->mm);
> +				list_del(&svm->list);
> +				/* We mandate that no page faults may be
> +				 * outstanding for the PASID when
> +				 * intel_svm_unbind_mm() is called. If that is
> +				 * not obeyed, subtle errors will happen.
> +				 * Let's make them less subtle... */
> +				memset(svm, 0x6b, sizeof(*svm));
> +				kfree(svm);
>  			}
> -			break;
>  		}
> +		break;
>  	}
>  out:
>  	mutex_unlock(&pasid_mutex);
Re: [PATCH v2 0/8] Use 1st-level for DMA remapping
On Thu, 28 Nov 2019 10:25:42 +0800 Lu Baolu wrote:
> Intel VT-d in scalable mode supports two types of page talbes

tables

> for DMA translation: the first level page table and the second
> level page table. The first level page table uses the same
> format as the CPU page table, while the second level page table
> keeps compatible with previous formats. The software is able
> to choose any one of them for DMA remapping according to the use
> case.
>
> This patchset aims to move IOVA (I/O Virtual Address) translation

move guest IOVA only, right?

> to 1st-level page table in scalable mode. This will simplify vIOMMU
> (IOMMU simulated by VM hypervisor) design by using the two-stage
> translation, a.k.a. nested mode translation.
>
> As Intel VT-d architecture offers caching mode, guest IOVA (GIOVA)
> support is now implemented in a shadow page manner. The device
> simulation software, like QEMU, has to figure out GIOVA->GPA mappings
> and write them to a shadowed page table, which will be used by the
> physical IOMMU. Each time when mappings are created or destroyed in
> vIOMMU, the simulation software has to intervene. Hence, the changes
> on GIOVA->GPA could be shadowed to host.
>
> [diagram: on an IOTLB flush trap (map/unmap), QEMU computes shadow
> GIOVA->HPA mappings from the vIOMMU's GIOVA->GPA table and programs
> the pIOMMU through the VFIO/IOMMU API]
>
> In VT-d 3.0, scalable mode is introduced, which offers two-level
> translation page tables and nested translation mode. Regards to
> GIOVA support, it can be simplified by 1) moving the GIOVA support
> over 1st-level page table to store GIOVA->GPA mapping in vIOMMU,
> 2) binding vIOMMU 1st level page table to the pIOMMU, 3) using pIOMMU
> second level for GPA->HPA translation, and 4) enable nested (a.k.a.
> dual-stage) translation in host. Compared with current shadow GIOVA
> support, the new approach makes the vIOMMU design simpler and more
> efficient as we only need to flush the pIOMMU IOTLB and possible
> device-IOTLB when an IOVA mapping in vIOMMU is torn down.
>
> [diagram: on an IOTLB flush trap (unmap), QEMU forwards cache
> invalidation and guest page table bind requests through the
> VFIO/IOMMU interfaces; the pIOMMU uses the bound GIOVA->GPA table as
> the first level and GPA->HPA as the second level]
>
> This patch set includes two parts. The former part implements the
> per-domain page table abstraction, which makes the page table
> difference transparent to various map/unmap APIs. The later part

s/later/latter/

> applies the first level page table for IOVA translation unless the
> DOMAIN_ATTR_NESTING domain attribution has been set, which indicates
> nested mode in use.

Maybe I am reading this wrong, but shouldn't it be the opposite? i.e.
Use FL page table for IOVA if it is a nesting domain?

> Based-on-idea-by: Ashok Raj
> Based-on-idea-by: Kevin Tian
> Based-on-idea-by: Liu Yi L
> Based-on-idea-by: Jacob Pan
> Based-on-idea-by: Sanjay Kumar
> Based-on-idea-by: Lu Baolu
>
> Change log:
>
> v1->v2
> - The first series was posted here
>   https://lkml.org/lkml/2019/9/23/297
> - Use per domain page table ops to handle different page tables.
> - Use first level for DMA remapping by default on both bare metal
>   and vm guest.
> - Code refine according to code review comments for v1.
>
> Lu Baolu (8):
>   iommu/vt-d: Add per domain page table ops
>   iommu/vt-d: Move domain_flush_cache helper into header
>   iommu/vt-d: Implement second level page table ops
>   iommu/vt-d: Apply per domain second level page table ops
>   iommu/vt-d: Add first level page table interfaces
>   iommu/vt-d: Implement first level page table ops
>   iommu/vt-d: Identify
Re: [PATCH 0/5] iommu/vt-d: Consolidate various cache flush ops
On Fri, 22 Nov 2019 11:04:44 +0800 Lu Baolu wrote:
> Intel VT-d 3.0 introduces more caches and interfaces for software to
> flush when it runs in the scalable mode. Currently various cache
> flush helpers are scattered around. This consolidates them by putting
> them in the existing iommu_flush structure.
>
> /* struct iommu_flush - Intel IOMMU cache invalidation ops
>  *
>  * @cc_inv: invalidate context cache
>  * @iotlb_inv: invalidate IOTLB and paging structure caches when
>  *             software has changed second-level tables
>  * @p_iotlb_inv: invalidate IOTLB and paging structure caches when
>  *             software has changed first-level tables
>  * @pc_inv: invalidate pasid cache
>  * @dev_tlb_inv: invalidate cached mappings used by
>  *             requests-without-PASID from the Device-TLB on an
>  *             endpoint device
>  * @p_dev_tlb_inv: invalidate cached mappings used by
>  *             requests-with-PASID from the Device-TLB on an
>  *             endpoint device
>  */
> struct iommu_flush {
> 	void (*cc_inv)(struct intel_iommu *iommu, u16 did,
> 		       u16 sid, u8 fm, u64 type);
> 	void (*iotlb_inv)(struct intel_iommu *iommu, u16 did, u64 addr,
> 			  unsigned int size_order, u64 type);
> 	void (*p_iotlb_inv)(struct intel_iommu *iommu, u16 did, u32 pasid,
> 			    u64 addr, unsigned long npages, bool ih);
> 	void (*pc_inv)(struct intel_iommu *iommu, u16 did, u32 pasid,
> 		       u64 granu);
> 	void (*dev_tlb_inv)(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> 			    u16 qdep, u64 addr, unsigned int mask);
> 	void (*p_dev_tlb_inv)(struct intel_iommu *iommu, u16 sid, u16 pfsid,
> 			      u32 pasid, u16 qdep, u64 addr,
> 			      unsigned long npages);
> };
>
> The name of each cache flush op is defined according to spec section
> 6.5, so that people can easily look them up in the spec.

Nice consolidation. For nested SVM, I also introduced cache flush
helpers as needed. https://lkml.org/lkml/2019/10/24/857

Should I wait for yours to be merged, or do you want to extend this
consolidation after SVA/SVM cache flush? I expect to send my v8
shortly.

> Best regards,
> Lu Baolu
>
> Lu Baolu (5):
>   iommu/vt-d: Extend iommu_flush for scalable mode
>   iommu/vt-d: Consolidate pasid cache invalidation
>   iommu/vt-d: Consolidate device tlb invalidation
>   iommu/vt-d: Consolidate pasid-based tlb invalidation
>   iommu/vt-d: Consolidate pasid-based device tlb invalidation
>
>  drivers/iommu/dmar.c        |  61 -
>  drivers/iommu/intel-iommu.c | 246 +---
>  drivers/iommu/intel-pasid.c |  39 +-
>  drivers/iommu/intel-svm.c   |  60 ++---
>  include/linux/intel-iommu.h |  39 --
>  5 files changed, 244 insertions(+), 201 deletions(-)

[Jacob Pan]
[PATCH v5 2/8] iommu/vt-d: Match CPU and IOMMU paging mode
When setting up first level page tables for sharing with CPU, we need
to ensure the IOMMU can support no fewer levels than the CPU. It is
not adequate, as in the current code, to set up 5-level paging in the
PASID entry First Level Paging Mode (FLPM) field based solely on the
CPU.

Currently, intel_pasid_setup_first_level() is only used by native SVM
code, which already checks that the paging modes match. However,
future use of this helper function may not be limited to native SVM.
https://lkml.org/lkml/2019/11/18/1037

Fixes: 437f35e1cd4c8 ("iommu/vt-d: Add first level page table interface")
Signed-off-by: Jacob Pan
Reviewed-by: Eric Auger
Acked-by: Lu Baolu
---
 drivers/iommu/intel-pasid.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
index 040a445be300..e7cb0b8a7332 100644
--- a/drivers/iommu/intel-pasid.c
+++ b/drivers/iommu/intel-pasid.c
@@ -499,8 +499,16 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 	}
 
 #ifdef CONFIG_X86
-	if (cpu_feature_enabled(X86_FEATURE_LA57))
-		pasid_set_flpm(pte, 1);
+	/* Both CPU and IOMMU paging mode need to match */
+	if (cpu_feature_enabled(X86_FEATURE_LA57)) {
+		if (cap_5lp_support(iommu->cap)) {
+			pasid_set_flpm(pte, 1);
+		} else {
+			pr_err("VT-d has no 5-level paging support for CPU\n");
+			pasid_clear_entry(pte);
+			return -EINVAL;
+		}
+	}
 #endif /* CONFIG_X86 */
 
 	pasid_set_domain_id(pte, did);
-- 
2.7.4
[PATCH v5 0/8] VT-d Native Shared virtual memory cleanup and fixes
Mostly extracted from the nested SVA/SVM series, based on review
comments of v7: https://lkml.org/lkml/2019/10/24/852

This series also adds a few important fixes for native use of SVA.
Nested SVA new code will be submitted separately as a smaller set.

Based on the core branch of the IOMMU tree staged for v5.5, where
common APIs for vSVA were applied.
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core

Changelog:
v5	- Regrouped patch 6 and 8, added comments suggested by Joe Perches
v4	- Commit message fix
v3	- Squashed 1/10 & 2/10
	- Deleted "8/10 Fix PASID cache flush" from this series
	- Addressed reviews from Eric Auger and Baolu
v2	- Coding style fixes based on Baolu's input, no functional change
	- Added Acked-by tags.

Thanks,
Jacob

Jacob Pan (8):
  iommu/vt-d: Fix CPU and IOMMU SVM feature matching checks
  iommu/vt-d: Match CPU and IOMMU paging mode
  iommu/vt-d: Reject SVM bind for failed capability check
  iommu/vt-d: Avoid duplicated code for PASID setup
  iommu/vt-d: Fix off-by-one in PASID allocation
  iommu/vt-d: Replace Intel specific PASID allocator with IOASID
  iommu/vt-d: Avoid sending invalid page response
  iommu/vt-d: Misc macro clean up for SVM

 drivers/iommu/Kconfig       |   1 +
 drivers/iommu/intel-iommu.c |  23 +++
 drivers/iommu/intel-pasid.c |  96 --
 drivers/iommu/intel-svm.c   | 163 +---
 include/linux/intel-iommu.h |   5 +-
 5 files changed, 135 insertions(+), 153 deletions(-)

-- 
2.7.4
[PATCH v5 3/8] iommu/vt-d: Reject SVM bind for failed capability check
Add a check during SVM bind to ensure CPU and IOMMU hardware
capabilities are met.

Signed-off-by: Jacob Pan
Reviewed-by: Eric Auger
Acked-by: Lu Baolu
---
 drivers/iommu/intel-svm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 716c543488f6..74df10a39dfc 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -238,6 +238,9 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 	if (!iommu || dmar_disabled)
 		return -EINVAL;
 
+	if (!intel_svm_capable(iommu))
+		return -ENOTSUPP;
+
 	if (dev_is_pci(dev)) {
 		pasid_max = pci_max_pasids(to_pci_dev(dev));
 		if (pasid_max < 0)
-- 
2.7.4
[PATCH v5 8/8] iommu/vt-d: Misc macro clean up for SVM
Use the combined macro for_each_svm_dev() to simplify SVM device
iteration and error checking.

Suggested-by: Andy Shevchenko
Signed-off-by: Jacob Pan
Reviewed-by: Eric Auger
Acked-by: Lu Baolu
---
 drivers/iommu/intel-svm.c | 79 +++
 1 file changed, 39 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 4ed7426473c1..0fcbe631cd5f 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -226,6 +226,10 @@ static const struct mmu_notifier_ops intel_mmuops = {
 static DEFINE_MUTEX(pasid_mutex);
 static LIST_HEAD(global_svm_list);
 
+#define for_each_svm_dev(sdev, svm, d)			\
+	list_for_each_entry((sdev), &(svm)->devs, list)	\
+		if ((d) != (sdev)->dev) {} else
+
 int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ops *ops)
 {
 	struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
@@ -274,15 +278,14 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_
 				goto out;
 			}
 
-			list_for_each_entry(sdev, &svm->devs, list) {
-				if (dev == sdev->dev) {
-					if (sdev->ops != ops) {
-						ret = -EBUSY;
-						goto out;
-					}
-					sdev->users++;
-					goto success;
+			/* Find the matching device in svm list */
+			for_each_svm_dev(sdev, svm, dev) {
+				if (sdev->ops != ops) {
+					ret = -EBUSY;
+					goto out;
 				}
+				sdev->users++;
+				goto success;
 			}
 
 			break;
@@ -427,40 +430,36 @@ int intel_svm_unbind_mm(struct device *dev, int pasid)
 		goto out;
 	}
 
-	list_for_each_entry(sdev, &svm->devs, list) {
-		if (dev == sdev->dev) {
-			ret = 0;
-			sdev->users--;
-			if (!sdev->users) {
-				list_del_rcu(&sdev->list);
-				/* Flush the PASID cache and IOTLB for this device.
-				 * Note that we do depend on the hardware *not* using
-				 * the PASID any more. Just as we depend on other
-				 * devices never using PASIDs that they have no right
-				 * to use. We have a *shared* PASID table, because it's
-				 * large and has to be physically contiguous. So it's
-				 * hard to be as defensive as we might like. */
-				intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
-				intel_flush_svm_range_dev(svm, sdev, 0, -1, 0);
-				kfree_rcu(sdev, rcu);
-
-				if (list_empty(&svm->devs)) {
-					ioasid_free(svm->pasid);
-					if (svm->mm)
-						mmu_notifier_unregister(&svm->notifier, svm->mm);
-
-					list_del(&svm->list);
-
-					/* We mandate that no page faults may be outstanding
-					 * for the PASID when intel_svm_unbind_mm() is called.
-					 * If that is not obeyed, subtle errors will happen.
-					 * Let's make them less subtle... */
-					memset(svm, 0x6b, sizeof(*svm));
-					kfree(svm);
-				}
+	for_each_svm_dev(sdev, svm, dev) {
+		ret = 0;
+		sdev->users--;
+		if (!sdev->users) {
+			list_del_rcu(&sdev->list);
+			/* Flush the PASID cache and IOTLB for this device.
+			 * Note that we do depend on the hardware *not* using
+			 * the PASID any more. Just as we depend on other
+			 * devices never using PASIDs that they have no right
+			 * to use. We have a *shared* PASID table, because it's
+			 * large and has to be physically contiguous. So it's
+			 * hard to be as defensive as we might like. */
+			intel_pasid_tear_down_entry(iommu, dev, svm->pasid);
+			intel_flush_svm_range
[PATCH v5 6/8] iommu/vt-d: Replace Intel specific PASID allocator with IOASID
Make use of generic IOASID code to manage PASID allocation, free, and lookup. Replace Intel specific code. Signed-off-by: Jacob Pan Reviewed-by: Lu Baolu Reviewed-by: Eric Auger Acked-by: Lu Baolu --- drivers/iommu/Kconfig | 1 + drivers/iommu/intel-iommu.c | 13 +++-- drivers/iommu/intel-pasid.c | 36 drivers/iommu/intel-svm.c | 36 ++-- 4 files changed, 30 insertions(+), 56 deletions(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index fd50ddbf..43ce450a40d3 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -212,6 +212,7 @@ config INTEL_IOMMU_SVM depends on INTEL_IOMMU && X86 select PCI_PASID select MMU_NOTIFIER + select IOASID help Shared Virtual Memory (SVM) provides a facility for devices to access DMA resources through process address space by diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index d598168e410d..a84f0caa33a0 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -5238,7 +5238,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain, domain->auxd_refcnt--; if (!domain->auxd_refcnt && domain->default_pasid > 0) - intel_pasid_free_id(domain->default_pasid); + ioasid_free(domain->default_pasid); } static int aux_domain_add_dev(struct dmar_domain *domain, @@ -5256,10 +5256,11 @@ static int aux_domain_add_dev(struct dmar_domain *domain, if (domain->default_pasid <= 0) { int pasid; - pasid = intel_pasid_alloc_id(domain, PASID_MIN, -pci_max_pasids(to_pci_dev(dev)), -GFP_KERNEL); - if (pasid <= 0) { + /* No private data needed for the default pasid */ + pasid = ioasid_alloc(NULL, PASID_MIN, +pci_max_pasids(to_pci_dev(dev)) - 1, +NULL); + if (pasid == INVALID_IOASID) { pr_err("Can't allocate default pasid\n"); return -ENODEV; } @@ -5295,7 +5296,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain, spin_unlock(&iommu->lock); spin_unlock_irqrestore(&device_domain_lock, flags); if (!domain->auxd_refcnt && domain->default_pasid > 0) - 
intel_pasid_free_id(domain->default_pasid); + ioasid_free(domain->default_pasid); return ret; } diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c index 732bfee228df..3cb569e76642 100644 --- a/drivers/iommu/intel-pasid.c +++ b/drivers/iommu/intel-pasid.c @@ -26,42 +26,6 @@ */ static DEFINE_SPINLOCK(pasid_lock); u32 intel_pasid_max_id = PASID_MAX; -static DEFINE_IDR(pasid_idr); - -int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp) -{ - int ret, min, max; - - min = max_t(int, start, PASID_MIN); - max = min_t(int, end, intel_pasid_max_id); - - WARN_ON(in_interrupt()); - idr_preload(gfp); - spin_lock(&pasid_lock); - ret = idr_alloc(&pasid_idr, ptr, min, max, GFP_ATOMIC); - spin_unlock(&pasid_lock); - idr_preload_end(); - - return ret; -} - -void intel_pasid_free_id(int pasid) -{ - spin_lock(&pasid_lock); - idr_remove(&pasid_idr, pasid); - spin_unlock(&pasid_lock); -} - -void *intel_pasid_lookup_id(int pasid) -{ - void *p; - - spin_lock(&pasid_lock); - p = idr_find(&pasid_idr, pasid); - spin_unlock(&pasid_lock); - - return p; -} /* * Per device pasid table management: diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c index e90d0b914afe..62f815e0ae57 100644 --- a/drivers/iommu/intel-svm.c +++ b/drivers/iommu/intel-svm.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include "intel-pasid.h" @@ -335,16 +336,15 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ if (pasid_max > intel_pasid_max_id) pasid_max = intel_pasid_max_id; - /* Do not use PASID 0 in caching mode (virtualised IOMMU) */ - ret = intel_pasid_alloc_id(svm, - !!cap_caching_mode(iommu->cap), - pasid_max, GFP_KERNEL); - if (ret < 0) { + /* Do not use PASID 0, reserved for RID to PASID */ + svm->pasid = ioasid_alloc(NULL, PASID_MIN, + pasid_max - 1, svm); + if (svm->pasid == INVALID_IOASID) { kfree(svm); kfree(sdev); + ret = -ENOSPC; goto out; } -
[PATCH v5 7/8] iommu/vt-d: Avoid sending invalid page response
Page responses should only be sent when last page in group (LPIG) or private data is present in the page request. This patch avoids sending invalid descriptors. Fixes: 5d308fc1ecf53 ("iommu/vt-d: Add 256-bit invalidation descriptor support") Signed-off-by: Jacob Pan Reviewed-by: Eric Auger Acked-by: Lu Baolu --- drivers/iommu/intel-svm.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c index 62f815e0ae57..4ed7426473c1 100644 --- a/drivers/iommu/intel-svm.c +++ b/drivers/iommu/intel-svm.c @@ -683,11 +683,10 @@ static irqreturn_t prq_event_thread(int irq, void *d) if (req->priv_data_present) memcpy(&resp.qw2, req->priv_data, sizeof(req->priv_data)); + resp.qw2 = 0; + resp.qw3 = 0; + qi_submit_sync(&resp, iommu); } - resp.qw2 = 0; - resp.qw3 = 0; - qi_submit_sync(&resp, iommu); - head = (head + sizeof(*req)) & PRQ_RING_MASK; } -- 2.7.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v5 4/8] iommu/vt-d: Avoid duplicated code for PASID setup
After each setup for PASID entry, related translation caches must be flushed. We can combine duplicated code into one function which is less error prone. Signed-off-by: Jacob Pan Reviewed-by: Lu Baolu Reviewed-by: Eric Auger Acked-by: Lu Baolu --- drivers/iommu/intel-pasid.c | 48 + 1 file changed, 18 insertions(+), 30 deletions(-) diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c index e7cb0b8a7332..732bfee228df 100644 --- a/drivers/iommu/intel-pasid.c +++ b/drivers/iommu/intel-pasid.c @@ -465,6 +465,21 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, devtlb_invalidation_with_pasid(iommu, dev, pasid); } +static void pasid_flush_caches(struct intel_iommu *iommu, + struct pasid_entry *pte, + int pasid, u16 did) +{ + if (!ecap_coherent(iommu->ecap)) + clflush_cache_range(pte, sizeof(*pte)); + + if (cap_caching_mode(iommu->cap)) { + pasid_cache_invalidation_with_pasid(iommu, did, pasid); + iotlb_invalidation_with_pasid(iommu, did, pasid); + } else { + iommu_flush_write_buffer(iommu); + } +} + /* * Set up the scalable mode pasid table entry for first only * translation type. 
@@ -518,16 +533,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu, /* Setup Present and PASID Granular Transfer Type: */ pasid_set_translation_type(pte, 1); pasid_set_present(pte); - - if (!ecap_coherent(iommu->ecap)) - clflush_cache_range(pte, sizeof(*pte)); - - if (cap_caching_mode(iommu->cap)) { - pasid_cache_invalidation_with_pasid(iommu, did, pasid); - iotlb_invalidation_with_pasid(iommu, did, pasid); - } else { - iommu_flush_write_buffer(iommu); - } + pasid_flush_caches(iommu, pte, pasid, did); return 0; } @@ -591,16 +597,7 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, */ pasid_set_sre(pte); pasid_set_present(pte); - - if (!ecap_coherent(iommu->ecap)) - clflush_cache_range(pte, sizeof(*pte)); - - if (cap_caching_mode(iommu->cap)) { - pasid_cache_invalidation_with_pasid(iommu, did, pasid); - iotlb_invalidation_with_pasid(iommu, did, pasid); - } else { - iommu_flush_write_buffer(iommu); - } + pasid_flush_caches(iommu, pte, pasid, did); return 0; } @@ -634,16 +631,7 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu, */ pasid_set_sre(pte); pasid_set_present(pte); - - if (!ecap_coherent(iommu->ecap)) - clflush_cache_range(pte, sizeof(*pte)); - - if (cap_caching_mode(iommu->cap)) { - pasid_cache_invalidation_with_pasid(iommu, did, pasid); - iotlb_invalidation_with_pasid(iommu, did, pasid); - } else { - iommu_flush_write_buffer(iommu); - } + pasid_flush_caches(iommu, pte, pasid, did); return 0; } -- 2.7.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v5 5/8] iommu/vt-d: Fix off-by-one in PASID allocation
PASID allocator uses IDR which is exclusive for the end of the allocation range. There is no need to decrement pasid_max. Fixes: af39507305fb ("iommu/vt-d: Apply global PASID in SVA") Reported-by: Eric Auger Signed-off-by: Jacob Pan Reviewed-by: Eric Auger Acked-by: Lu Baolu --- drivers/iommu/intel-svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c index 74df10a39dfc..e90d0b914afe 100644 --- a/drivers/iommu/intel-svm.c +++ b/drivers/iommu/intel-svm.c @@ -338,7 +338,7 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, struct svm_dev_ /* Do not use PASID 0 in caching mode (virtualised IOMMU) */ ret = intel_pasid_alloc_id(svm, !!cap_caching_mode(iommu->cap), - pasid_max - 1, GFP_KERNEL); + pasid_max, GFP_KERNEL); if (ret < 0) { kfree(svm); kfree(sdev); -- 2.7.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v5 1/8] iommu/vt-d: Fix CPU and IOMMU SVM feature matching checks
Shared Virtual Memory (SVM) is based on a collective set of hardware features detected at runtime. There are requirements for matching CPU and IOMMU capabilities. The current code checks the CPU and IOMMU feature sets for SVM support, but the result is never stored nor used. Therefore, SVM can still be used even when these checks fail. The consequences can be: 1. The CPU uses 5-level paging mode for 57-bit virtual addresses, but the IOMMU can only support 4-level paging mode with 48-bit addresses for DMA. 2. A 1GB page size is used by the CPU but the IOMMU does not support it. VT-d unrecoverable faults may be generated. The best solution is to prevent these problems in the first place. This patch consolidates the code for checking PASID and CPU vs. IOMMU paging mode compatibility, and provides a specific error message for each failed check. On sane hardware configurations, these error messages should never appear in the kernel log. Signed-off-by: Jacob Pan Reviewed-by: Eric Auger Acked-by: Lu Baolu --- drivers/iommu/intel-iommu.c | 10 ++ drivers/iommu/intel-svm.c | 40 +++- include/linux/intel-iommu.h | 5 - 3 files changed, 33 insertions(+), 22 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 3f974919d3bd..d598168e410d 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -3289,10 +3289,7 @@ static int __init init_dmars(void) if (!ecap_pass_through(iommu->ecap)) hw_pass_through = 0; -#ifdef CONFIG_INTEL_IOMMU_SVM - if (pasid_supported(iommu)) - intel_svm_init(iommu); -#endif + intel_svm_check(iommu); } /* @@ -4471,10 +4468,7 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru) if (ret) goto out; -#ifdef CONFIG_INTEL_IOMMU_SVM - if (pasid_supported(iommu)) - intel_svm_init(iommu); -#endif + intel_svm_check(iommu); if (dmaru->ignored) { /* diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c index 9b159132405d..716c543488f6 100644 --- a/drivers/iommu/intel-svm.c +++ b/drivers/iommu/intel-svm.c @@
-23,19 +23,6 @@ static irqreturn_t prq_event_thread(int irq, void *d); -int intel_svm_init(struct intel_iommu *iommu) -{ - if (cpu_feature_enabled(X86_FEATURE_GBPAGES) && - !cap_fl1gp_support(iommu->cap)) - return -EINVAL; - - if (cpu_feature_enabled(X86_FEATURE_LA57) && - !cap_5lp_support(iommu->cap)) - return -EINVAL; - - return 0; -} - #define PRQ_ORDER 0 int intel_svm_enable_prq(struct intel_iommu *iommu) @@ -99,6 +86,33 @@ int intel_svm_finish_prq(struct intel_iommu *iommu) return 0; } +static inline bool intel_svm_capable(struct intel_iommu *iommu) +{ + return iommu->flags & VTD_FLAG_SVM_CAPABLE; +} + +void intel_svm_check(struct intel_iommu *iommu) +{ + if (!pasid_supported(iommu)) + return; + + if (cpu_feature_enabled(X86_FEATURE_GBPAGES) && + !cap_fl1gp_support(iommu->cap)) { + pr_err("%s SVM disabled, incompatible 1GB page capability\n", + iommu->name); + return; + } + + if (cpu_feature_enabled(X86_FEATURE_LA57) && + !cap_5lp_support(iommu->cap)) { + pr_err("%s SVM disabled, incompatible paging mode\n", + iommu->name); + return; + } + + iommu->flags |= VTD_FLAG_SVM_CAPABLE; +} + static void intel_flush_svm_range_dev (struct intel_svm *svm, struct intel_svm_dev *sdev, unsigned long address, unsigned long pages, int ih) { diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index ed11ef594378..7dcfa1c4a844 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -433,6 +433,7 @@ enum { #define VTD_FLAG_TRANS_PRE_ENABLED (1 << 0) #define VTD_FLAG_IRQ_REMAP_PRE_ENABLED (1 << 1) +#define VTD_FLAG_SVM_CAPABLE (1 << 2) extern int intel_iommu_sm; @@ -656,7 +657,7 @@ void iommu_flush_write_buffer(struct intel_iommu *iommu); int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev); #ifdef CONFIG_INTEL_IOMMU_SVM -int intel_svm_init(struct intel_iommu *iommu); +extern void intel_svm_check(struct intel_iommu *iommu); extern int intel_svm_enable_prq(struct intel_iommu *iommu); extern int 
intel_svm_finish_prq(struct intel_iommu *iommu); @@ -684,6 +685,8 @@ struct intel_svm { }; extern struct intel_iommu *intel_svm_device_to_iommu(struct device *dev); +#else +static inline void intel_svm_check(struct intel_iommu *iommu) {} #endif #ifdef CONFIG_INTEL_IOMMU_DEBUGFS -- 2.7.4
Re: [PATCH v4 8/8] iommu/vt-d: Misc macro clean up for SVM
On Mon, 02 Dec 2019 10:22:13 -0800 Joe Perches wrote: > On Mon, 2019-12-02 at 10:15 -0800, Jacob Pan wrote: > > On Thu, 21 Nov 2019 13:37:10 -0800 > > Joe Perches wrote: > > > > > On Thu, 2019-11-21 at 13:26 -0800, Jacob Pan wrote: > > > > Use combined macros for_each_svm_dev() to simplify SVM device > > > > iteration and error checking. > > > [] > > > > diff --git a/drivers/iommu/intel-svm.c > > > > b/drivers/iommu/intel-svm.c > > > [] > > > > +#define for_each_svm_dev(sdev, svm, d) \ > > > > + list_for_each_entry((sdev), &(svm)->devs, list) > > > > \ > > > > + if ((d) != (sdev)->dev) {} else > > > > + > > > > int intel_svm_bind_mm(struct device *dev, int *pasid, int > > > > flags, struct svm_dev_ops *ops) { > > > > struct intel_iommu *iommu = > > > > intel_svm_device_to_iommu(dev); @@ -274,15 +278,13 @@ int > > > > intel_svm_bind_mm(struct device *dev, int *pasid, int flags, > > > > struct svm_dev_ goto out; } > > > > > > > > - list_for_each_entry(sdev, &svm->devs, > > > > list) { > > > > - if (dev == sdev->dev) { > > > > - if (sdev->ops != ops) { > > > > - ret = -EBUSY; > > > > - goto out; > > > > - } > > > > - sdev->users++; > > > > - goto success; > > > > + for_each_svm_dev(sdev, svm, dev) { > > > > + if (sdev->ops != ops) { > > > > + ret = -EBUSY; > > > > + goto out; > > > > } > > > > + sdev->users++; > > > > + goto success; > > > > } > > > > > > I think this does not read better as this is now a > > > for_each loop that exits the loop on the first match. > > > > > I think one of the benefits is reduced indentation. What do you > > recommend? > > Making the code intelligible for a reader. > > At least add a comment describing why there is only > a single possible match. > > Given the for_each name, it's odd code that only the > first match has an action. > I will add a comment to explain we are trying to find the matching device on the list. Thanks ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 8/8] iommu/vt-d: Misc macro clean up for SVM
On Thu, 21 Nov 2019 13:37:10 -0800 Joe Perches wrote: > On Thu, 2019-11-21 at 13:26 -0800, Jacob Pan wrote: > > Use combined macros for_each_svm_dev() to simplify SVM device > > iteration and error checking. > [] > > diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c > [] > > +#define for_each_svm_dev(sdev, svm, d) \ > > + list_for_each_entry((sdev), &(svm)->devs, list) \ > > + if ((d) != (sdev)->dev) {} else > > + > > int intel_svm_bind_mm(struct device *dev, int *pasid, int flags, > > struct svm_dev_ops *ops) { > > struct intel_iommu *iommu = intel_svm_device_to_iommu(dev); > > @@ -274,15 +278,13 @@ int intel_svm_bind_mm(struct device *dev, int > > *pasid, int flags, struct svm_dev_ goto out; > > } > > > > - list_for_each_entry(sdev, &svm->devs, > > list) { > > - if (dev == sdev->dev) { > > - if (sdev->ops != ops) { > > - ret = -EBUSY; > > - goto out; > > - } > > - sdev->users++; > > - goto success; > > + for_each_svm_dev(sdev, svm, dev) { > > + if (sdev->ops != ops) { > > + ret = -EBUSY; > > + goto out; > > } > > + sdev->users++; > > + goto success; > > } > > I think this does not read better as this is now a > for_each loop that exits the loop on the first match. > I think one of the benefits is reduced indentation. What do you recommend? > > > > break; > > @@ -427,43 +429,36 @@ int intel_svm_unbind_mm(struct device *dev, > > int pasid) goto out; > > } > > > > - if (!svm) > > - goto out; > > - > > - list_for_each_entry(sdev, &svm->devs, list) { > [] > > + for_each_svm_dev(sdev, svm, dev) { > > I think this should not remove the !svm test above. > Yeah, !svm test should have been part of 6/8. I will fix that. Thanks, Jacob > [Jacob Pan] ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 2/3] PCI: Add parameter nr_devfns to pci_add_dma_alias
On Fri, Nov 29, 2019 at 05:56:55PM +, James Sewart wrote: > pci_add_dma_alias can now be used to create a dma alias for a range of > devfns. > > Signed-off-by: James Sewart > --- > drivers/pci/pci.c| 23 ++- > drivers/pci/quirks.c | 14 +++--- > include/linux/pci.h | 2 +- > 3 files changed, 26 insertions(+), 13 deletions(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 0a4449a30ace..f9800a610ca1 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -5857,7 +5857,8 @@ int pci_set_vga_state(struct pci_dev *dev, bool decode, > /** > * pci_add_dma_alias - Add a DMA devfn alias for a device > * @dev: the PCI device for which alias is added > - * @devfn: alias slot and function > + * @devfn_from: alias slot and function > + * @nr_devfns: Number of subsequent devfns to alias > * > * This helper encodes an 8-bit devfn as a bit number in dma_alias_mask > * which is used to program permissible bus-devfn source addresses for DMA > @@ -5873,8 +5874,14 @@ int pci_set_vga_state(struct pci_dev *dev, bool decode, > * cannot be left as a userspace activity). DMA aliases should therefore > * be configured via quirks, such as the PCI fixup header quirk. > */ > -void pci_add_dma_alias(struct pci_dev *dev, u8 devfn) > +void pci_add_dma_alias(struct pci_dev *dev, u8 devfn_from, unsigned > nr_devfns) > { > + int devfn_to; > + > + if (nr_devfns > U8_MAX+1) > + nr_devfns = U8_MAX+1; Missing whitespaces here as well. Also this could use max() and I think you want a documented constants for MAX_NR_DEVFNS that documents this "not off by one". ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 1/3] PCI: Fix off by one in dma_alias_mask allocation size
On Fri, Nov 29, 2019 at 05:56:19PM +, James Sewart wrote: > The number of possible devfns is 256 so the size of the bitmap for > allocations should be U8_MAX+1. > > Signed-off-by: James Sewart > --- > drivers/pci/pci.c| 2 +- > drivers/pci/search.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index a97e2571a527..0a4449a30ace 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -5876,7 +5876,7 @@ int pci_set_vga_state(struct pci_dev *dev, bool decode, > void pci_add_dma_alias(struct pci_dev *dev, u8 devfn) > { > if (!dev->dma_alias_mask) > - dev->dma_alias_mask = bitmap_zalloc(U8_MAX, GFP_KERNEL); > + dev->dma_alias_mask = bitmap_zalloc(U8_MAX+1, GFP_KERNEL); Missing whitespaces around the "+". > - for_each_set_bit(devfn, pdev->dma_alias_mask, U8_MAX) { > + for_each_set_bit(devfn, pdev->dma_alias_mask, U8_MAX+1) { Same here.
Re: [PATCH v2] iommu/amd: Disable IOMMU on Stoney Ridge systems
On Fri, Nov 29, 2019 at 10:21:54PM +0800, Kai-Heng Feng wrote: > Serious screen flickering when Stoney Ridge outputs to a 4K monitor. > > According to Alex Deucher, IOMMU isn't enabled on Windows, so let's do > the same here to avoid screen flickering on 4K monitor. Disabling the IOMMU entirely seems pretty severe. Isn't it enough to identity map the GPU device?
Re: [Patch v2 2/3] iommu: optimize iova_magazine_free_pfns()
> + return (mag && mag->size == IOVA_MAG_SIZE); > + return (!mag || mag->size == 0); No need for the braces in both cases.
Re: [Patch v2 1/3] iommu: match the original algorithm
I think a subject line should better describe what you change, not that it matches an original algorithm. The fact that the fix matches the original algorithm can go somewhere in the commit log, preferably with a reference to the actual paper.
RE: [PATCH v2] iommu/amd: Disable IOMMU on Stoney Ridge systems
> -Original Message- > From: Lucas Stach > Sent: Sunday, December 1, 2019 7:43 AM > To: Kai-Heng Feng ; j...@8bytes.org > Cc: Deucher, Alexander ; > iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH v2] iommu/amd: Disable IOMMU on Stoney Ridge > systems > > Am Freitag, den 29.11.2019, 22:21 +0800 schrieb Kai-Heng Feng: > > Serious screen flickering when Stoney Ridge outputs to a 4K monitor. > > > > According to Alex Deucher, IOMMU isn't enabled on Windows, so let's do > > the same here to avoid screen flickering on 4K monitor. > > This doesn't seem like a good solution, especially if there isn't a method for > the user to opt-out. Some users might prefer having the IOMMU support to > 4K display output. > > But before using the big hammer of disabling or breaking one of those > features, we should take a look at what's the issue here. Screen flickering > caused by the IOMMU being active hints to the IOMMU not being able to > sustain the translation bandwidth required by the high- bandwidth > isochronous transfers caused by 4K scanout, most likely due to insufficient > TLB space. > > As far as I know the framebuffer memory for the display buffers is located in > stolen RAM, and thus contigous in memory. I don't know the details of the > GPU integration on those APUs, but maybe there even is a way to bypass the > IOMMU for the stolen VRAM regions? > > If there isn't and all GPU traffic passes through the IOMMU when active, we > should check if the stolen RAM is mapped with hugepages on the IOMMU > side. All the stolen RAM can most likely be mapped with a few hugepage > mappings, which should reduce IOMMU TLB demand by a large margin. The is no issue when we scan out of the carve out region. The issue occurs when we scan out of regular system memory (scatter/gather). Many newer laptops have very small carve out regions (e.g., 32 MB), so we have to use regular system pages to support multiple high resolution displays. 
The problem is, the latency gets too high at some point when the IOMMU is involved. Huge pages would probably help in this case, but I'm not sure if there is any way to guarantee that we get huge pages for system memory. I guess we could use CMA or something like that. Alex > > Regards, > Lucas > > > Cc: Alex Deucher > > Bug: https://gitlab.freedesktop.org/drm/amd/issues/961 > > Signed-off-by: Kai-Heng Feng > > --- > > v2: > > - Find Stoney graphics instead of host bridge. > > > > drivers/iommu/amd_iommu_init.c | 13 - > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/iommu/amd_iommu_init.c > > b/drivers/iommu/amd_iommu_init.c index 568c52317757..139aa6fdadda > > 100644 > > --- a/drivers/iommu/amd_iommu_init.c > > +++ b/drivers/iommu/amd_iommu_init.c > > @@ -2516,6 +2516,7 @@ static int __init early_amd_iommu_init(void) > > struct acpi_table_header *ivrs_base; > > acpi_status status; > > int i, remap_cache_sz, ret = 0; > > + u32 pci_id; > > > > if (!amd_iommu_detected) > > return -ENODEV; > > @@ -2603,6 +2604,16 @@ static int __init early_amd_iommu_init(void) > > if (ret) > > goto out; > > > > + /* Disable IOMMU if there's Stoney Ridge graphics */ > > + for (i = 0; i < 32; i++) { > > + pci_id = read_pci_config(0, i, 0, 0); > > + if ((pci_id & 0xffff) == 0x1002 && (pci_id >> 16) == 0x98e4) { > > + pr_info("Disable IOMMU on Stoney Ridge\n"); > > + amd_iommu_disabled = true; > > + break; > > + } > > + } > > + > > /* Disable any previously enabled IOMMUs */ > > if (!is_kdump_kernel() || amd_iommu_disabled) > > disable_iommus(); > > @@ -2711,7 +2722,7 @@ static int __init state_next(void) > > ret = early_amd_iommu_init(); > > init_state =
ret ? IOMMU_INIT_ERROR : > IOMMU_ACPI_FINISHED; > > if (init_state == IOMMU_ACPI_FINISHED && > amd_iommu_disabled) { > > - pr_info("AMD IOMMU disabled on kernel command- > line\n"); > > + pr_info("AMD IOMMU disabled\n"); > > init_state = IOMMU_CMDLINE_DISABLED; > > ret = -EINVAL; > > } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: dmar pte read access not set error messages on hp dl388 gen8 systems
On Mon Dec 02 19, Jerry Snitselaar wrote: On Mon Dec 02 19, Lu Baolu wrote: Hi, On 12/2/19 2:34 PM, Jerry Snitselaar wrote: We are seeing DMAR PTE read access not set errors when booting a kernel with default passthrough, both with a test kernel and with a 5.4.0 kernel. Previously we would see a number of identity mappings being set related to the rmrrs, and now they aren't seen and we get the dmar pte errors as devices touch those regions. From what I can tell currently df4f3c603aeb ("iommu/vt-d: Remove static identity map code") removed the bit of code in init_dmars that used to set up those mappings: - /* - * For each rmrr - * for each dev attached to rmrr - * do - * locate drhd for dev, alloc domain for dev - * allocate free domain - * allocate page table entries for rmrr - * if context not allocated for bus - * allocate and init context - * set present in root table for this bus - * init context with domain, translation etc - * endfor - * endfor - */ - pr_info("Setting RMRR:\n"); - for_each_rmrr_units(rmrr) { - /* some BIOS lists non-exist devices in DMAR table. */ - for_each_active_dev_scope(rmrr->devices, rmrr->devices_cnt, - i, dev) { - ret = iommu_prepare_rmrr_dev(rmrr, dev); - if (ret) - pr_err("Mapping reserved region failed\n"); - } - } si_domain_init now has code that sets identity maps for devices in rmrrs, but only for certain devices. On which device, are you seeing this error? Is it a rmrr locked device? Best regards, baolu Almost all of the messages are for the ilo, but there also is a message for the smart array raid bus controller. Also seeing it with a dl380 gen9 system, where the raid bus controller is getting the error. With iommu=nopt, the system boots up without issue. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v3 0/4] dma-mapping: introduce new dma unmap and sync variants
> -Original Message- > From: Christoph Hellwig > To: David Miller > Subject: Re: [PATCH v3 0/4] dma-mapping: introduce new dma unmap and sync > variants > > On Wed, Nov 13, 2019 at 12:11:32PM -0800, David Miller wrote: > > > This series introduces a few new dma unmap and sync api variants > that, > > > on top of what the originals do, return the virtual address > > > corresponding to the input dma address. In order to do that a new dma > > > map op is added, .get_virt_addr that takes the input dma address and > > > returns the virtual address backing it up. > > > The second patch adds an implementation for this new dma map op in > the > > > generic iommu dma glue code and wires it in. > > > The third patch updates the dpaa2-eth driver to use the new apis. > > > > The driver should store the mapping in it's private software state if > > it needs this kind of conversion. > > I think the problem for this driver (and a few others) is that they > somehow manage to split I/O completions at a different boundary > than submissions. For me with my block I/O background this seems > weird, but I'll need networking folks to double check the theory. > > > This is huge precendence for this, and there is therefore no need to > > add even more complication and methods and burdon to architecture code > > for the sake of this. > > Unfortunately there are various drivers that abuse iommu_iova_to_phys > to get a struct page to unmap. Two of theose are network drivers > that went in through you (dpaa2 and thunder), additonally the > caam crypto driver (which I think is the same SOC family as dpaa, > but I'm not sure) and the AMD GPU driver. > > We also have drivers that just don't unmap and this don't work with > iommus or dma-debug (IBM EMAC another net driver). > > That being said I hate these new API, but I still think they are less > bad than these IOMMU API abuses people do now. If experienced > networking folks know a way to get rid of both I'm all for it. 
Hi, will this API be included during the v5.5 kernel development cycle, or is there an alternative solution?

Thank you,
Madalin

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [Patch v2 1/3] iommu: match the original algorithm
On 30/11/2019 05:58, Cong Wang wrote:
> On Fri, Nov 29, 2019 at 6:43 AM John Garry wrote:
> > On 29/11/2019 00:48, Cong Wang wrote:
> > > The IOVA cache algorithm implemented in IOMMU code does not exactly
> > > match the original algorithm described in the paper.
> >
> > which paper?
>
> It's in drivers/iommu/iova.c, from line 769:
>
>  769 /*
>  770  * Magazine caches for IOVA ranges. For an introduction to magazines,
>  771  * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab
>  772  * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams.
>  773  * For simplicity, we use a static magazine size and don't implement the
>  774  * dynamic size tuning described in the paper.
>  775  */
>
> > > Particularly, it doesn't need to free the loaded empty magazine when
> > > trying to put it back to the global depot. To make it work, we have to
> > > pre-allocate magazines in the depot and only recycle them when all of
> > > them are full. Before this patch, rcache->depot[] contains either full
> > > or freed entries; after this patch, it contains either full or empty
> > > (but allocated) entries.
> >
> > I *quickly* tested this patch and got a small performance gain.
>
> Thanks for testing! It requires a different workload to see a bigger gain;
> in our case, 24 memcache parallel servers with 120 clients.

So in fact I was getting a ~10% throughput boost for my storage test. Seems more than I would expect/hope for. I would need to test more.
> > > Cc: Joerg Roedel
> > > Signed-off-by: Cong Wang
> > > ---
> > >  drivers/iommu/iova.c | 45 ++++++++++++++++++++++++++++-----------------
> > >  1 file changed, 28 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> > > index 41c605b0058f..cb473ddce4cf 100644
> > > --- a/drivers/iommu/iova.c
> > > +++ b/drivers/iommu/iova.c
> > > @@ -862,12 +862,16 @@ static void init_iova_rcaches(struct iova_domain *iovad)
> > >  	struct iova_cpu_rcache *cpu_rcache;
> > >  	struct iova_rcache *rcache;
> > >  	unsigned int cpu;
> > > -	int i;
> > > +	int i, j;
> > >
> > >  	for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
> > >  		rcache = &iovad->rcaches[i];
> > >  		spin_lock_init(&rcache->lock);
> > >  		rcache->depot_size = 0;
> > > +		for (j = 0; j < MAX_GLOBAL_MAGS; ++j) {
> > > +			rcache->depot[j] = iova_magazine_alloc(GFP_KERNEL);
> > > +			WARN_ON(!rcache->depot[j]);
> > > +		}
> > >  		rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache),
> > >  						     cache_line_size());
> > >  		if (WARN_ON(!rcache->cpu_rcaches))
> > >  			continue;
> > > @@ -900,24 +904,30 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
> > >
> > >  	if (!iova_magazine_full(cpu_rcache->loaded)) {
> > >  		can_insert = true;
> > > -	} else if (!iova_magazine_full(cpu_rcache->prev)) {
> > > +	} else if (iova_magazine_empty(cpu_rcache->prev)) {
> >
> > is this change strictly necessary?
>
> Yes, because it is what is described in the paper. But it should be
> functionally the same, because cpu_rcache->prev is either full or empty.

That was what I was getting at.

> > >  		swap(cpu_rcache->prev, cpu_rcache->loaded);
> > >  		can_insert = true;
> > >  	} else {
> > > -		struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);

Apart from this change, did anyone ever consider a kmem cache for the magazines?
> > > +		spin_lock(&rcache->lock);
> > > +		if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> > > +			swap(rcache->depot[rcache->depot_size], cpu_rcache->prev);
> > > +			swap(cpu_rcache->prev, cpu_rcache->loaded);
> > > +			rcache->depot_size++;
> > > +			can_insert = true;
> > > +		} else {
> > > +			mag_to_free = cpu_rcache->loaded;
> > > +		}
> > > +		spin_unlock(&rcache->lock);
> > > +
> > > +		if (mag_to_free) {
> > > +			struct iova_magazine *new_mag = iova_magazine_alloc(GFP_ATOMIC);
> > >
> > > -		if (new_mag) {
> > > -			spin_lock(&rcache->lock);
> > > -			if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> > > -				rcache->depot[rcache->depot_size++] =
> > > -					cpu_rcache->loaded;
> > > +			if (new_mag) {
> > > +				cpu_rcache->loaded = new_mag;
> > > +				can_insert = true;
> > >  			} else {
> > > -				mag_to_free = cpu_rcache->loaded;
> > > +				mag_to_free = NULL;
> > >  			}
> > > -			spin_unlock(&rcache->lock);
> > > -
> > > -			cpu_rcache->loaded = new_mag;
> > > -			can_insert = true;
> > >  		}
> > >  	}
> > >
> > > @@ -963,14 +973,15 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
> > >
> > >  	if (!iova_magazine_empty(cpu_rcache->loaded)) {
> > >  		has_pfn = true;
> > > -	} else if (!iova_magazine_empty(cpu_rcache->prev)) {
> > > +	} else if (iova_magazine_full(cpu_rcache->prev)) {
> > >  		swap(cpu_rcache->prev, cpu_rcache
Re: [Patch v2 2/3] iommu: optimize iova_magazine_free_pfns()
On 30/11/2019 06:02, Cong Wang wrote:
> On Fri, Nov 29, 2019 at 5:24 AM John Garry wrote:
> > On 29/11/2019 00:48, Cong Wang wrote:
> > > If the maganize is empty, iova_magazine_free_pfns() should
> >
> > magazine
>
> Good catch!
>
> > > be a nop, however it misses the case of mag->size == 0. So we should
> > > just call iova_magazine_empty(). This should reduce the contention on
> > > iovad->iova_rbtree_lock a little bit.
> > >
> > > Cc: Joerg Roedel
> > > Signed-off-by: Cong Wang
> > > ---
> > >  drivers/iommu/iova.c | 22 +++++++++++-----------
> > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> > > index cb473ddce4cf..184d4c0e20b5 100644
> > > --- a/drivers/iommu/iova.c
> > > +++ b/drivers/iommu/iova.c
> > > @@ -797,13 +797,23 @@ static void iova_magazine_free(struct iova_magazine *mag)
> > >  	kfree(mag);
> > >  }
> > >
> > > +static bool iova_magazine_full(struct iova_magazine *mag)
> > > +{
> > > +	return (mag && mag->size == IOVA_MAG_SIZE);
> > > +}
> > > +
> > > +static bool iova_magazine_empty(struct iova_magazine *mag)
> > > +{
> > > +	return (!mag || mag->size == 0);
> > > +}
> > > +
> > >  static void iova_magazine_free_pfns(struct iova_magazine *mag,
> > >  				    struct iova_domain *iovad)
> > >  {
> > >  	unsigned long flags;
> > >  	int i;
> > >
> > > -	if (!mag)
> > > +	if (iova_magazine_empty(mag))
> >
> > The only hot path where we make this call is
> > __iova_rcache_insert()->iova_magazine_free_pfns(mag_to_free), and
> > mag_to_free is full in this case, so I am not sure how the additional
> > check helps, right?
>
> This is what I mean by "a little bit" in the changelog, did you miss it
> or misunderstand it? :)

I was concerned that in the fastpath we actually make things very marginally slower by adding a check which will fail.

Thanks,
John