[PATCH 1/1] iommu/vt-d: Correctly check addr alignment in qi_flush_dev_iotlb_pasid()

2021-01-18 Thread Lu Baolu
An incorrect address mask is being used in qi_flush_dev_iotlb_pasid()
to check the address alignment. This leads to a lot of spurious kernel
warnings:

[  485.837093] DMAR: Invalidate non-aligned address 7f76f47f9000, order 0
[  485.837098] DMAR: Invalidate non-aligned address 7f76f47f9000, order 0
[  492.494145] qi_flush_dev_iotlb_pasid: 5734 callbacks suppressed
[  492.494147] DMAR: Invalidate non-aligned address 7f772880, order 11
[  492.508965] DMAR: Invalidate non-aligned address 7f772880, order 11

Fix it by checking the alignment in the right way.
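For illustration only (not part of the patch), a minimal sketch of why the
old mask is one bit too wide; GENMASK_ULL(), IS_ALIGNED(), VTD_PAGE_SHIFT
and VTD_PAGE_SIZE are the kernel's own helpers, the two wrapper functions
are made up for the comparison:

/*
 * The old check covers bits [0, size_order + 12], one more bit than the
 * alignment actually requires, so a well-aligned address with bit
 * (size_order + 12) set is flagged as "non-aligned".
 */
static bool old_check(u64 addr, int size_order)
{
	return addr & GENMASK_ULL(size_order + VTD_PAGE_SHIFT, 0);
}

static bool new_check(u64 addr, int size_order)
{
	return !IS_ALIGNED(addr, VTD_PAGE_SIZE << size_order);
}

/*
 * For the first warning above (addr = 0x7f76f47f9000, size_order = 0):
 * old_check() returns true because bit 12 of the address is set, even
 * though the address is 4KB aligned; new_check() correctly returns false.
 */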

Fixes: 288d08e780088 ("iommu/vt-d: Handle non-page aligned address")
Reported-and-tested-by: Guo Kaijie 
Signed-off-by: Lu Baolu 
Cc: Liu Yi L 
---
 drivers/iommu/intel/dmar.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 004feaed3c72..02e7c10a4224 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1496,7 +1496,7 @@ void qi_flush_dev_iotlb_pasid(struct intel_iommu *iommu, u16 sid, u16 pfsid,
 * Max Invs Pending (MIP) is set to 0 for now until we have DIT in
 * ECAP.
 */
-   if (addr & GENMASK_ULL(size_order + VTD_PAGE_SHIFT, 0))
+   if (!IS_ALIGNED(addr, VTD_PAGE_SIZE << size_order))
	pr_warn_ratelimited("Invalidate non-aligned address %llx, order %d\n",
			    addr, size_order);
 
-- 
2.25.1



Re: [PATCH v4 1/3] iommu/vt-d: Audit IOMMU Capabilities and add helper functions

2021-01-18 Thread Lu Baolu

Hi,

On 1/6/21 9:30 AM, Kyung Min Park wrote:

Audit the IOMMU Capability/Extended Capability registers and check
whether all IOMMUs report consistent values for each feature. When the
IOMMUs disagree, either report the mismatch or scale down to the lowest
supported value; a sketch of both kinds of check follows the lists below.

Report a mismatch when any of the following features differ among the IOMMUs:
   - First Level 5 Level Paging Support (FL5LP)
   - First Level 1 GByte Page Support (FL1GP)
   - Read Draining (DRD)
   - Write Draining (DWD)
   - Page Selective Invalidation (PSI)
   - Zero Length Read (ZLR)
   - Caching Mode (CM)
   - Protected High/Low-Memory Region (PHMR/PLMR)
   - Required Write-Buffer Flushing (RWBF)
   - Advanced Fault Logging (AFL)
   - RID-PASID Support (RPS)
   - Scalable Mode Page Walk Coherency (SMPWC)
   - First Level Translation Support (FLTS)
   - Second Level Translation Support (SLTS)
   - No Write Flag Support (NWFS)
   - Second Level Accessed/Dirty Support (SLADS)
   - Virtual Command Support (VCS)
   - Scalable Mode Translation Support (SMTS)
   - Device TLB Invalidation Throttle (DIT)
   - Page Drain Support (PDS)
   - Process Address Space ID Support (PASID)
   - Extended Accessed Flag Support (EAFS)
   - Supervisor Request Support (SRS)
   - Execute Request Support (ERS)
   - Page Request Support (PRS)
   - Nested Translation Support (NEST)
   - Snoop Control (SC)
   - Pass Through (PT)
   - Device TLB Support (DT)
   - Queued Invalidation (QI)
   - Page walk Coherency (C)

Scale the capability to the lowest supported value when any of the following fields are mismatched:
   - Maximum Address Mask Value (MAMV)
   - Number of Fault Recording Registers (NFR)
   - Second Level Large Page Support (SLLPS)
   - Fault Recording Offset (FRO)
   - Maximum Guest Address Width (MGAW)
   - Supported Adjusted Guest Address Width (SAGAW)
   - Number of Domains supported (NDOMS)
   - Pasid Size Supported (PSS)
   - Maximum Handle Mask Value (MHMV)
   - IOTLB Register Offset (IRO)
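The two kinds of check could be sketched roughly as below. This is
illustrative only; the real macros live in the new cap_audit.h, which is
not shown in this excerpt, and the helper names here are made up:

/*
 * Boolean features: if two IOMMUs disagree, clear the feature from the
 * global "sanity" value and report it, so the feature is not used.
 */
static void audit_report_mismatch(u64 *sanity, u64 cap_a, u64 cap_b,
				  u64 mask, const char *name)
{
	if ((cap_a & mask) != (cap_b & mask)) {
		*sanity &= ~mask;
		pr_info("IOMMU feature %s inconsistent\n", name);
	}
}

/*
 * Numeric fields (MAMV, NFR, SLLPS, ...): keep the smallest value seen
 * so that only the common denominator is advertised.
 */
static void audit_scale_to_min(u64 *sanity, u64 cap_b, u64 mask)
{
	if ((cap_b & mask) < (*sanity & mask))
		*sanity = (*sanity & ~mask) | (cap_b & mask);
}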

Signed-off-by: Kyung Min Park 
---
  drivers/iommu/intel/Makefile|   4 +-
  drivers/iommu/intel/cap_audit.c | 188 
  drivers/iommu/intel/cap_audit.h | 106 
  drivers/iommu/intel/iommu.c |   9 ++
  drivers/iommu/intel/irq_remapping.c |   8 ++
  include/linux/intel-iommu.h |  39 +++---
  6 files changed, 334 insertions(+), 20 deletions(-)
  create mode 100644 drivers/iommu/intel/cap_audit.c
  create mode 100644 drivers/iommu/intel/cap_audit.h

diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
index fb8e1e8c8029..1e576c840175 100644
--- a/drivers/iommu/intel/Makefile
+++ b/drivers/iommu/intel/Makefile
@@ -1,7 +1,7 @@
  # SPDX-License-Identifier: GPL-2.0
  obj-$(CONFIG_DMAR_TABLE) += dmar.o
  obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o
-obj-$(CONFIG_INTEL_IOMMU) += trace.o
+obj-$(CONFIG_INTEL_IOMMU) += trace.o cap_audit.o
  obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
  obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
-obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
+obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o cap_audit.o
diff --git a/drivers/iommu/intel/cap_audit.c b/drivers/iommu/intel/cap_audit.c
new file mode 100644
index ..4a062fc719d2
--- /dev/null
+++ b/drivers/iommu/intel/cap_audit.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * cap_audit.c - audit iommu capabilities for boot time and hot plug
+ *
+ * Copyright (C) 2020 Intel Corporation


Update to 2021. The same applies to the other places.


+ *
+ * Author: Kyung Min Park 
+ * Lu Baolu 
+ */
+
+#define pr_fmt(fmt)		"DMAR: " fmt
+
+#include 
+#include "cap_audit.h"
+
+static u64 intel_iommu_cap_sanity;
+static u64 intel_iommu_ecap_sanity;
+
+static inline void check_irq_capabilities(struct intel_iommu *a,
+ struct intel_iommu *b)
+{
+   CHECK_FEATURE_MISMATCH(a, b, cap, pi_support, CAP_PI_MASK);
+   CHECK_FEATURE_MISMATCH(a, b, ecap, eim_support, ECAP_EIM_MASK);
+}
+
+static inline void check_dmar_capabilities(struct intel_iommu *a,
+  struct intel_iommu *b)
+{
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_MAMV_MASK);
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_NFR_MASK);
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_SLLPS_MASK);
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_FRO_MASK);
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_MGAW_MASK);
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_SAGAW_MASK);
+   MINIMAL_FEATURE_IOMMU(b, cap, CAP_NDOMS_MASK);
+   MINIMAL_FEATURE_IOMMU(b, ecap, ECAP_PSS_MASK);
+   MINIMAL_FEATURE_IOMMU(b, ecap, ECAP_MHMV_MASK);
+   MINIMAL_FEATURE_IOMMU(b, ecap, ECAP_IRO_MASK);
+
+   CHECK_FEATURE_MISMATCH(a, b, cap, fl5lp_support, CAP_FL5LP_MASK);
+   CHECK_FEATURE_MISMATCH(a, b, cap, fl1gp_support, CAP_FL1GP_MASK);
+   CHECK_FEATURE_MISMATCH(a, b, cap, read_drain, CAP_RD_MASK);
+   CHECK_FEATURE_MISMATCH(a, b, cap, write_drain, CAP_WD_MASK);
+   CHECK_FEATURE_MISMATCH(a, b, cap, pgsel_inv, CAP_PSI_MASK);
+   

[PATCH 0/2] Use another method to avoid resource conflicts between the SMMU and PMCG drivers

2021-01-18 Thread Zhen Lei
The PMCG may implement its register space (4KB Page 0 and 4KB Page 1)
within the SMMUv3 64KB Page 0. In this case, when the SMMUv3 driver
reserves the 64KB Page 0 resource in advance and the PMCG driver then
tries to reserve its Page 0 and Page 1 resources, a resource conflict
occurs.

Commit 52f3fab0067d6fa ("iommu/arm-smmu-v3: Don't reserve implementation
defined register space") reduced the resource reservation range of the
SMMUv3 driver so that it only reserves the first 0xe00 bytes of the 64KB
Page 0, to avoid the above-mentioned resource conflict.

But SMMUv3.3 adds support for ECMDQ, whose register space is also
implemented within the SMMUv3 64KB Page 0. This means two separate
mappings need to be built. New features may be added in the future, and
more independent mappings may be required. A simple problem becomes
complicated, while the user really just expects to map the entire SMMUv3
64KB Page 0.

Therefore, the proper solution is: if the PMCG register resources are
located within the SMMU's 64KB Page 0 and the SMMUv3 driver has already
reserved the conflicting resources, the PMCG driver does not reserve
them again. Instead, the PMCG driver only performs devm_ioremap() to
ensure that it can still work properly.


Zhen Lei (2):
  perf/smmuv3: Don't reserve the register space that overlaps with the
SMMUv3
  Revert "iommu/arm-smmu-v3: Don't reserve implementation defined
register space"

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 +++---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 ---
 drivers/perf/arm_smmuv3_pmu.c   | 42 +++--
 3 files changed, 44 insertions(+), 33 deletions(-)

-- 
1.8.3




[PATCH 2/2] Revert "iommu/arm-smmu-v3: Don't reserve implementation defined register space"

2021-01-18 Thread Zhen Lei
This reverts commit 52f3fab0067d6fa9e99c1b7f63265dd48ca76046.

This problem has been fixed by another patch (the preceding one in this
series). The original method had side effects: the mapping did not cover
the user-specified resource size, and the code would become more complex
once ECMDQ is supported later.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 ---
 2 files changed, 4 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d9bf..477f473842e5272 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -91,8 +91,9 @@ struct arm_smmu_option_prop {
 static inline void __iomem *arm_smmu_page1_fixup(unsigned long offset,
 struct arm_smmu_device *smmu)
 {
-   if (offset > SZ_64K)
-   return smmu->page1 + offset - SZ_64K;
+   if ((offset > SZ_64K) &&
+   (smmu->options & ARM_SMMU_OPT_PAGE0_REGS_ONLY))
+   offset -= SZ_64K;
 
return smmu->base + offset;
 }
@@ -3486,18 +3487,6 @@ static int arm_smmu_set_bus_ops(struct iommu_ops *ops)
return err;
 }
 
-static void __iomem *arm_smmu_ioremap(struct device *dev, resource_size_t start,
- resource_size_t size)
-{
-   struct resource res = {
-   .flags = IORESOURCE_MEM,
-   .start = start,
-   .end = start + size - 1,
-   };
-
-   return devm_ioremap_resource(dev, &res);
-}
-
 static int arm_smmu_device_probe(struct platform_device *pdev)
 {
int irq, ret;
@@ -3533,23 +3522,10 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
}
ioaddr = res->start;
 
-   /*
-* Don't map the IMPLEMENTATION DEFINED regions, since they may contain
-* the PMCG registers which are reserved by the PMU driver.
-*/
-   smmu->base = arm_smmu_ioremap(dev, ioaddr, ARM_SMMU_REG_SZ);
+   smmu->base = devm_ioremap_resource(dev, res);
if (IS_ERR(smmu->base))
return PTR_ERR(smmu->base);
 
-   if (arm_smmu_resource_size(smmu) > SZ_64K) {
-   smmu->page1 = arm_smmu_ioremap(dev, ioaddr + SZ_64K,
-  ARM_SMMU_REG_SZ);
-   if (IS_ERR(smmu->page1))
-   return PTR_ERR(smmu->page1);
-   } else {
-   smmu->page1 = smmu->base;
-   }
-
/* Interrupt lines */
 
irq = platform_get_irq_byname_optional(pdev, "combined");
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00282..0c3090c60840c22 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -152,8 +152,6 @@
 #define ARM_SMMU_PRIQ_IRQ_CFG1 0xd8
 #define ARM_SMMU_PRIQ_IRQ_CFG2 0xdc
 
-#define ARM_SMMU_REG_SZ			0xe00
-
 /* Common MSI config fields */
 #define MSI_CFG0_ADDR_MASK GENMASK_ULL(51, 2)
#define MSI_CFG2_SH			GENMASK(5, 4)
@@ -584,7 +582,6 @@ struct arm_smmu_strtab_cfg {
 struct arm_smmu_device {
struct device   *dev;
void __iomem*base;
-   void __iomem*page1;
 
 #define ARM_SMMU_FEAT_2_LVL_STRTAB (1 << 0)
 #define ARM_SMMU_FEAT_2_LVL_CDTAB  (1 << 1)
-- 
1.8.3




[PATCH 1/2] perf/smmuv3: Don't reserve the register space that overlaps with the SMMUv3

2021-01-18 Thread Zhen Lei
Some SMMUv3 implementations embed the Perf Monitor Group Registers (PMCG)
inside the first 64kB region of the SMMU. Since the SMMU and PMCG are
managed by two separate drivers, and this driver depends on ARM_SMMU_V3,
the SMMU driver reserves the corresponding resource first; this driver
should therefore not reserve the same resource again. Otherwise, a
resource reservation conflict is reported during boot.

Signed-off-by: Zhen Lei 
---
 drivers/perf/arm_smmuv3_pmu.c | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index 74474bb322c3f26..dcce085431c6ce8 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -761,6 +761,44 @@ static void smmu_pmu_get_acpi_options(struct smmu_pmu *smmu_pmu)
dev_notice(smmu_pmu->dev, "option mask 0x%x\n", smmu_pmu->options);
 }
 
+static void __iomem *
+smmu_pmu_get_and_ioremap_resource(struct platform_device *pdev,
+ unsigned int index,
+ struct resource **out_res)
+{
+   int ret;
+   void __iomem *base;
+   struct resource *res;
+
+   res = platform_get_resource(pdev, IORESOURCE_MEM, index);
+   if (!res) {
+   dev_err(&pdev->dev, "invalid resource\n");
+   return IOMEM_ERR_PTR(-EINVAL);
+   }
+   if (out_res)
+   *out_res = res;
+
+   ret = region_intersects(res->start, resource_size(res),
+   IORESOURCE_MEM, IORES_DESC_NONE);
+   if (ret == REGION_INTERSECTS) {
+   /*
+* The resource has already been reserved by the SMMUv3 driver.
+* Don't reserve it again, just do devm_ioremap().
+*/
+   base = devm_ioremap(&pdev->dev, res->start, resource_size(res));
+   } else {
+   /*
+* The resource may have not been reserved by any driver, or
+* has been reserved but not type IORESOURCE_MEM. In the latter
+* case, devm_ioremap_resource() reports a conflict and returns
+* IOMEM_ERR_PTR(-EBUSY).
+*/
+   base = devm_ioremap_resource(&pdev->dev, res);
+   }
+
+   return base;
+}
+
 static int smmu_pmu_probe(struct platform_device *pdev)
 {
struct smmu_pmu *smmu_pmu;
@@ -793,7 +831,7 @@ static int smmu_pmu_probe(struct platform_device *pdev)
.capabilities   = PERF_PMU_CAP_NO_EXCLUDE,
};
 
-   smmu_pmu->reg_base = devm_platform_get_and_ioremap_resource(pdev, 0, &res_0);
+   smmu_pmu->reg_base = smmu_pmu_get_and_ioremap_resource(pdev, 0, &res_0);
if (IS_ERR(smmu_pmu->reg_base))
return PTR_ERR(smmu_pmu->reg_base);
 
@@ -801,7 +839,7 @@ static int smmu_pmu_probe(struct platform_device *pdev)
 
/* Determine if page 1 is present */
if (cfgr & SMMU_PMCG_CFGR_RELOC_CTRS) {
-   smmu_pmu->reloc_base = devm_platform_ioremap_resource(pdev, 1);
+   smmu_pmu->reloc_base = smmu_pmu_get_and_ioremap_resource(pdev, 1, NULL);
if (IS_ERR(smmu_pmu->reloc_base))
return PTR_ERR(smmu_pmu->reloc_base);
} else {
-- 
1.8.3




Re: performance regression noted in v5.11-rc after c062db039f40

2021-01-18 Thread Lu Baolu

Hi Chuck,

On 1/19/21 4:09 AM, Chuck Lever wrote:




On Jan 18, 2021, at 1:00 PM, Robin Murphy  wrote:

On 2021-01-18 16:18, Chuck Lever wrote:

On Jan 12, 2021, at 9:38 AM, Will Deacon  wrote:

[Expanding cc list to include DMA-IOMMU and intel IOMMU folks]

On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:

Hi-

[ Please cc: me on replies, I'm not currently subscribed to
iommu@lists ].

I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:

/home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I

For those not familiar with the way storage protocols use RDMA, The
initiator/client sets up memory regions and the target/server uses
RDMA Read and Write to move data out of and into those regions. The
initiator/client uses only RDMA memory registration and invalidation
operations, and the target/server uses RDMA Read and Write.

My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
enabled using the kernel command line options "intel_iommu=on
iommu=strict".

Recently I've noticed a significant (25-30%) loss in NFS throughput.
I was able to bisect on my client to the following commits.

Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg"). This is about normal for this test.

Children see throughput for 12 initial writers  = 4732581.09 kB/sec
Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
Min throughput per process  =  387764.34 kB/sec
Max throughput per process  =  399655.47 kB/sec
Avg throughput per process  =  394381.76 kB/sec
Min xfer= 1017344.00 kB
CPU Utilization: Wall time2.671CPU time1.974CPU 
utilization  73.89 %
Children see throughput for 12 rewriters= 4837741.94 kB/sec
Parent sees throughput for 12 rewriters = 4833509.35 kB/sec
Min throughput per process  =  398983.72 kB/sec
Max throughput per process  =  406199.66 kB/sec
Avg throughput per process  =  403145.16 kB/sec
Min xfer= 1030656.00 kB
CPU utilization: Wall time2.584CPU time1.959CPU 
utilization  75.82 %
Children see throughput for 12 readers  = 5921370.94 kB/sec
Parent sees throughput for 12 readers   = 5914106.69 kB/sec
Min throughput per process  =  491812.38 kB/sec
Max throughput per process  =  494777.28 kB/sec
Avg throughput per process  =  493447.58 kB/sec
Min xfer= 1042688.00 kB
CPU utilization: Wall time2.122CPU time1.968CPU 
utilization  92.75 %
Children see throughput for 12 re-readers   = 5947985.69 kB/sec
Parent sees throughput for 12 re-readers= 5941348.51 kB/sec
Min throughput per process  =  492805.81 kB/sec
Max throughput per process  =  497280.19 kB/sec
Avg throughput per process  =  495665.47 kB/sec
Min xfer= 1039360.00 kB
CPU utilization: Wall time2.111CPU time1.968CPU 
utilization  93.22 %

Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
iommu_ops.at(de)tach_dev"). It's losing some steam here.

Children see throughput for 12 initial writers  = 4342419.12 kB/sec
Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
Min throughput per process  =  359299.06 kB/sec
Max throughput per process  =  363866.16 kB/sec
Avg throughput per process  =  361868.26 kB/sec
Min xfer= 1035520.00 kB
CPU Utilization: Wall time2.902CPU time1.951CPU 
utilization  67.22 %
Children see throughput for 12 rewriters= 4408576.66 kB/sec
Parent sees throughput for 12 rewriters = 4404280.87 kB/sec
Min throughput per process  =  364553.88 kB/sec
Max throughput per process  =  370029.28 kB/sec
Avg throughput per process  =  367381.39 kB/sec
Min xfer= 1033216.00 kB
CPU utilization: Wall time2.836CPU time1.956CPU 
utilization  68.97 %
Children see throughput for 12 readers  = 5406879.47 kB/sec
Parent sees throughput for 12 readers   = 5401862.78 kB/sec
Min throughput per process  =  449583.03 kB/sec
Max throughput per process  =  

Re: [PATCH v2 5/7] drm/msm: Add dependency on io-pgtable-arm format module

2021-01-18 Thread Will Deacon
On Mon, Jan 18, 2021 at 01:16:03PM -0800, Rob Clark wrote:
> On Mon, Dec 21, 2020 at 4:44 PM Isaac J. Manjarres
>  wrote:
> >
> > The MSM DRM driver depends on the availability of the ARM LPAE io-pgtable
> > format code to work properly. In preparation for having the io-pgtable
> > formats as modules, add a "pre" dependency with MODULE_SOFTDEP() to
> > ensure that the io-pgtable-arm format module is loaded before loading
> > the MSM DRM driver module.
> >
> > Signed-off-by: Isaac J. Manjarres 
> 
> Thanks, I've queued this up locally

I don't plan to make the io-pgtable code modular, so please drop this patch.

https://lore.kernel.org/r/20210106123428.GA1798@willie-the-truck

Will


Re: [PATCH v2 5/7] drm/msm: Add dependency on io-pgtable-arm format module

2021-01-18 Thread Rob Clark
On Mon, Dec 21, 2020 at 4:44 PM Isaac J. Manjarres
 wrote:
>
> The MSM DRM driver depends on the availability of the ARM LPAE io-pgtable
> format code to work properly. In preparation for having the io-pgtable
> formats as modules, add a "pre" dependency with MODULE_SOFTDEP() to
> ensure that the io-pgtable-arm format module is loaded before loading
> the MSM DRM driver module.
>
> Signed-off-by: Isaac J. Manjarres 

Thanks, I've queued this up locally

BR,
-R

> ---
>  drivers/gpu/drm/msm/msm_drv.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index 535a026..8be3506 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -1369,3 +1369,4 @@ module_exit(msm_drm_unregister);
>  MODULE_AUTHOR("Rob Clark   MODULE_DESCRIPTION("MSM DRM Driver");
>  MODULE_LICENSE("GPL");
> +MODULE_SOFTDEP("pre: io-pgtable-arm");
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>


Re: performance regression noted in v5.11-rc after c062db039f40

2021-01-18 Thread Chuck Lever



> On Jan 18, 2021, at 1:00 PM, Robin Murphy  wrote:
> 
> On 2021-01-18 16:18, Chuck Lever wrote:
>>> On Jan 12, 2021, at 9:38 AM, Will Deacon  wrote:
>>> 
>>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>>> 
>>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
 Hi-
 
 [ Please cc: me on replies, I'm not currently subscribed to
 iommu@lists ].
 
 I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
 at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
 
 /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
 
 For those not familiar with the way storage protocols use RDMA, The
 initiator/client sets up memory regions and the target/server uses
 RDMA Read and Write to move data out of and into those regions. The
 initiator/client uses only RDMA memory registration and invalidation
 operations, and the target/server uses RDMA Read and Write.
 
 My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
 enabled using the kernel command line options "intel_iommu=on
 iommu=strict".
 
 Recently I've noticed a significant (25-30%) loss in NFS throughput.
 I was able to bisect on my client to the following commits.
 
 Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
 map_sg"). This is about normal for this test.
 
Children see throughput for 12 initial writers  = 4732581.09 kB/sec
Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
Min throughput per process  =  387764.34 kB/sec
Max throughput per process  =  399655.47 kB/sec
Avg throughput per process  =  394381.76 kB/sec
Min xfer= 1017344.00 kB
CPU Utilization: Wall time2.671CPU time1.974CPU 
 utilization  73.89 %
Children see throughput for 12 rewriters= 4837741.94 kB/sec
Parent sees throughput for 12 rewriters = 4833509.35 kB/sec
Min throughput per process  =  398983.72 kB/sec
Max throughput per process  =  406199.66 kB/sec
Avg throughput per process  =  403145.16 kB/sec
Min xfer= 1030656.00 kB
CPU utilization: Wall time2.584CPU time1.959CPU 
 utilization  75.82 %
Children see throughput for 12 readers  = 5921370.94 kB/sec
Parent sees throughput for 12 readers   = 5914106.69 kB/sec
Min throughput per process  =  491812.38 kB/sec
Max throughput per process  =  494777.28 kB/sec
Avg throughput per process  =  493447.58 kB/sec
Min xfer= 1042688.00 kB
CPU utilization: Wall time2.122CPU time1.968CPU 
 utilization  92.75 %
Children see throughput for 12 re-readers   = 5947985.69 kB/sec
Parent sees throughput for 12 re-readers= 5941348.51 kB/sec
Min throughput per process  =  492805.81 kB/sec
Max throughput per process  =  497280.19 kB/sec
Avg throughput per process  =  495665.47 kB/sec
Min xfer= 1039360.00 kB
CPU utilization: Wall time2.111CPU time1.968CPU 
 utilization  93.22 %
 
 Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
 iommu_ops.at(de)tach_dev"). It's losing some steam here.
 
Children see throughput for 12 initial writers  = 4342419.12 kB/sec
Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
Min throughput per process  =  359299.06 kB/sec
Max throughput per process  =  363866.16 kB/sec
Avg throughput per process  =  361868.26 kB/sec
Min xfer= 1035520.00 kB
CPU Utilization: Wall time2.902CPU time1.951CPU 
 utilization  67.22 %
Children see throughput for 12 rewriters= 4408576.66 kB/sec
Parent sees throughput for 12 rewriters = 4404280.87 kB/sec
Min throughput per process  =  364553.88 kB/sec
Max throughput per process  =  370029.28 kB/sec
Avg throughput per process  =  367381.39 kB/sec
Min xfer= 1033216.00 kB
CPU utilization: Wall time2.836CPU time1.956CPU 
 utilization  68.97 %
Children see throughput for 12 readers  = 5406879.47 kB/sec
Parent sees throughput for 12 readers   = 

dma_mmap_coherent() breakage with mem_encrypt on AMD Ryzen

2021-01-18 Thread Takashi Iwai
Hi,

we recently got a bug report about garbage playback sound from a PCI
sound device with mem_encrypt enabled on an AMD Ryzen system:
   https://bugzilla.kernel.org/show_bug.cgi?id=27

The debug session showed that this happens only when the buffer is
mmapped via dma_mmap_coherent() with mem_encrypt active.  The mmap via
the legacy page-fault path (as done in the ALSA PCM core) still seems
to work even with mem_encrypt.
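For reference, a minimal sketch of the failing path. The driver structure
and names below are hypothetical; only dma_alloc_coherent() and
dma_mmap_coherent() are the real DMA-API calls. A coherent buffer is
allocated at probe time and later mapped to user space from the driver's
mmap callback, and it is that mapping which produces the garbage:

/* Hypothetical driver state, for illustration only. */
struct example_dev {
	struct device *dev;
	void *buf_cpu;
	dma_addr_t buf_dma;
	size_t buf_size;
};

/* Allocation at probe time. */
static int example_alloc(struct example_dev *ed)
{
	ed->buf_cpu = dma_alloc_coherent(ed->dev, ed->buf_size,
					 &ed->buf_dma, GFP_KERNEL);
	return ed->buf_cpu ? 0 : -ENOMEM;
}

/* The mmap path that reportedly breaks with mem_encrypt. */
static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct example_dev *ed = file->private_data;

	return dma_mmap_coherent(ed->dev, vma, ed->buf_cpu, ed->buf_dma,
				 ed->buf_size);
}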

Since the problem could be observed with two different PCI drivers, it
looks like a generic problem of the DMA-mmap implementation with AMD
memory encryption.

According to the reporter, it's seen on both 5.4.x and 5.10.x kernels,
hence it doesn't look like a recent regression.

Can anyone take a look?


Thanks!

Takashi


Re: [PATCH v4 7/7] iommu/mediatek: Remove the tlb-ops for v7s

2021-01-18 Thread Robin Murphy

On 2021-01-07 12:29, Yong Wu wrote:

Now that we already use the tlb operations from the iommu framework,
the tlb operations for v7s can be removed.

Correspondingly, switch the parameter "cookie" to the internal structure.


Reviewed-by: Robin Murphy 


Signed-off-by: Yong Wu 
---
  drivers/iommu/mtk_iommu.c | 27 ---
  1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index d3b8a1649093..86ab577c9520 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -182,10 +182,8 @@ static struct mtk_iommu_domain *to_mtk_domain(struct 
iommu_domain *dom)
return container_of(dom, struct mtk_iommu_domain, domain);
  }
  
-static void mtk_iommu_tlb_flush_all(void *cookie)

+static void mtk_iommu_tlb_flush_all(struct mtk_iommu_data *data)
  {
-   struct mtk_iommu_data *data = cookie;
-
for_each_m4u(data) {
writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
   data->base + data->plat_data->inv_sel_reg);
@@ -195,9 +193,9 @@ static void mtk_iommu_tlb_flush_all(void *cookie)
  }
  
  static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,

-  size_t granule, void *cookie)
+  size_t granule,
+  struct mtk_iommu_data *data)
  {
-   struct mtk_iommu_data *data = cookie;
unsigned long flags;
int ret;
u32 tmp;
@@ -219,7 +217,7 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long 
iova, size_t size,
if (ret) {
dev_warn(data->dev,
 "Partial TLB flush timed out, falling back to full 
flush\n");
-   mtk_iommu_tlb_flush_all(cookie);
+   mtk_iommu_tlb_flush_all(data);
}
/* Clear the CPE status */
writel_relaxed(0, data->base + REG_MMU_CPE_DONE);
@@ -227,22 +225,6 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long 
iova, size_t size,
}
  }
  
-static void mtk_iommu_tlb_flush_page_nosync(struct iommu_iotlb_gather *gather,

-   unsigned long iova, size_t granule,
-   void *cookie)
-{
-   struct mtk_iommu_data *data = cookie;
-   struct iommu_domain *domain = &data->m4u_dom->domain;
-
-   iommu_iotlb_gather_add_page(domain, gather, iova, granule);
-}
-
-static const struct iommu_flush_ops mtk_iommu_flush_ops = {
-   .tlb_flush_all = mtk_iommu_tlb_flush_all,
-   .tlb_flush_walk = mtk_iommu_tlb_flush_range_sync,
-   .tlb_add_page = mtk_iommu_tlb_flush_page_nosync,
-};
-
  static irqreturn_t mtk_iommu_isr(int irq, void *dev_id)
  {
struct mtk_iommu_data *data = dev_id;
@@ -326,7 +308,6 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom)
.pgsize_bitmap = mtk_iommu_ops.pgsize_bitmap,
.ias = 32,
.oas = 34,
-   .tlb = &mtk_iommu_flush_ops,
.iommu_dev = data->dev,
};
  




Re: [PATCH v4 6/7] iommu/mediatek: Gather iova in iommu_unmap to achieve tlb sync once

2021-01-18 Thread Robin Murphy

On 2021-01-07 12:29, Yong Wu wrote:

In current iommu_unmap, this code is:

iommu_iotlb_gather_init(&iotlb_gather);
ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
iommu_iotlb_sync(domain, &iotlb_gather);

We could gather the whole iova range in __iommu_unmap, and then do the
tlb synchronization once in iommu_iotlb_sync.

This patch implements that: gather the range in mtk_iommu_unmap, then
let iommu_iotlb_sync do the tlb synchronization for the gathered iova
range. We don't call iommu_iotlb_gather_add_page since our tlb
synchronization does not depend on the granule size.

In this way, gather->start can no longer be ULONG_MAX, so remove that
check.

This patch aims to do tlb synchronization *once* in the iommu_unmap.


Reviewed-by: Robin Murphy 


Signed-off-by: Yong Wu 
---
  drivers/iommu/mtk_iommu.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 66a00a2cb445..d3b8a1649093 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -430,7 +430,12 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
  {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
+   unsigned long end = iova + size - 1;
  
+	if (gather->start > iova)

+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
return dom->iop->unmap(dom->iop, iova, size, gather);
  }
  
@@ -445,9 +450,6 @@ static void mtk_iommu_iotlb_sync(struct iommu_domain *domain,

struct mtk_iommu_data *data = mtk_iommu_get_m4u_data();
size_t length = gather->end - gather->start + 1;
  
-	if (gather->start == ULONG_MAX)

-   return;
-
mtk_iommu_tlb_flush_range_sync(gather->start, length, gather->pgsize,
   data);
  }




Re: [PATCH v4 5/7] iommu/io-pgtable: Allow io_pgtable_tlb ops optional

2021-01-18 Thread Robin Murphy

On 2021-01-07 12:29, Yong Wu wrote:

This patch allows the io_pgtable_tlb ops to be NULL, since the IOMMU
drivers may use the tlb ops from the iommu framework instead.


For the reasons I gave on v3,

Reviewed-by: Robin Murphy 


Signed-off-by: Yong Wu 
---
  include/linux/io-pgtable.h | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index ea727eb1a1a9..2a5686ca2ba3 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -214,14 +214,16 @@ struct io_pgtable_domain_attr {
  
  static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)

  {
-   iop->cfg.tlb->tlb_flush_all(iop->cookie);
+   if (iop->cfg.tlb && iop->cfg.tlb->tlb_flush_all)
+   iop->cfg.tlb->tlb_flush_all(iop->cookie);
  }
  
  static inline void

  io_pgtable_tlb_flush_walk(struct io_pgtable *iop, unsigned long iova,
  size_t size, size_t granule)
  {
-   iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
+   if (iop->cfg.tlb && iop->cfg.tlb->tlb_flush_walk)
+   iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
  }
  
  static inline void

@@ -229,7 +231,7 @@ io_pgtable_tlb_add_page(struct io_pgtable *iop,
struct iommu_iotlb_gather * gather, unsigned long iova,
size_t granule)
  {
-   if (iop->cfg.tlb->tlb_add_page)
+   if (iop->cfg.tlb && iop->cfg.tlb->tlb_add_page)
iop->cfg.tlb->tlb_add_page(gather, iova, granule, iop->cookie);
  }
  




Re: [PATCH v4 4/7] iommu: Switch gather->end to the inclusive end

2021-01-18 Thread Robin Murphy

On 2021-01-07 12:29, Yong Wu wrote:

Currently gather->end is an "unsigned long", which may overflow on
32-bit architectures in the corner case 0xfff00000 + 0x100000
(iova + size). Although this doesn't affect the size (end - start), it
breaks the check "gather->end < end".

This patch changes this "end" to the real, inclusive end address
(end = start + size - 1). Correspondingly, the length is updated to
"end - start + 1".

Fixes: a7d20dc19d9e ("iommu: Introduce struct iommu_iotlb_gather for batching TLB flushes")
Signed-off-by: Yong Wu 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
  drivers/iommu/mtk_iommu.c   | 2 +-
  drivers/iommu/tegra-gart.c  | 2 +-
  include/linux/iommu.h   | 4 ++--
  4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..c70d6e79f534 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2280,7 +2280,7 @@ static void arm_smmu_iotlb_sync(struct iommu_domain 
*domain,
  {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
  
-	arm_smmu_tlb_inv_range(gather->start, gather->end - gather->start,

+   arm_smmu_tlb_inv_range(gather->start, gather->end - gather->start + 1,
   gather->pgsize, true, smmu_domain);
  }
  
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c

index f579fc21971e..66a00a2cb445 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -443,7 +443,7 @@ static void mtk_iommu_iotlb_sync(struct iommu_domain 
*domain,
 struct iommu_iotlb_gather *gather)
  {
struct mtk_iommu_data *data = mtk_iommu_get_m4u_data();
-   size_t length = gather->end - gather->start;
+   size_t length = gather->end - gather->start + 1;
  
  	if (gather->start == ULONG_MAX)

return;
diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index 05e8e19b8269..6f130e51f072 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -270,7 +270,7 @@ static void gart_iommu_sync_map(struct iommu_domain 
*domain, unsigned long iova,
  static void gart_iommu_sync(struct iommu_domain *domain,
struct iommu_iotlb_gather *gather)
  {
-   size_t length = gather->end - gather->start;
+   size_t length = gather->end - gather->start + 1;


TBH I don't think there's any need to bother doing precise calculations 
on effectively-uninitialised data (this driver doesn't even do 
address-based invalidation, let alone use the gather mechanism). In fact 
it might make sense to flip things around and define gart_iommu_sync_map 
in terms of gart_iommu_sync now just so there's one less unused argument 
to make up. However we can always do cleanup on top, and right now I'm 
more interested in getting these changes landed, so either way,


Reviewed-by: Robin Murphy 


gart_iommu_sync_map(domain, gather->start, length);
  }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9ce0aa9e236b..ae8eddd4621b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -170,7 +170,7 @@ enum iommu_dev_features {
   * struct iommu_iotlb_gather - Range information for a pending IOTLB flush
   *
   * @start: IOVA representing the start of the range to be flushed
- * @end: IOVA representing the end of the range to be flushed (exclusive)
+ * @end: IOVA representing the end of the range to be flushed (inclusive)
   * @pgsize: The interval at which to perform the flush
   *
   * This structure is intended to be updated by multiple calls to the
@@ -539,7 +539,7 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
  {
-   unsigned long start = iova, end = start + size;
+   unsigned long start = iova, end = start + size - 1;
  
  	/*

 * If the new page is disjoint from the current range or is mapped at




Re: performance regression noted in v5.11-rc after c062db039f40

2021-01-18 Thread Robin Murphy

On 2021-01-18 16:18, Chuck Lever wrote:




On Jan 12, 2021, at 9:38 AM, Will Deacon  wrote:

[Expanding cc list to include DMA-IOMMU and intel IOMMU folks]

On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:

Hi-

[ Please cc: me on replies, I'm not currently subscribed to
iommu@lists ].

I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:

/home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I

For those not familiar with the way storage protocols use RDMA, The
initiator/client sets up memory regions and the target/server uses
RDMA Read and Write to move data out of and into those regions. The
initiator/client uses only RDMA memory registration and invalidation
operations, and the target/server uses RDMA Read and Write.

My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
enabled using the kernel command line options "intel_iommu=on
iommu=strict".

Recently I've noticed a significant (25-30%) loss in NFS throughput.
I was able to bisect on my client to the following commits.

Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
map_sg"). This is about normal for this test.

Children see throughput for 12 initial writers  = 4732581.09 kB/sec
Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
Min throughput per process  =  387764.34 kB/sec
Max throughput per process  =  399655.47 kB/sec
Avg throughput per process  =  394381.76 kB/sec
Min xfer= 1017344.00 kB
CPU Utilization: Wall time2.671CPU time1.974CPU 
utilization  73.89 %
Children see throughput for 12 rewriters= 4837741.94 kB/sec
Parent sees throughput for 12 rewriters = 4833509.35 kB/sec
Min throughput per process  =  398983.72 kB/sec
Max throughput per process  =  406199.66 kB/sec
Avg throughput per process  =  403145.16 kB/sec
Min xfer= 1030656.00 kB
CPU utilization: Wall time2.584CPU time1.959CPU 
utilization  75.82 %
Children see throughput for 12 readers  = 5921370.94 kB/sec
Parent sees throughput for 12 readers   = 5914106.69 kB/sec
Min throughput per process  =  491812.38 kB/sec
Max throughput per process  =  494777.28 kB/sec
Avg throughput per process  =  493447.58 kB/sec
Min xfer= 1042688.00 kB
CPU utilization: Wall time2.122CPU time1.968CPU 
utilization  92.75 %
Children see throughput for 12 re-readers   = 5947985.69 kB/sec
Parent sees throughput for 12 re-readers= 5941348.51 kB/sec
Min throughput per process  =  492805.81 kB/sec
Max throughput per process  =  497280.19 kB/sec
Avg throughput per process  =  495665.47 kB/sec
Min xfer= 1039360.00 kB
CPU utilization: Wall time2.111CPU time1.968CPU 
utilization  93.22 %

Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
iommu_ops.at(de)tach_dev"). It's losing some steam here.

Children see throughput for 12 initial writers  = 4342419.12 kB/sec
Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
Min throughput per process  =  359299.06 kB/sec
Max throughput per process  =  363866.16 kB/sec
Avg throughput per process  =  361868.26 kB/sec
Min xfer= 1035520.00 kB
CPU Utilization: Wall time2.902CPU time1.951CPU 
utilization  67.22 %
Children see throughput for 12 rewriters= 4408576.66 kB/sec
Parent sees throughput for 12 rewriters = 4404280.87 kB/sec
Min throughput per process  =  364553.88 kB/sec
Max throughput per process  =  370029.28 kB/sec
Avg throughput per process  =  367381.39 kB/sec
Min xfer= 1033216.00 kB
CPU utilization: Wall time2.836CPU time1.956CPU 
utilization  68.97 %
Children see throughput for 12 readers  = 5406879.47 kB/sec
Parent sees throughput for 12 readers   = 5401862.78 kB/sec
Min throughput per process  =  449583.03 kB/sec
Max throughput per process  =  451761.69 kB/sec
Avg throughput per process  =  450573.29 kB/sec
Min 

Re: [PATCH v3 6/7] iommu/mediatek: Gather iova in iommu_unmap to achieve tlb sync once

2021-01-18 Thread Robin Murphy

On 2021-01-18 16:58, Will Deacon wrote:

On Mon, Jan 18, 2021 at 04:35:22PM +, Robin Murphy wrote:

On 2020-12-16 10:36, Yong Wu wrote:

In current iommu_unmap, this code is:

iommu_iotlb_gather_init(&iotlb_gather);
ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
iommu_iotlb_sync(domain, &iotlb_gather);

We could gather the whole iova range in __iommu_unmap, and then do tlb
synchronization in the iommu_iotlb_sync.

This patch implement this, Gather the range in mtk_iommu_unmap.
then iommu_iotlb_sync call tlb synchronization for the gathered iova range.
we don't call iommu_iotlb_gather_add_page since our tlb synchronization
could be regardless of granule size.

In this way, gather->start is impossible ULONG_MAX, remove the checking.

This patch aims to do tlb synchronization *once* in the iommu_unmap.


Assuming the update to patch #4 simply results in "unsigned long end = iova
+ size - 1;" here,

Reviewed-by: Robin Murphy 


There's a v4 here:

https://lore.kernel.org/r/20210107122909.16317-1-yong...@mediatek.com


Ha, so there is! Apparently I missed that in my post-holiday sweep last 
week and leant too heavily on the inbox-in-date-order assumption. Lemme 
just go catch up...


Thanks,
Robin.


Re: [PATCH v3 6/7] iommu/mediatek: Gather iova in iommu_unmap to achieve tlb sync once

2021-01-18 Thread Will Deacon
On Mon, Jan 18, 2021 at 04:35:22PM +, Robin Murphy wrote:
> On 2020-12-16 10:36, Yong Wu wrote:
> > In current iommu_unmap, this code is:
> > 
> > iommu_iotlb_gather_init(&iotlb_gather);
> > ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
> > iommu_iotlb_sync(domain, &iotlb_gather);
> > 
> > We could gather the whole iova range in __iommu_unmap, and then do tlb
> > synchronization in the iommu_iotlb_sync.
> > 
> > This patch implement this, Gather the range in mtk_iommu_unmap.
> > then iommu_iotlb_sync call tlb synchronization for the gathered iova range.
> > we don't call iommu_iotlb_gather_add_page since our tlb synchronization
> > could be regardless of granule size.
> > 
> > In this way, gather->start is impossible ULONG_MAX, remove the checking.
> > 
> > This patch aims to do tlb synchronization *once* in the iommu_unmap.
> 
> Assuming the update to patch #4 simply results in "unsigned long end = iova
> + size - 1;" here,
> 
> Reviewed-by: Robin Murphy 

There's a v4 here:

https://lore.kernel.org/r/20210107122909.16317-1-yong...@mediatek.com

Will


Re: [PATCH v3 7/7] iommu/mediatek: Remove the tlb-ops for v7s

2021-01-18 Thread Robin Murphy

On 2020-12-16 10:36, Yong Wu wrote:

Now that we already use the tlb operations from the iommu framework,
the tlb operations for v7s can be removed.

Correspondingly, switch the parameter "cookie" to the internal structure.


FWIW,

Reviewed-by: Robin Murphy 


Signed-off-by: Yong Wu 
---
  drivers/iommu/mtk_iommu.c | 27 ---
  1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 89cec51405cd..5656819cd4a1 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -206,10 +206,8 @@ static struct mtk_iommu_domain *to_mtk_domain(struct 
iommu_domain *dom)
return container_of(dom, struct mtk_iommu_domain, domain);
  }
  
-static void mtk_iommu_tlb_flush_all(void *cookie)

+static void mtk_iommu_tlb_flush_all(struct mtk_iommu_data *data)
  {
-   struct mtk_iommu_data *data = cookie;
-
for_each_m4u(data) {
if (!pm_runtime_active(data->dev))
continue;
@@ -221,9 +219,9 @@ static void mtk_iommu_tlb_flush_all(void *cookie)
  }
  
  static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,

-  size_t granule, void *cookie)
+  size_t granule,
+  struct mtk_iommu_data *data)
  {
-   struct mtk_iommu_data *data = cookie;
unsigned long flags;
int ret;
u32 tmp;
@@ -250,7 +248,7 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long 
iova, size_t size,
if (ret) {
dev_warn(data->dev,
 "Partial TLB flush timed out, falling back to full 
flush\n");
-   mtk_iommu_tlb_flush_all(cookie);
+   mtk_iommu_tlb_flush_all(data);
}
/* Clear the CPE status */
writel_relaxed(0, data->base + REG_MMU_CPE_DONE);
@@ -258,22 +256,6 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long 
iova, size_t size,
}
  }
  
-static void mtk_iommu_tlb_flush_page_nosync(struct iommu_iotlb_gather *gather,

-   unsigned long iova, size_t granule,
-   void *cookie)
-{
-   struct mtk_iommu_data *data = cookie;
-   struct iommu_domain *domain = &data->m4u_dom->domain;
-
-   iommu_iotlb_gather_add_page(domain, gather, iova, granule);
-}
-
-static const struct iommu_flush_ops mtk_iommu_flush_ops = {
-   .tlb_flush_all = mtk_iommu_tlb_flush_all,
-   .tlb_flush_walk = mtk_iommu_tlb_flush_range_sync,
-   .tlb_add_page = mtk_iommu_tlb_flush_page_nosync,
-};
-
  static irqreturn_t mtk_iommu_isr(int irq, void *dev_id)
  {
struct mtk_iommu_data *data = dev_id;
@@ -380,7 +362,6 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom)
.pgsize_bitmap = mtk_iommu_ops.pgsize_bitmap,
.ias = MTK_IOMMU_HAS_FLAG(data->plat_data, IOVA_34_EN) ? 34 : 
32,
.oas = 35,
-   .tlb = &mtk_iommu_flush_ops,
.iommu_dev = data->dev,
};
  




Re: [PATCH v3 6/7] iommu/mediatek: Gather iova in iommu_unmap to achieve tlb sync once

2021-01-18 Thread Robin Murphy

On 2020-12-16 10:36, Yong Wu wrote:

In current iommu_unmap, this code is:

iommu_iotlb_gather_init(&iotlb_gather);
ret = __iommu_unmap(domain, iova, size, &iotlb_gather);
iommu_iotlb_sync(domain, &iotlb_gather);

We could gather the whole iova range in __iommu_unmap, and then do tlb
synchronization in the iommu_iotlb_sync.

This patch implement this, Gather the range in mtk_iommu_unmap.
then iommu_iotlb_sync call tlb synchronization for the gathered iova range.
we don't call iommu_iotlb_gather_add_page since our tlb synchronization
could be regardless of granule size.

In this way, gather->start is impossible ULONG_MAX, remove the checking.

This patch aims to do tlb synchronization *once* in the iommu_unmap.


Assuming the update to patch #4 simply results in "unsigned long end = 
iova + size - 1;" here,


Reviewed-by: Robin Murphy 


Signed-off-by: Yong Wu 
---
  drivers/iommu/mtk_iommu.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index db7d43adb06b..89cec51405cd 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -506,7 +506,12 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
  {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
+   unsigned long long end = iova + size;
  
+	if (gather->start > iova)

+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
return dom->iop->unmap(dom->iop, iova, size, gather);
  }
  
@@ -523,9 +528,6 @@ static void mtk_iommu_iotlb_sync(struct iommu_domain *domain,

struct mtk_iommu_domain *dom = to_mtk_domain(domain);
size_t length = gather->end - gather->start;
  
-	if (gather->start == ULONG_MAX)

-   return;
-
mtk_iommu_tlb_flush_range_sync(gather->start, length, gather->pgsize,
   dom->data);
  }




Re: performance regression noted in v5.11-rc after c062db039f40

2021-01-18 Thread Chuck Lever



> On Jan 12, 2021, at 9:38 AM, Will Deacon  wrote:
> 
> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
> 
> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>> Hi-
>> 
>> [ Please cc: me on replies, I'm not currently subscribed to
>> iommu@lists ].
>> 
>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>> 
>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>> 
>> For those not familiar with the way storage protocols use RDMA, The
>> initiator/client sets up memory regions and the target/server uses
>> RDMA Read and Write to move data out of and into those regions. The
>> initiator/client uses only RDMA memory registration and invalidation
>> operations, and the target/server uses RDMA Read and Write.
>> 
>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>> enabled using the kernel command line options "intel_iommu=on
>> iommu=strict".
>> 
>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>> I was able to bisect on my client to the following commits.
>> 
>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>> map_sg"). This is about normal for this test.
>> 
>>  Children see throughput for 12 initial writers  = 4732581.09 kB/sec
>>  Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
>>  Min throughput per process  =  387764.34 kB/sec
>>  Max throughput per process  =  399655.47 kB/sec
>>  Avg throughput per process  =  394381.76 kB/sec
>>  Min xfer= 1017344.00 kB
>>  CPU Utilization: Wall time2.671CPU time1.974CPU 
>> utilization  73.89 %
>>  Children see throughput for 12 rewriters= 4837741.94 kB/sec
>>  Parent sees throughput for 12 rewriters = 4833509.35 kB/sec
>>  Min throughput per process  =  398983.72 kB/sec
>>  Max throughput per process  =  406199.66 kB/sec
>>  Avg throughput per process  =  403145.16 kB/sec
>>  Min xfer= 1030656.00 kB
>>  CPU utilization: Wall time2.584CPU time1.959CPU 
>> utilization  75.82 %
>>  Children see throughput for 12 readers  = 5921370.94 kB/sec
>>  Parent sees throughput for 12 readers   = 5914106.69 kB/sec
>>  Min throughput per process  =  491812.38 kB/sec
>>  Max throughput per process  =  494777.28 kB/sec
>>  Avg throughput per process  =  493447.58 kB/sec
>>  Min xfer= 1042688.00 kB
>>  CPU utilization: Wall time2.122CPU time1.968CPU 
>> utilization  92.75 %
>>  Children see throughput for 12 re-readers   = 5947985.69 kB/sec
>>  Parent sees throughput for 12 re-readers= 5941348.51 kB/sec
>>  Min throughput per process  =  492805.81 kB/sec
>>  Max throughput per process  =  497280.19 kB/sec
>>  Avg throughput per process  =  495665.47 kB/sec
>>  Min xfer= 1039360.00 kB
>>  CPU utilization: Wall time2.111CPU time1.968CPU 
>> utilization  93.22 %
>> 
>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>> 
>>  Children see throughput for 12 initial writers  = 4342419.12 kB/sec
>>  Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
>>  Min throughput per process  =  359299.06 kB/sec
>>  Max throughput per process  =  363866.16 kB/sec
>>  Avg throughput per process  =  361868.26 kB/sec
>>  Min xfer= 1035520.00 kB
>>  CPU Utilization: Wall time2.902CPU time1.951CPU 
>> utilization  67.22 %
>>  Children see throughput for 12 rewriters= 4408576.66 kB/sec
>>  Parent sees throughput for 12 rewriters = 4404280.87 kB/sec
>>  Min throughput per process  =  364553.88 kB/sec
>>  Max throughput per process  =  370029.28 kB/sec
>>  Avg throughput per process  =  367381.39 kB/sec
>>  Min xfer= 1033216.00 kB
>>  CPU utilization: Wall time2.836CPU time1.956CPU 
>> utilization  68.97 %
>>  Children see throughput for 12 readers  = 5406879.47 kB/sec
>>  Parent sees throughput for 12 readers   = 5401862.78 kB/sec
>>  Min throughput per process  =  449583.03 kB/sec
>>  Max throughput per process  =  451761.69 kB/sec
>>  Avg 

Re: [PATCH v3 5/7] iommu: Allow io_pgtable_tlb ops optional

2021-01-18 Thread Robin Murphy

On 2020-12-16 10:36, Yong Wu wrote:

This patch allows the io_pgtable_tlb ops to be NULL, since the IOMMU
drivers may use the tlb ops from the iommu framework instead.


There's not much in it, but I guess this does make more sense overall 
than just making .tlb_flush_all optional and drivers having to provide a 
full set of NULL callbacks.


Reviewed-by: Robin Murphy 


Signed-off-by: Yong Wu 
---
  include/linux/io-pgtable.h | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index adde9e49be08..c81796814afa 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -206,14 +206,16 @@ struct io_pgtable {
  
  static inline void io_pgtable_tlb_flush_all(struct io_pgtable *iop)

  {
-   iop->cfg.tlb->tlb_flush_all(iop->cookie);
+   if (iop->cfg.tlb && iop->cfg.tlb->tlb_flush_all)
+   iop->cfg.tlb->tlb_flush_all(iop->cookie);
  }
  
  static inline void

  io_pgtable_tlb_flush_walk(struct io_pgtable *iop, unsigned long iova,
  size_t size, size_t granule)
  {
-   iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
+   if (iop->cfg.tlb && iop->cfg.tlb->tlb_flush_walk)
+   iop->cfg.tlb->tlb_flush_walk(iova, size, granule, iop->cookie);
  }
  
  static inline void

@@ -221,7 +223,7 @@ io_pgtable_tlb_add_page(struct io_pgtable *iop,
struct iommu_iotlb_gather * gather, unsigned long iova,
size_t granule)
  {
-   if (iop->cfg.tlb->tlb_add_page)
+   if (iop->cfg.tlb && iop->cfg.tlb->tlb_add_page)
iop->cfg.tlb->tlb_add_page(gather, iova, granule, iop->cookie);
  }
  




Re: [PATCH v6 06/33] of/device: Move dma_range_map before of_iommu_configure

2021-01-18 Thread Robin Murphy

On 2021-01-15 05:30, Yong Wu wrote:

On Thu, 2021-01-14 at 13:27 -0600, Rob Herring wrote:

On Mon, Jan 11, 2021 at 07:18:47PM +0800, Yong Wu wrote:

"dev->dma_range_map" contains the devices' dma_ranges information,
This patch moves dma_range_map before of_iommu_configure. The iommu
driver may need to know the dma_address requirements of its iommu
consumer devices.

CC: Rob Herring 
CC: Frank Rowand 
Signed-off-by: Yong Wu 
---
  drivers/of/device.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index aedfaaafd3e7..1d84636149df 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -170,9 +170,11 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
dev_dbg(dev, "device is%sdma coherent\n",
coherent ? " " : " not ");
  
+	dev->dma_range_map = map;

iommu = of_iommu_configure(dev, np, id);
if (PTR_ERR(iommu) == -EPROBE_DEFER) {
kfree(map);
+   dev->dma_range_map = NULL;


Not really going to matter, but you should probably clear dma_range_map
before what it points to is freed.

With that,

Reviewed-by: Rob Herring 


Thanks for the review. I will move it before "kfree(map)" in next
version.


Paul noticed that we already have a bug in assigning to this 
unconditionally[1] - I'd totally forgotten about this series when I 
theorised about IOMMU drivers wanting the information earlier, but 
sweeping my inbox now only goes to show I was right to think of it :)


We should really get something in as a fix independent of this series, 
taking both angles into account.


Robin.

[1] 
https://lore.kernel.org/linux-arm-kernel/5c7946f3-b56e-da00-a750-be097c7ce...@arm.com/





return -EPROBE_DEFER;
}
  
@@ -181,7 +183,6 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
  
  	arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
  
-	dev->dma_range_map = map;

return 0;
  }
  EXPORT_SYMBOL_GPL(of_dma_configure_id);
--
2.18.0





Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-01-18 Thread Konrad Rzeszutek Wilk
On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > The size of the buffer being bounced is not checked if it happens
> > > to be larger than the size of the mapped buffer. Because the size
> > > can be controlled by a device, as it's the case with virtio devices,
> > > this can lead to memory corruption.
> > > 
> > 
> > I'm really worried about all these hodge podge hacks for not trusted
> > hypervisors in the I/O stack.  Instead of trying to harden protocols
> > that are fundamentally not designed for this, how about instead coming
> > up with a new paravirtualized I/O interface that is specifically
> > designed for use with an untrusted hypervisor from the start?
> 
> Your comment makes sense but then that would require the cooperation
> of these vendors and the cloud providers to agree on something meaningful.
> I am also not sure whether the end result would be better than hardening
> this interface to catch corruption. There is already some validation in
> unmap path anyway.
> 
> Another possibility is to move this hardening to the common virtio code,
> but I think the code may become more complicated there since it would
> require tracking both the dma_addr and length for each descriptor.

Christoph,

I've been wrestling with the same thing - this is specific to busted
drivers. And in reality you could do the same thing with a hardware
virtio device (see example in http://thunderclap.io/) - where the
mitigation is 'enable the IOMMU to do its job.'.

AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
and while that is great in the future, SEV without IOMMU is now here.

Doing a full circle here, this issue can be exploited with virtio
but you could say do that with real hardware too if you hacked the
firmware, so if you say used Intel SR-IOV NIC that was compromised
on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
of the guest would be SWIOTLB code. Last line of defense against
bad firmware to say.

As such I am leaning towards taking this code, but I am worried
about the performance hit .. but perhaps I shouldn't as if you
are using SWIOTLB=force already you are kind of taking a
performance hit?

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread John Garry

On 18/01/2021 12:59, Robin Murphy wrote:

for cpu_rcaches too, and get a similar abort at runtime.

It's not specifically that we expect them (allocation failures for the
loaded magazine), rather we should make safe against it.

So could you be more specific in your concern for the cpu_rcache 
failure?

cpu_rcache magazine assignment comes from this logic.

If this fails:

drivers/iommu/iova.c:847: rcache->cpu_rcaches = 
__alloc_percpu(sizeof(*cpu_rcache), cache_line_size());


then we'll get an Oops in __iova_rcache_get(). So if we're making the
module safer against magazine allocation failure, shouldn't we also
protect against cpu_rcaches allocation failure?


Ah, gotcha. So we have the WARN there, but that's not much use as this 
would still crash, as you say.


So maybe we can embed the cpu rcaches in iova_domain struct, to avoid 
the separate (failable) cpu rcache allocation.


Is that even possible? The size of percpu data isn't known at compile 
time, so at best it would add ugly runtime complexity to any allocation 
of a struct iova_domain by itself, but worse than that it means that 
embedding iova_domain in any other structure becomes completely broken, no?


Ah, now I see that it's not possible. I was thinking of using 
DEFINE_PER_CPU(), but it's not permitted.


So even though this patch saves us from cpu_rcache->loaded / ->prev == 
NULL, I still prefer not to add explicit checks for cpu_rcache == NULL 
in the IOVA alloc/free paths, and would rather pass an error back in 
init_iova_rcaches(), but adding code for tidy-up looks messy.


Thanks,
John
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 6/6] vfio/iommu_type1: Drop parameter "pgsize" of update_user_bitmap

2021-01-18 Thread Keqian Zhu



On 2021/1/16 7:44, Alex Williamson wrote:
> On Thu, 7 Jan 2021 12:44:01 +0800
> Keqian Zhu  wrote:
> 
>> We always use the smallest supported page size of vfio_iommu as
>> pgsize. Drop parameter "pgsize" of update_user_bitmap.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 9 -
>>  1 file changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 82649a040148..bceda5e8baaa 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -978,10 +978,9 @@ static void vfio_update_pgsize_bitmap(struct vfio_iommu 
>> *iommu)
>>  }
>>  
>>  static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>> -  struct vfio_dma *dma, dma_addr_t base_iova,
>> -  size_t pgsize)
>> +  struct vfio_dma *dma, dma_addr_t base_iova)
>>  {
>> -unsigned long pgshift = __ffs(pgsize);
>> +unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
>>  unsigned long nbits = dma->size >> pgshift;
>>  unsigned long bit_offset = (dma->iova - base_iova) >> pgshift;
>>  unsigned long copy_offset = bit_offset / BITS_PER_LONG;
>> @@ -1046,7 +1045,7 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, 
>> struct vfio_iommu *iommu,
>>  if (dma->iova > iova + size - 1)
>>  break;
>>  
>> -ret = update_user_bitmap(bitmap, iommu, dma, iova, pgsize);
>> +ret = update_user_bitmap(bitmap, iommu, dma, iova);
>>  if (ret)
>>  return ret;
>>  
>> @@ -1192,7 +1191,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  
>>  if (unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
>>  ret = update_user_bitmap(bitmap->data, iommu, dma,
>> - unmap->iova, pgsize);
>> + unmap->iova);
>>  if (ret)
>>  break;
>>  }
> 
> Same as the previous, both call sites already have both pgsize and
> pgshift, pass both rather than recalculate.  Thanks,
> 
My idea is that the "pgsize" parameter goes here and there, disturbs our sight 
when read code.
To be frankly, the recalculate is negligible. Or we can add new fields in 
vfio_iommu, which are
updated in vfio_update_pgsize_bitmap().

Thanks,
Keqian



> Alex
> 
> .
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/6] vfio/iommu_type1: Initially set the pinned_page_dirty_scope

2021-01-18 Thread Keqian Zhu



On 2021/1/16 7:30, Alex Williamson wrote:
> On Thu, 7 Jan 2021 12:43:58 +0800
> Keqian Zhu  wrote:
> 
>> For now there are 3 ways to promote the pinned_page_dirty_scope
>> status of vfio_iommu:
>>
>> 1. Through vfio pin interface.
>> 2. Detach a group without pinned_dirty_scope.
>> 3. Attach a group with pinned_dirty_scope.
>>
>> For point 3, the only chance to promote the pinned_page_dirty_scope
>> status is when vfio_iommu is newly created. As we can safely set
>> empty vfio_iommu to be at pinned status, then the point 3 can be
>> removed to reduce operations.
>>
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 110ada24ee91..b596c482487b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -2045,11 +2045,8 @@ static int vfio_iommu_type1_attach_group(void 
>> *iommu_data,
>>   * Non-iommu backed group cannot dirty memory directly,
>>   * it can only use interfaces that provide dirty
>>   * tracking.
>> - * The iommu scope can only be promoted with the
>> - * addition of a dirty tracking group.
>>   */
>>  group->pinned_page_dirty_scope = true;
>> -promote_pinned_page_dirty_scope(iommu);
>>  mutex_unlock(>lock);
>>  
>>  return 0;
>> @@ -2436,6 +2433,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>>  INIT_LIST_HEAD(>iova_list);
>>  iommu->dma_list = RB_ROOT;
>>  iommu->dma_avail = dma_entry_limit;
>> +iommu->pinned_page_dirty_scope = true;
>>  mutex_init(>lock);
>>  BLOCKING_INIT_NOTIFIER_HEAD(>notifier);
>>  
> 
> This would be resolved automatically if we used the counter approach I
> mentioned on the previous patch, adding a pinned-page scope group simply
> wouldn't increment the iommu counter, which would initially be zero
> indicating no "all-dma" groups.  Thanks,
> 
Sure, I will do that, thanks.

Keqian.

> Alex
> 
> .
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/6] vfio/iommu_type1: Make an explicit "promote" semantic

2021-01-18 Thread Keqian Zhu



On 2021/1/16 6:42, Alex Williamson wrote:
> On Thu, 7 Jan 2021 12:43:56 +0800
> Keqian Zhu  wrote:
> 
>> When we want to promote the pinned_page_dirty_scope of vfio_iommu,
>> we call the "update" function to visit all vfio_group, but when we
>> want to downgrade this, we can set the flag as false directly.
> 
> I agree that the transition can only go in one direction, but it's
> still conditional on the scope of all groups involved.  We are
> "updating" the iommu state based on the change of a group.  Renaming
> this to "promote" seems like a matter of personal preference.
> 
>> So we'd better make an explicit "promote" semantic to the "update"
>> function. BTW, if vfio_iommu already has been promoted, then return
>> early.
> 
> Currently it's the caller that avoids using this function when the
> iommu scope is already correct.  In fact the changes induces a
> redundant test in the pin_pages code path, we're changing a group from
> non-pinned-page-scope to pinned-page-scope, therefore the iommu scope
> cannot initially be scope limited.  In the attach_group call path,
> we're moving that test from the caller, so at best we've introduced an
> additional function call.
> 
> The function as it exists today is also more versatile whereas the
> "promote" version here forces it to a single task with no appreciable
> difference in complexity or code.  This seems like a frivolous change.
> Thanks,
OK, I will adapt your idea that maintenance a counter of non-pinned groups.
Then we keep the "update" semantic, and the target is the counter ;-).

Thanks,
Keqian

> 
> Alex
> 
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 30 ++
>>  1 file changed, 14 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 0b4dedaa9128..334a8240e1da 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -148,7 +148,7 @@ static int put_pfn(unsigned long pfn, int prot);
>>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu 
>> *iommu,
>> struct iommu_group *iommu_group);
>>  
>> -static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu);
>> +static void promote_pinned_page_dirty_scope(struct vfio_iommu *iommu);
>>  /*
>>   * This code handles mapping and unmapping of user data buffers
>>   * into DMA'ble space using the IOMMU
>> @@ -714,7 +714,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  group = vfio_iommu_find_iommu_group(iommu, iommu_group);
>>  if (!group->pinned_page_dirty_scope) {
>>  group->pinned_page_dirty_scope = true;
>> -update_pinned_page_dirty_scope(iommu);
>> +promote_pinned_page_dirty_scope(iommu);
>>  }
>>  
>>  goto pin_done;
>> @@ -1622,27 +1622,26 @@ static struct vfio_group 
>> *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
>>  return group;
>>  }
>>  
>> -static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu)
>> +static void promote_pinned_page_dirty_scope(struct vfio_iommu *iommu)
>>  {
>>  struct vfio_domain *domain;
>>  struct vfio_group *group;
>>  
>> +if (iommu->pinned_page_dirty_scope)
>> +return;
>> +
>>  list_for_each_entry(domain, >domain_list, next) {
>>  list_for_each_entry(group, >group_list, next) {
>> -if (!group->pinned_page_dirty_scope) {
>> -iommu->pinned_page_dirty_scope = false;
>> +if (!group->pinned_page_dirty_scope)
>>  return;
>> -}
>>  }
>>  }
>>  
>>  if (iommu->external_domain) {
>>  domain = iommu->external_domain;
>>  list_for_each_entry(group, >group_list, next) {
>> -if (!group->pinned_page_dirty_scope) {
>> -iommu->pinned_page_dirty_scope = false;
>> +if (!group->pinned_page_dirty_scope)
>>  return;
>> -}
>>  }
>>  }
>>  
>> @@ -2057,8 +2056,7 @@ static int vfio_iommu_type1_attach_group(void 
>> *iommu_data,
>>   * addition of a dirty tracking group.
>>   */
>>  group->pinned_page_dirty_scope = true;
>> -if (!iommu->pinned_page_dirty_scope)
>> -update_pinned_page_dirty_scope(iommu);
>> +promote_pinned_page_dirty_scope(iommu);
>>  mutex_unlock(>lock);
>>  
>>  return 0;
>> @@ -2341,7 +2339,7 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>  struct vfio_iommu *iommu = iommu_data;
>>  struct vfio_domain *domain;
>>  struct vfio_group *group;
>> -bool update_dirty_scope = false;
>> +bool promote_dirty_scope = false;
>> 

Re: [PATCH v2 2/2] vfio/iommu_type1: Sanity check pfn_list when remove vfio_dma

2021-01-18 Thread Keqian Zhu



On 2021/1/16 3:14, Alex Williamson wrote:
> On Fri, 15 Jan 2021 17:26:43 +0800
> Keqian Zhu  wrote:
> 
>> vfio_sanity_check_pfn_list() is used to check whether pfn_list of
>> vfio_dma is empty when remove the external domain, so it makes a
>> wrong assumption that only external domain will add pfn to dma pfn_list.
>>
>> Now we apply this check when remove a specific vfio_dma and extract
>> the notifier check just for external domain.
> 
> The page pinning interface is gated by having a notifier registered for
> unmaps, therefore non-external domains would also need to register a
> notifier.  There's currently no other way to add entries to the
> pfn_list.  So if we allow pinning for such domains, then it's wrong to
> WARN_ON() when the notifier list is not-empty when removing an external
> domain.  Long term we should probably extend page {un}pinning for the
> caller to pass their notifier to be validated against the notifier list
> rather than just allowing page pinning if *any* notifier is registered.
> Thanks,
I was misled by the code comments. So when the commit a54eb55045ae is added, 
the only
user of pin interface is mdev vendor driver, but now we also allow iommu backed 
group
to use this interface to constraint dirty scope. Is 
vfio_iommu_unmap_unpin_all() a
proper place to put this WARN()?

Thanks,
Keqian

> 
> Alex
>  
>> Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 24 +---
>>  1 file changed, 5 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 4e82b9a3440f..a9bc15e84a4e 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -958,6 +958,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
>> struct vfio_dma *dma,
>>  
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> +WARN_ON(!RB_EMPTY_ROOT(>pfn_list);
>>  vfio_unmap_unpin(iommu, dma, true);
>>  vfio_unlink_dma(iommu, dma);
>>  put_task_struct(dma->task);
>> @@ -2251,23 +2252,6 @@ static void vfio_iommu_unmap_unpin_reaccount(struct 
>> vfio_iommu *iommu)
>>  }
>>  }
>>  
>> -static void vfio_sanity_check_pfn_list(struct vfio_iommu *iommu)
>> -{
>> -struct rb_node *n;
>> -
>> -n = rb_first(>dma_list);
>> -for (; n; n = rb_next(n)) {
>> -struct vfio_dma *dma;
>> -
>> -dma = rb_entry(n, struct vfio_dma, node);
>> -
>> -if (WARN_ON(!RB_EMPTY_ROOT(>pfn_list)))
>> -break;
>> -}
>> -/* mdev vendor driver must unregister notifier */
>> -WARN_ON(iommu->notifier.head);
>> -}
>> -
>>  /*
>>   * Called when a domain is removed in detach. It is possible that
>>   * the removed domain decided the iova aperture window. Modify the
>> @@ -2367,7 +2351,8 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>  kfree(group);
>>  
>>  if (list_empty(>external_domain->group_list)) {
>> -vfio_sanity_check_pfn_list(iommu);
>> +/* mdev vendor driver must unregister notifier 
>> */
>> +WARN_ON(iommu->notifier.head);
>>  
>>  if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>>  vfio_iommu_unmap_unpin_all(iommu);
>> @@ -2491,7 +2476,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  
>>  if (iommu->external_domain) {
>>  vfio_release_domain(iommu->external_domain, true);
>> -vfio_sanity_check_pfn_list(iommu);
>> +/* mdev vendor driver must unregister notifier */
>> +WARN_ON(iommu->notifier.head);
>>  kfree(iommu->external_domain);
>>  }
>>  
> 
> .
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread Robin Murphy

On 2021-01-18 10:55, John Garry wrote:

On 18/01/2021 10:08, Jean-Philippe Brucker wrote:
Any idea why that's happening?  This fix seems ok but if we're 
expecting

allocation failures for the loaded magazine then we could easily get it
for cpu_rcaches too, and get a similar abort at runtime.

It's not specifically that we expect them (allocation failures for the
loaded magazine), rather we should make safe against it.

So could you be more specific in your concern for the cpu_rcache 
failure?

cpu_rcache magazine assignment comes from this logic.

If this fails:

drivers/iommu/iova.c:847: rcache->cpu_rcaches = 
__alloc_percpu(sizeof(*cpu_rcache), cache_line_size());


then we'll get an Oops in __iova_rcache_get(). So if we're making the
module safer against magazine allocation failure, shouldn't we also
protect against cpu_rcaches allocation failure?


Ah, gotcha. So we have the WARN there, but that's not much use as this 
would still crash, as you say.


So maybe we can embed the cpu rcaches in iova_domain struct, to avoid 
the separate (failable) cpu rcache allocation.


Is that even possible? The size of percpu data isn't known at compile 
time, so at best it would add ugly runtime complexity to any allocation 
of a struct iova_domain by itself, but worse than that it means that 
embedding iova_domain in any other structure becomes completely broken, no?


Robin.

Alternatively, we could add NULL checks __iova_rcache_get() et al for 
this allocation failure but that's not preferable as it's fastpath.


Finally so we could pass back an error code from init_iova_rcache() to 
its only caller, init_iova_domain(); but that has multiple callers and 
would need to be fixed up.


Not sure which is best or on other options.

Thanks,
John

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread Jean-Philippe Brucker
On Mon, Jan 18, 2021 at 10:55:52AM +, John Garry wrote:
> On 18/01/2021 10:08, Jean-Philippe Brucker wrote:
> > > > Any idea why that's happening?  This fix seems ok but if we're expecting
> > > > allocation failures for the loaded magazine then we could easily get it
> > > > for cpu_rcaches too, and get a similar abort at runtime.
> > > It's not specifically that we expect them (allocation failures for the
> > > loaded magazine), rather we should make safe against it.
> > > 
> > > So could you be more specific in your concern for the cpu_rcache failure?
> > > cpu_rcache magazine assignment comes from this logic.
> > If this fails:
> > 
> > drivers/iommu/iova.c:847: rcache->cpu_rcaches = 
> > __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
> > 
> > then we'll get an Oops in __iova_rcache_get(). So if we're making the
> > module safer against magazine allocation failure, shouldn't we also
> > protect against cpu_rcaches allocation failure?
> 
> Ah, gotcha. So we have the WARN there, but that's not much use as this would
> still crash, as you say.
> 
> So maybe we can embed the cpu rcaches in iova_domain struct, to avoid the
> separate (failable) cpu rcache allocation.
> 
> Alternatively, we could add NULL checks __iova_rcache_get() et al for this
> allocation failure but that's not preferable as it's fastpath.
> 
> Finally so we could pass back an error code from init_iova_rcache() to its
> only caller, init_iova_domain(); but that has multiple callers and would
> need to be fixed up.
> 
> Not sure which is best or on other options.

I would have initially gone with option 2 which seems the simplest, but I
don't have a setup to measure that overhead

Thanks,
Jean
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 3/3] iommu/iova: Flush CPU rcache for when a depot fills

2021-01-18 Thread John Garry

On 15/01/2021 19:21, Robin Murphy wrote:


It would be good to understand why the rcache doesn't stabilize. Could be
a bug, or just need some tuning

In strict mode, if a driver does Alloc-Free-Alloc and the first alloc
misses the rcache, the second allocation hits it. The same sequence in
non-strict mode misses the cache twice, because the IOVA is added to the
flush queue on Free.

So rather than AFAFAF.. we get AAA..FFF.., only once the fq_timer 
triggers

or the FQ is full.


Sounds right


Interestingly the FQ size is 2x IOVA_MAG_SIZE, so we
could allocate 2 magazines worth of fresh IOVAs before alloc starts
hitting the cache. If a job allocates more than that, some magazines are
going to the depot, and with multi-CPU jobs those will get used on other
CPUs during the next alloc bursts, causing the progressive increase in
rcache consumption. I wonder if setting IOVA_MAG_SIZE > IOVA_FQ_SIZE 
helps

reuse of IOVAs?


Looking back through the lore history, I don't know where the 
IOVA_FQ_SIZE = 256 came from. I guess it's size of 2x IOVA_MAG_SIZE (1x 
for loaded and 1x for prev) for the reason you mention.




Then again I haven't worked out the details, might be entirely wrong. 
I'll

have another look next week.




cheers

I did start digging into the data (thanks for that!) before Christmas, 
but between being generally frazzled and trying to remember how to write 
Perl to massage the numbers out of the log dump I never got round to 
responding, sorry.


As you may have seen:
https://raw.githubusercontent.com/hisilicon/kernel-dev/064c4dc8869b3f2ad07edffceafde0b129f276b0/lsi3008_dmesg

I had to change some block configs via sysfs to ever get IOVA locations 
for size > 0. And even then, I still got none bigger than 
IOVA_RANGE_CACHE_MAX_SIZE.


Note: For a log like:
[13175.361915] print_iova2 iova_allocs(=500 ... too_big=47036

47036 is number of IOVA size > IOVA_RANGE_CACHE_MAX_SIZE, in case it was 
not clear.


And I never hit the critical point of a depot bin filling, but it may 
just take even longer.


However with IOVA size = 0 always occurring, then I noticed that the 
depot size = 0 bin fills up relatively quickly. As such, I am now 
slightly skeptical of the approach I have taken here, i.e purge the 
whole rcache.




The partial thoughts that I can recall right now are firstly that the 
total numbers of IOVAs are actually pretty meaningless, it really needs 
to be broken down by size (that's where my Perl-hacking stalled...); 
secondly that the pattern is far more than just a steady increase - the 
CPU rcache count looks to be heading asymptotically towards ~65K IOVAs 
all the time, representing (IIRC) two sizes being in heavy rotation, 
while the depot is happily ticking along in a steady state as expected, 
until it suddenly explodes out of nowhere; thirdly, I'd really like to 
see instrumentation of the flush queues at the same time, since I think 
they're the real culprit.


My theory so far is that everyone is calling queue_iova() frequently 
enough to keep the timer at bay and their own queues drained. Then at 
the ~16H mark, *something* happens that pauses unmaps long enough for 
the timer to fire, and at that point all hell breaks loose.


So do you think that the freeing the IOVA magazines when the depot fills 
is the cause of this? That was our analysis.


The depot is 
suddenly flooded with IOVAs of *all* sizes, indicative of all the queues 
being flushed at once (note that the two most common sizes have been 
hovering perilously close to "full" the whole time), but then, 
crucially, *that keeps happening*. My guess is that the load of 
fq_flush_timeout() slows things down enough that the the timer then 
keeps getting the chance to expire and repeat the situation.


Not sure on that one.



The main conclusion I draw from this is the same one that was my initial 
gut feeling; that MAX_GLOBAL_MAGS = 32 is utter bollocks.


Yeah, I tend to agree with that. Or, more specifically, how things work 
today is broken, and MAX_GLOBAL_MAGS = 32 is very much involved with that.


The CPU rcache 
capacity scales with the number of CPUs; the flush queue capacity scales 
with the number of CPUs; it is nonsensical that the depot size does not 
correspondingly scale with the number of CPUs (I note that the testing 
on the original patchset cites a 16-CPU system, where that depot 
capacity is conveniently equal to the total rcache capacity).


Now yes, purging the rcaches when the depot is full does indeed help 
mitigate this scenario - I assume it provides enough of a buffer where 
the regular free_iova_fast() calls don't hit queue_iova() for a while 
(and gives fq_ring_free() some reprieve on the CPU handling the 
timeout), giving enough leeway for the flood to finish before anyone 
starts hitting queues/locks/etc. and stalling again, and thus break the 
self-perpetuating timeout cycle. But that's still only a damage 
limitation exercise! It's planning for failure to just lie down and 
assume 

Re: [PATCH v2 1/2] vfio/iommu_type1: Populate full dirty when detach non-pinned group

2021-01-18 Thread Keqian Zhu



On 2021/1/16 2:01, Alex Williamson wrote:
> On Fri, 15 Jan 2021 17:26:42 +0800
> Keqian Zhu  wrote:
> 
>> If a group with non-pinned-page dirty scope is detached with dirty
>> logging enabled, we should fully populate the dirty bitmaps at the
>> time it's removed since we don't know the extent of its previous DMA,
>> nor will the group be present to trigger the full bitmap when the user
>> retrieves the dirty bitmap.
>>
>> Fixes: d6a4c185660c ("vfio iommu: Implementation of ioctl for dirty pages 
>> tracking")
>> Suggested-by: Alex Williamson 
>> Signed-off-by: Keqian Zhu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 18 +-
>>  1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 0b4dedaa9128..4e82b9a3440f 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -236,6 +236,19 @@ static void vfio_dma_populate_bitmap(struct vfio_dma 
>> *dma, size_t pgsize)
>>  }
>>  }
>>  
>> +static void vfio_iommu_populate_bitmap_full(struct vfio_iommu *iommu)
>> +{
>> +struct rb_node *n;
>> +unsigned long pgshift = __ffs(iommu->pgsize_bitmap);
>> +
>> +for (n = rb_first(>dma_list); n; n = rb_next(n)) {
>> +struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +if (dma->iommu_mapped)
>> +bitmap_set(dma->bitmap, 0, dma->size >> pgshift);
>> +}
>> +}
>> +
>>  static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu, size_t 
>> pgsize)
>>  {
>>  struct rb_node *n;
>> @@ -2415,8 +2428,11 @@ static void vfio_iommu_type1_detach_group(void 
>> *iommu_data,
>>   * Removal of a group without dirty tracking may allow the iommu scope
>>   * to be promoted.
>>   */
>> -if (update_dirty_scope)
>> +if (update_dirty_scope) {
>>  update_pinned_page_dirty_scope(iommu);
>> +if (iommu->dirty_page_tracking)
>> +vfio_iommu_populate_bitmap_full(iommu);
>> +}
>>  mutex_unlock(>lock);
>>  }
>>  
> 
> This doesn't do the right thing.  This marks the bitmap dirty if:
> 
>  * The detached group dirty scope was not limited to pinned pages
> 
>  AND
> 
>  * Dirty tracking is enabled
> 
>  AND
> 
>  * The vfio_dma is *currently* (ie. after the detach) iommu_mapped
> 
> We need to mark the bitmap dirty based on whether the vfio_dma *was*
> iommu_mapped by the group that is now detached.  Thanks,
> 
> Alex
> 
Hi Alex,

Yes, I missed this point again :-(. The update_dirty_scope means we detached
an iommu backed group, and that means the vfio_dma *was* iommu_mapped by this
group, so we can populate full bitmap unconditionally, right?

Thanks,
Keqian


> .
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-01-18 Thread Martin Radev
On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > The size of the buffer being bounced is not checked if it happens
> > to be larger than the size of the mapped buffer. Because the size
> > can be controlled by a device, as it's the case with virtio devices,
> > this can lead to memory corruption.
> > 
> 
> I'm really worried about all these hodge podge hacks for not trusted
> hypervisors in the I/O stack.  Instead of trying to harden protocols
> that are fundamentally not designed for this, how about instead coming
> up with a new paravirtualized I/O interface that is specifically
> designed for use with an untrusted hypervisor from the start?

Your comment makes sense but then that would require the cooperation
of these vendors and the cloud providers to agree on something meaningful.
I am also not sure whether the end result would be better than hardening
this interface to catch corruption. There is already some validation in
unmap path anyway.

Another possibility is to move this hardening to the common virtio code,
but I think the code may become more complicated there since it would
require tracking both the dma_addr and length for each descriptor.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RESEND PATCH 1/1] iommu/amd: Remove unnecessary assignment

2021-01-18 Thread Will Deacon
On Thu, Dec 10, 2020 at 10:13:30AM +0800, Adrian Huang wrote:
> From: Adrian Huang 
> 
> From: Adrian Huang 
> 
> The values of local variables are assigned after local variables
> are declared, so no need to assign the initial value during the
> variable declaration.
> 
> And, no need to assign NULL for the local variable 'ivrs_base'
> after invoking acpi_put_table().
> 
> Signed-off-by: Adrian Huang 
> ---
>  drivers/iommu/amd/init.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Acked-by: Will Deacon 

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/6] IOMMU: Some more IOVA and core code tidy-up

2021-01-18 Thread Will Deacon
On Wed, Jan 06, 2021 at 09:35:05PM +0800, John Garry wrote:
> Just some tidy-up to IOVA and core code.
> 
> Based on v5.11-rc2
> 
> Differences to v1:
> - Add core IOMMU patches
> 
> John Garry (6):
>   iova: Make has_iova_flush_queue() private
>   iova: Delete copy_reserved_iova()
>   iova: Stop exporting some more functions
>   iommu: Stop exporting iommu_map_sg_atomic()
>   iommu: Delete iommu_domain_window_disable()
>   iommu: Delete iommu_dev_has_feature()
> 
>  drivers/iommu/iommu.c | 21 -
>  drivers/iommu/iova.c  | 36 +---
>  include/linux/iommu.h | 13 -
>  include/linux/iova.h  | 12 
>  4 files changed, 1 insertion(+), 81 deletions(-)

For the series:

Acked-by: Will Deacon 

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread John Garry

On 18/01/2021 10:08, Jean-Philippe Brucker wrote:

Any idea why that's happening?  This fix seems ok but if we're expecting
allocation failures for the loaded magazine then we could easily get it
for cpu_rcaches too, and get a similar abort at runtime.

It's not specifically that we expect them (allocation failures for the
loaded magazine), rather we should make safe against it.

So could you be more specific in your concern for the cpu_rcache failure?
cpu_rcache magazine assignment comes from this logic.

If this fails:

drivers/iommu/iova.c:847: rcache->cpu_rcaches = 
__alloc_percpu(sizeof(*cpu_rcache), cache_line_size());

then we'll get an Oops in __iova_rcache_get(). So if we're making the
module safer against magazine allocation failure, shouldn't we also
protect against cpu_rcaches allocation failure?


Ah, gotcha. So we have the WARN there, but that's not much use as this 
would still crash, as you say.


So maybe we can embed the cpu rcaches in iova_domain struct, to avoid 
the separate (failable) cpu rcache allocation.


Alternatively, we could add NULL checks __iova_rcache_get() et al for 
this allocation failure but that's not preferable as it's fastpath.


Finally so we could pass back an error code from init_iova_rcache() to 
its only caller, init_iova_domain(); but that has multiple callers and 
would need to be fixed up.


Not sure which is best or on other options.

Thanks,
John
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread Jean-Philippe Brucker
On Mon, Jan 18, 2021 at 09:24:17AM +, John Garry wrote:
> On 15/01/2021 17:30, Jean-Philippe Brucker wrote:
> > On Thu, Dec 10, 2020 at 02:23:08AM +0800, John Garry wrote:
> > > A similar crash to the following could be observed if initial CPU rcache
> > > magazine allocations fail in init_iova_rcaches():
> > 
> 
> thanks for having a look
> 
> > Any idea why that's happening?  This fix seems ok but if we're expecting
> > allocation failures for the loaded magazine then we could easily get it
> > for cpu_rcaches too, and get a similar abort at runtime.
> 
> It's not specifically that we expect them (allocation failures for the
> loaded magazine), rather we should make safe against it.
> 
> So could you be more specific in your concern for the cpu_rcache failure?
> cpu_rcache magazine assignment comes from this logic.

If this fails:

drivers/iommu/iova.c:847: rcache->cpu_rcaches = 
__alloc_percpu(sizeof(*cpu_rcache), cache_line_size());

then we'll get an Oops in __iova_rcache_get(). So if we're making the
module safer against magazine allocation failure, shouldn't we also
protect against cpu_rcaches allocation failure?

Thanks,
Jean

> 
> Anyway, logic like "if not full" or "if not empty" is poor as the outcome
> for NULL is ambiguous (maybe there's a better word) and the code is not safe
> against it, and so I replace with "if space" or "if have an IOVA",
> respectively.
> 
> Thanks,
> John
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/3] iommu/iova: Avoid double-negatives in magazine helpers

2021-01-18 Thread John Garry

On 15/01/2021 17:30, Jean-Philippe Brucker wrote:

On Thu, Dec 10, 2020 at 02:23:08AM +0800, John Garry wrote:

A similar crash to the following could be observed if initial CPU rcache
magazine allocations fail in init_iova_rcaches():




thanks for having a look


Any idea why that's happening?  This fix seems ok but if we're expecting
allocation failures for the loaded magazine then we could easily get it
for cpu_rcaches too, and get a similar abort at runtime.


It's not specifically that we expect them (allocation failures for the 
loaded magazine), rather we should make safe against it.


So could you be more specific in your concern for the cpu_rcache 
failure? cpu_rcache magazine assignment comes from this logic.


Anyway, logic like "if not full" or "if not empty" is poor as the 
outcome for NULL is ambiguous (maybe there's a better word) and the code 
is not safe against it, and so I replace with "if space" or "if have an 
IOVA", respectively.


Thanks,
John



Thanks,
Jean


Unable to handle kernel NULL pointer dereference at virtual address 

Mem abort info:

   free_iova_fast+0xfc/0x280
   iommu_dma_free_iova+0x64/0x70
   __iommu_dma_unmap+0x9c/0xf8
   iommu_dma_unmap_sg+0xa8/0xc8
   dma_unmap_sg_attrs+0x28/0x50
   cq_thread_v3_hw+0x2dc/0x528
   irq_thread_fn+0x2c/0xa0
   irq_thread+0x130/0x1e0
   kthread+0x154/0x158
   ret_from_fork+0x10/0x34

The issue is that expression !iova_magazine_full(NULL) evaluates true; this
falls over in __iova_rcache_insert() when we attempt to cache a mag and
cpu_rcache->loaded == NULL:

if (!iova_magazine_full(cpu_rcache->loaded)) {
can_insert = true;
...

if (can_insert)
iova_magazine_push(cpu_rcache->loaded, iova_pfn);

As above, can_insert is evaluated true, which it shouldn't be, and we try
to insert pfns in a NULL mag, which is not safe.

To avoid this, stop using double-negatives, like !iova_magazine_full() and
!iova_magazine_empty(), and use positive tests, like
iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
can safely deal with cpu_rcache->{loaded, prev} = NULL.

Signed-off-by: John Garry 
Tested-by: Xiang Chen 
Reviewed-by: Zhen Lei 



---
  drivers/iommu/iova.c | 29 +
  1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cf1aacda2fe4..732ee687e0e2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -767,14 +767,18 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
mag->size = 0;
  }
  
-static bool iova_magazine_full(struct iova_magazine *mag)

+static bool iova_magazine_has_space(struct iova_magazine *mag)
  {
-   return (mag && mag->size == IOVA_MAG_SIZE);
+   if (!mag)
+   return false;
+   return mag->size < IOVA_MAG_SIZE;
  }
  
-static bool iova_magazine_empty(struct iova_magazine *mag)

+static bool iova_magazine_has_pfns(struct iova_magazine *mag)
  {
-   return (!mag || mag->size == 0);
+   if (!mag)
+   return false;
+   return mag->size;
  }
  
  static unsigned long iova_magazine_pop(struct iova_magazine *mag,

@@ -783,7 +787,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine 
*mag,
int i;
unsigned long pfn;
  
-	BUG_ON(iova_magazine_empty(mag));

+   BUG_ON(!iova_magazine_has_pfns(mag));
  
  	/* Only fall back to the rbtree if we have no suitable pfns at all */

for (i = mag->size - 1; mag->pfns[i] > limit_pfn; i--)
@@ -799,7 +803,7 @@ static unsigned long iova_magazine_pop(struct iova_magazine 
*mag,
  
  static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn)

  {
-   BUG_ON(iova_magazine_full(mag));
+   BUG_ON(!iova_magazine_has_space(mag));
  
  	mag->pfns[mag->size++] = pfn;

  }
@@ -845,9 +849,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
cpu_rcache = raw_cpu_ptr(rcache->cpu_rcaches);
spin_lock_irqsave(_rcache->lock, flags);
  
-	if (!iova_magazine_full(cpu_rcache->loaded)) {

+   if (iova_magazine_has_space(cpu_rcache->loaded)) {
can_insert = true;
-   } else if (!iova_magazine_full(cpu_rcache->prev)) {
+   } else if (iova_magazine_has_space(cpu_rcache->prev)) {
swap(cpu_rcache->prev, cpu_rcache->loaded);
can_insert = true;
} else {
@@ -856,8 +860,9 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
if (new_mag) {
spin_lock(>lock);
if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-   rcache->depot[rcache->depot_size++] =
-   cpu_rcache->loaded;
+   if (cpu_rcache->loaded)
+   rcache->depot[rcache->depot_size++] =
+   cpu_rcache->loaded;

Re: [PATCH] iommu: check for the deferred attach when attaching a device

2021-01-18 Thread lijiang
在 2021年01月15日 23:15, Robin Murphy 写道:
> On 2021-01-15 14:26, lijiang wrote:
>> Hi, Robin
>>
>> Thank you for the comment.
>>
>> 在 2021年01月13日 01:29, Robin Murphy 写道:
>>> On 2021-01-05 07:52, lijiang wrote:
 在 2021年01月05日 11:55, lijiang 写道:
> Hi,
>
> Also add Joerg to cc list.
>

 Also add more people to cc list, Jerry Snitselaar and Tom Lendacky.

 Thanks.

> Thanks.
> Lianbo
> 在 2020年12月26日 13:39, Lianbo Jiang 写道:
>> Currently, because domain attach allows to be deferred from iommu
>> driver to device driver, and when iommu initializes, the devices
>> on the bus will be scanned and the default groups will be allocated.
>>
>> Due to the above changes, some devices could be added to the same
>> group as below:
>>
>> [    3.859417] pci :01:00.0: Adding to iommu group 16
>> [    3.864572] pci :01:00.1: Adding to iommu group 16
>> [    3.869738] pci :02:00.0: Adding to iommu group 17
>> [    3.874892] pci :02:00.1: Adding to iommu group 17
>>
>> But when attaching these devices, it doesn't allow that a group has
>> more than one device, otherwise it will return an error. This conflicts
>> with the deferred attaching. Unfortunately, it has two devices in the
>> same group for my side, for example:
>>
>> [    9.627014] iommu_group_device_count(): device name[0]::01:00.0
>> [    9.633545] iommu_group_device_count(): device name[1]::01:00.1
>> ...
>> [   10.255609] iommu_group_device_count(): device name[0]::02:00.0
>> [   10.262144] iommu_group_device_count(): device name[1]::02:00.1
>>
>> Finally, which caused the failure of tg3 driver when tg3 driver calls
>> the dma_alloc_coherent() to allocate coherent memory in the 
>> tg3_test_dma().
>>
>> [    9.660310] tg3 :01:00.0: DMA engine test failed, aborting
>> [    9.754085] tg3: probe of :01:00.0 failed with error -12
>> [    9.997512] tg3 :01:00.1: DMA engine test failed, aborting
>> [   10.043053] tg3: probe of :01:00.1 failed with error -12
>> [   10.288905] tg3 :02:00.0: DMA engine test failed, aborting
>> [   10.334070] tg3: probe of :02:00.0 failed with error -12
>> [   10.578303] tg3 :02:00.1: DMA engine test failed, aborting
>> [   10.622629] tg3: probe of :02:00.1 failed with error -12
>>
>> In addition, the similar situations also occur in other drivers such
>> as the bnxt_en driver. That can be reproduced easily in kdump kernel
>> when SME is active.
>>
>> Add a check for the deferred attach in the iommu_attach_device() and
>> allow to attach the deferred device regardless of how many devices
>> are in a group.
>>>
>>> Is this iommu_attach_device() call is coming from iommu-dma? (if not, then 
>>> whoever's calling it probably shouldn't be)
>>>
>>
>> Yes, you are right, the iommu_attach_device call is coming from iommu-dma.
>>  
>>> Assuming so, then probably what should happen is to move the handling 
>>> currently in iommu_dma_deferred_attach() into the core so that it can call 
>>> __iommu_attach_device() directly - the intent is just to replay that exact 
>>> call skipped in iommu_group_add_device(), so the legacy external 
>>> iommu_attach_device() interface isn't really the right tool for the job
>>
>> Sounds good. I will check if this can work in various cases. If it's OK, I 
>> will post again.
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index f0305e6aac1b..5e7da902ac36 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -23,7 +23,6 @@
>>   #include 
>>   #include 
>>   #include 
>> -#include 
>>   #include 
>>     struct iommu_dma_msi_page {
>> @@ -378,21 +377,6 @@ static int iommu_dma_init_domain(struct iommu_domain 
>> *domain, dma_addr_t base,
>>   return iova_reserve_iommu_regions(dev, domain);
>>   }
>>   -static int iommu_dma_deferred_attach(struct device *dev,
>> -    struct iommu_domain *domain)
>> -{
>> -    const struct iommu_ops *ops = domain->ops;
>> -
>> -    if (!is_kdump_kernel())
>> -    return 0;
>> -
>> -    if (unlikely(ops->is_attach_deferred &&
>> -    ops->is_attach_deferred(domain, dev)))
>> -    return iommu_attach_device(domain, dev);
>> -
>> -    return 0;
>> -}
>> -
>>   /**
>>    * dma_info_to_prot - Translate DMA API directions and attributes to IOMMU 
>> API
>>    *    page flags.
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index ffeebda8d6de..4fed1567b498 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -23,6 +23,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   #include 
>>     static struct kset *iommu_group_kset;
>> @@ -1952,6 +1953,21 @@ static int __iommu_attach_device(struct iommu_domain 
>> *domain,
>>   return ret;
>>   }
>>   +int