[PATCH AUTOSEL 5.4 37/63] iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 7c8f176d6a3fa18aa0f8875da6f7c672ed2a8554 ]

The reference counting issue happens in several exception handling paths
of arm_smmu_iova_to_phys_hard(). When those error scenarios occur, the
function forgets to decrease the refcount of "smmu" increased by
arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by jumping to "out" label when those error scenarios
occur.
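
In miniature, the bug class and the fix look like this (a generic
sketch of the pattern only, not the driver code itself; hw_op() is a
hypothetical operation that needs the SMMU powered up):

	static int do_work(struct arm_smmu_device *smmu)
	{
		int ret = arm_smmu_rpm_get(smmu);

		if (ret < 0)
			return ret;		/* nothing held yet */

		ret = hw_op(smmu);		/* hypothetical */
		if (ret)
			goto out;		/* was "return ret;", which
						   leaked the reference */
	out:
		arm_smmu_rpm_put(smmu);		/* balanced on every path */
		return ret;
	}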

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Reviewed-by: Rob Clark 
Link: https://lore.kernel.org/r/1623293391-17261-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm-smmu.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index abf4cf285548..2185ea5191c1 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1231,6 +1231,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
u64 phys;
unsigned long va, flags;
int ret, idx = cfg->cbndx;
+   phys_addr_t addr = 0;
 
ret = arm_smmu_rpm_get(smmu);
if (ret < 0)
@@ -1249,6 +1250,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
dev_err(dev,
"iova to phys timed out on %pad. Falling back to 
software table walk.\n",
&iova);
+   arm_smmu_rpm_put(smmu);
return ops->iova_to_phys(ops, iova);
}
 
@@ -1257,12 +1259,14 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
if (phys & CB_PAR_F) {
dev_err(dev, "translation fault!\n");
dev_err(dev, "PAR = 0x%llx\n", phys);
-   return 0;
+   goto out;
}
 
+   addr = (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+out:
arm_smmu_rpm_put(smmu);
 
-   return (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+   return addr;
 }
 
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
-- 
2.30.2


[PATCH AUTOSEL 5.4 36/63] iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 1adf30f198c26539a62d761e45af72cde570413d ]

arm_smmu_rpm_get() invokes pm_runtime_get_sync(), which increases the
refcount of the "smmu" even though the return value is less than 0.

The reference counting issue happens in some error handling paths of
arm_smmu_rpm_get() in its caller functions. When arm_smmu_rpm_get()
fails, the caller functions forget to decrease the refcount of "smmu"
increased by arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by calling pm_runtime_resume_and_get() instead of
pm_runtime_get_sync() in arm_smmu_rpm_get(), which can keep the refcount
balanced in case of failure.
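
For reference, pm_runtime_resume_and_get() differs from
pm_runtime_get_sync() only in that it drops the usage counter itself
when the resume fails, so callers need no put on the error path. From
memory of include/linux/pm_runtime.h (worth checking in the tree being
backported to), it is roughly:

	static inline int pm_runtime_resume_and_get(struct device *dev)
	{
		int ret;

		ret = __pm_runtime_resume(dev, RPM_GET_PUT);
		if (ret < 0) {
			pm_runtime_put_noidle(dev);
			return ret;
		}

		return 0;
	}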

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Link: https://lore.kernel.org/r/1623293672-17954-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7c503a6bc585..abf4cf285548 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -114,7 +114,7 @@ static bool using_legacy_binding, using_generic_binding;
 static inline int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 {
if (pm_runtime_enabled(smmu->dev))
-   return pm_runtime_get_sync(smmu->dev);
+   return pm_runtime_resume_and_get(smmu->dev);
 
return 0;
 }
-- 
2.30.2


[PATCH AUTOSEL 5.10 57/93] iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 7c8f176d6a3fa18aa0f8875da6f7c672ed2a8554 ]

The reference counting issue happens in several exception handling paths
of arm_smmu_iova_to_phys_hard(). When those error scenarios occur, the
function forgets to decrease the refcount of "smmu" increased by
arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by jumping to "out" label when those error scenarios
occur.

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Reviewed-by: Rob Clark 
Link: https://lore.kernel.org/r/1623293391-17261-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 052f0a1bf037..df24bbe3ea4f 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1284,6 +1284,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
u64 phys;
unsigned long va, flags;
int ret, idx = cfg->cbndx;
+   phys_addr_t addr = 0;
 
ret = arm_smmu_rpm_get(smmu);
if (ret < 0)
@@ -1303,6 +1304,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
dev_err(dev,
"iova to phys timed out on %pad. Falling back to 
software table walk.\n",
&iova);
+   arm_smmu_rpm_put(smmu);
return ops->iova_to_phys(ops, iova);
}
 
@@ -1311,12 +1313,14 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
if (phys & ARM_SMMU_CB_PAR_F) {
dev_err(dev, "translation fault!\n");
dev_err(dev, "PAR = 0x%llx\n", phys);
-   return 0;
+   goto out;
}
 
+   addr = (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+out:
arm_smmu_rpm_put(smmu);
 
-   return (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+   return addr;
 }
 
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
-- 
2.30.2


[PATCH AUTOSEL 5.10 56/93] iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 1adf30f198c26539a62d761e45af72cde570413d ]

arm_smmu_rpm_get() invokes pm_runtime_get_sync(), which increases the
refcount of the "smmu" even though the return value is less than 0.

The reference counting issue happens in some error handling paths of
arm_smmu_rpm_get() in its caller functions. When arm_smmu_rpm_get()
fails, the caller functions forget to decrease the refcount of "smmu"
increased by arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by calling pm_runtime_resume_and_get() instead of
pm_runtime_get_sync() in arm_smmu_rpm_get(), which can keep the refcount
balanced in case of failure.

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Link: https://lore.kernel.org/r/1623293672-17954-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index bcbacf22331d..052f0a1bf037 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -74,7 +74,7 @@ static bool using_legacy_binding, using_generic_binding;
 static inline int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 {
if (pm_runtime_enabled(smmu->dev))
-   return pm_runtime_get_sync(smmu->dev);
+   return pm_runtime_resume_and_get(smmu->dev);
 
return 0;
 }
-- 
2.30.2


[PATCH AUTOSEL 5.12 061/104] iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 7c8f176d6a3fa18aa0f8875da6f7c672ed2a8554 ]

The reference counting issue happens in several exception handling paths
of arm_smmu_iova_to_phys_hard(). When those error scenarios occur, the
function forgets to decrease the refcount of "smmu" increased by
arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by jumping to "out" label when those error scenarios
occur.

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Reviewed-by: Rob Clark 
Link: https://lore.kernel.org/r/1623293391-17261-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 128c2c87b4e5..c6ff32797a23 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1268,6 +1268,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
u64 phys;
unsigned long va, flags;
int ret, idx = cfg->cbndx;
+   phys_addr_t addr = 0;
 
ret = arm_smmu_rpm_get(smmu);
if (ret < 0)
@@ -1287,6 +1288,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
dev_err(dev,
"iova to phys timed out on %pad. Falling back to 
software table walk.\n",
&iova);
+   arm_smmu_rpm_put(smmu);
return ops->iova_to_phys(ops, iova);
}
 
@@ -1295,12 +1297,14 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
if (phys & ARM_SMMU_CB_PAR_F) {
dev_err(dev, "translation fault!\n");
dev_err(dev, "PAR = 0x%llx\n", phys);
-   return 0;
+   goto out;
}
 
+   addr = (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+out:
arm_smmu_rpm_put(smmu);
 
-   return (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+   return addr;
 }
 
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
-- 
2.30.2


[PATCH AUTOSEL 5.12 060/104] iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 1adf30f198c26539a62d761e45af72cde570413d ]

arm_smmu_rpm_get() invokes pm_runtime_get_sync(), which increases the
refcount of the "smmu" even though the return value is less than 0.

The reference counting issue happens in some error handling paths of
arm_smmu_rpm_get() in its caller functions. When arm_smmu_rpm_get()
fails, the caller functions forget to decrease the refcount of "smmu"
increased by arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by calling pm_runtime_resume_and_get() instead of
pm_runtime_get_sync() in arm_smmu_rpm_get(), which can keep the refcount
balanced in case of failure.

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Link: https://lore.kernel.org/r/1623293672-17954-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfde6a61..128c2c87b4e5 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -74,7 +74,7 @@ static bool using_legacy_binding, using_generic_binding;
 static inline int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 {
if (pm_runtime_enabled(smmu->dev))
-   return pm_runtime_get_sync(smmu->dev);
+   return pm_runtime_resume_and_get(smmu->dev);
 
return 0;
 }
-- 
2.30.2


[PATCH AUTOSEL 5.12 056/104] iommu/arm-smmu-qcom: Skip the TTBR1 quirk for db820c.

2021-07-09 Thread Sasha Levin
From: Eric Anholt 

[ Upstream commit a242f4297cfe3f4589a7620dcd42cc503607fc6b ]

db820c wants to use the qcom smmu path to get HUPCF set (which keeps
the GPU from wedging and then sometimes wedging the kernel after a
page fault), but it doesn't have separate pagetables support yet in
drm/msm so we can't go all the way to the TTBR1 path.

Signed-off-by: Eric Anholt 
Reviewed-by: Bjorn Andersson 
Link: https://lore.kernel.org/r/20210326231303.3071950-1-e...@anholt.net
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..44a427833385 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -130,6 +130,16 @@ static int qcom_adreno_smmu_alloc_context_bank(struct arm_smmu_domain *smmu_doma
return __arm_smmu_alloc_bitmap(smmu->context_map, start, count);
 }
 
+static bool qcom_adreno_can_do_ttbr1(struct arm_smmu_device *smmu)
+{
+   const struct device_node *np = smmu->dev->of_node;
+
+   if (of_device_is_compatible(np, "qcom,msm8996-smmu-v2"))
+   return false;
+
+   return true;
+}
+
 static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
struct io_pgtable_cfg *pgtbl_cfg, struct device *dev)
 {
@@ -144,7 +154,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 * be AARCH64 stage 1 but double check because the arm-smmu code assumes
 * that is the case when the TTBR1 quirk is enabled
 */
-   if ((smmu_domain->stage == ARM_SMMU_DOMAIN_S1) &&
+   if (qcom_adreno_can_do_ttbr1(smmu_domain->smmu) &&
+   (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) &&
(smmu_domain->cfg.fmt == ARM_SMMU_CTX_FMT_AARCH64))
pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
 
-- 
2.30.2


[PATCH AUTOSEL 5.13 065/114] iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 1adf30f198c26539a62d761e45af72cde570413d ]

arm_smmu_rpm_get() invokes pm_runtime_get_sync(), which increases the
refcount of the "smmu" even though the return value is less than 0.

The reference counting issue happens in some error handling paths of
arm_smmu_rpm_get() in its caller functions. When arm_smmu_rpm_get()
fails, the caller functions forget to decrease the refcount of "smmu"
increased by arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by calling pm_runtime_resume_and_get() instead of
pm_runtime_get_sync() in arm_smmu_rpm_get(), which can keep the refcount
balanced in case of failure.

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Link: https://lore.kernel.org/r/1623293672-17954-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..ee6cac9e7c02 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -74,7 +74,7 @@ static bool using_legacy_binding, using_generic_binding;
 static inline int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 {
if (pm_runtime_enabled(smmu->dev))
-   return pm_runtime_get_sync(smmu->dev);
+   return pm_runtime_resume_and_get(smmu->dev);
 
return 0;
 }
-- 
2.30.2


[PATCH AUTOSEL 5.13 066/114] iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation

2021-07-09 Thread Sasha Levin
From: Xiyu Yang 

[ Upstream commit 7c8f176d6a3fa18aa0f8875da6f7c672ed2a8554 ]

The reference counting issue happens in several exception handling paths
of arm_smmu_iova_to_phys_hard(). When those error scenarios occur, the
function forgets to decrease the refcount of "smmu" increased by
arm_smmu_rpm_get(), causing a refcount leak.

Fix this issue by jumping to "out" label when those error scenarios
occur.

Signed-off-by: Xiyu Yang 
Signed-off-by: Xin Tan 
Reviewed-by: Rob Clark 
Link: https://lore.kernel.org/r/1623293391-17261-1-git-send-email-xiyuyan...@fudan.edu.cn
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index ee6cac9e7c02..1a647e0ea3eb 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1271,6 +1271,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
u64 phys;
unsigned long va, flags;
int ret, idx = cfg->cbndx;
+   phys_addr_t addr = 0;
 
ret = arm_smmu_rpm_get(smmu);
if (ret < 0)
@@ -1290,6 +1291,7 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
dev_err(dev,
"iova to phys timed out on %pad. Falling back to 
software table walk.\n",
&iova);
+   arm_smmu_rpm_put(smmu);
return ops->iova_to_phys(ops, iova);
}
 
@@ -1298,12 +1300,14 @@ static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
if (phys & ARM_SMMU_CB_PAR_F) {
dev_err(dev, "translation fault!\n");
dev_err(dev, "PAR = 0x%llx\n", phys);
-   return 0;
+   goto out;
}
 
+   addr = (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+out:
arm_smmu_rpm_put(smmu);
 
-   return (phys & GENMASK_ULL(39, 12)) | (iova & 0xfff);
+   return addr;
 }
 
 static phys_addr_t arm_smmu_iova_to_phys(struct iommu_domain *domain,
-- 
2.30.2


[PATCH AUTOSEL 5.13 061/114] iommu/arm-smmu-qcom: Skip the TTBR1 quirk for db820c.

2021-07-09 Thread Sasha Levin
From: Eric Anholt 

[ Upstream commit a242f4297cfe3f4589a7620dcd42cc503607fc6b ]

db820c wants to use the qcom smmu path to get HUPCF set (which keeps
the GPU from wedging and then sometimes wedging the kernel after a
page fault), but it doesn't have separate pagetables support yet in
drm/msm so we can't go all the way to the TTBR1 path.

Signed-off-by: Eric Anholt 
Reviewed-by: Bjorn Andersson 
Link: https://lore.kernel.org/r/20210326231303.3071950-1-e...@anholt.net
Signed-off-by: Will Deacon 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..44a427833385 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -130,6 +130,16 @@ static int qcom_adreno_smmu_alloc_context_bank(struct arm_smmu_domain *smmu_doma
return __arm_smmu_alloc_bitmap(smmu->context_map, start, count);
 }
 
+static bool qcom_adreno_can_do_ttbr1(struct arm_smmu_device *smmu)
+{
+   const struct device_node *np = smmu->dev->of_node;
+
+   if (of_device_is_compatible(np, "qcom,msm8996-smmu-v2"))
+   return false;
+
+   return true;
+}
+
 static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
struct io_pgtable_cfg *pgtbl_cfg, struct device *dev)
 {
@@ -144,7 +154,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 * be AARCH64 stage 1 but double check because the arm-smmu code assumes
 * that is the case when the TTBR1 quirk is enabled
 */
-   if ((smmu_domain->stage == ARM_SMMU_DOMAIN_S1) &&
+   if (qcom_adreno_can_do_ttbr1(smmu_domain->smmu) &&
+   (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) &&
(smmu_domain->cfg.fmt == ARM_SMMU_CTX_FMT_AARCH64))
pgtbl_cfg->quirks |= IO_PGTABLE_QUIRK_ARM_TTBR1;
 
-- 
2.30.2


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-09 Thread Alex Williamson
Hi Kevin,

A couple first pass comments...

On Fri, 9 Jul 2021 07:48:44 +
"Tian, Kevin"  wrote:
> 2.2. /dev/vfio device uAPI
> ++
> 
> /*
>   * Bind a vfio_device to the specified IOMMU fd
>   *
>   * The user should provide a device cookie when calling this ioctl. The 
>   * cookie is later used in IOMMU fd for capability query, iotlb invalidation
>   * and I/O fault handling.
>   *
>   * User is not allowed to access the device before the binding operation
>   * is completed.
>   *
>   * Unbind is automatically conducted when device fd is closed.
>   *
>   * Input parameters:
>   *   - iommu_fd;
>   *   - cookie;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOMMU_FD	_IO(VFIO_TYPE, VFIO_BASE + 22)

I believe this is an ioctl on the device fd, therefore it should be
named VFIO_DEVICE_BIND_IOMMU_FD.

> 
> 
> /*
>   * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
>   *
>   * Add a new device capability. The presence indicates that the user
>   * is allowed to create multiple I/O address spaces on this device. The
>   * capability further includes following flags:
>   *
>   *   - PASID_DELEGATED, if clear every vPASID must be registered to 
>   * the kernel;
>   *   - PASID_CPU, if set vPASID is allowed to be carried in the CPU 
>   * instructions (e.g. ENQCMD);
>   *   - PASID_CPU_VIRT, if set require vPASID translation in the CPU; 
>   * 
>   * The user must check that all devices with PASID_CPU set have the 
>   * same setting on PASID_CPU_VIRT. If mismatching, it should enable 
>   * vPASID only in one category (all set, or all clear).
>   *
>   * When the user enables vPASID on the device with PASID_CPU_VIRT
>   * set, it must enable vPASID CPU translation via kvm fd before attempting
>   * to use ENQCMD to submit work items. The command portal is blocked 
>   * by the kernel until the CPU translation is enabled.
>   */
> #define VFIO_DEVICE_INFO_CAP_PASID	5
> 
> 
> /*
>   * Attach a vfio device to the specified IOASID
>   *
>   * Multiple vfio devices can be attached to the same IOASID, and vice 
>   * versa. 
>   *
>   * User may optionally provide a "virtual PASID" to mark an I/O page 
>   * table on this vfio device, if PASID_DELEGATED is not set in device info. 
>   * Whether the virtual PASID is physically used or converted to another 
>   * kernel-allocated PASID is a policy in the kernel.
>   *
>   * Because one device is allowed to bind to multiple IOMMU fd's, the
>   * user should provide both iommu_fd and ioasid for this attach operation.
>   *
>   * Input parameter:
>   *   - iommu_fd;
>   *   - ioasid;
>   *   - flag;
>   *   - vpasid (if specified);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_ATTACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 23)
> #define VFIO_DETACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 24)

Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID

...
> 3. Sample structures and helper functions
> 
> 
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> 
>   struct iommu_ctx *iommu_ctx_fdget(int fd);
>   struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
>   struct device *device, u64 cookie);
>   int iommu_unregister_device(struct iommu_dev *dev);
> 
> An iommu_ctx is created for each fd:
> 
>   struct iommu_ctx {
>   // a list of allocated IOASID data's
>   struct xarray   ioasid_xa;
> 
>   // a list of registered devices
>   struct xarray   dev_xa;
>   };
> 
> Later some group-tracking fields will be also introduced to support 
> multi-devices group.
> 
> Each registered device is represented by iommu_dev:
> 
>   struct iommu_dev {
>   struct iommu_ctx*ctx;
>   // always be the physical device
>   struct device   *device;
>   u64 cookie;
>   struct kref kref;
>   };
> 
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.

If we have an initial singleton-group-only restriction, I assume both
that iommu_register_device() would fail for any device that is not in
a singleton group and that vfio would only expose direct device files
for the devices in singleton groups.  The latter implementation could
change when multi-device group support is added, so that userspace can
assume that if the vfio device file exists, this interface is available.
I think this is confirmed further below.

> For mdev the struct device should be the pointer to the parent device. 

I don't get how iommu_register_device() differentiates an mdev from a
pdev in this case.

...
> 4.3. IOASID nesting (software)
> 

Re: [PATCH v2 0/3] iommu: Enable non-strict DMA on QCom SD/MMC

2021-07-09 Thread Robin Murphy

On 2021-07-08 09:08, Joerg Roedel wrote:

On Wed, Jul 07, 2021 at 01:00:13PM -0700, Doug Anderson wrote:

a) Nothing is inherently broken with my current approach.

b) My current approach doesn't make anybody terribly upset even if
nobody is totally in love with it.


Well, no, sorry :)

I don't think it is a good idea to allow drivers to opt-out of the
strict-setting. This is a platform or user decision, and the driver
should accept whatever it gets.

So the real question is still why strict is the default setting and how
to change that.


It's occurred to me whilst hacking on the relevant area that there's an 
important point I may have somewhat glossed over there: most of the 
IOMMU drivers that are used for arm64 do not take advantage of 
non-strict mode anyway. If anything it would be detrimental, since 
iommu-dma would waste a bunch of time and memory managing flush queues 
and firing off the batch invalidations while internally the drivers are 
still invalidating each unmap synchronously.
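
(For readers unfamiliar with the mechanism: non-strict mode batches
freed IOVAs in a flush queue and amortises the IOTLB invalidation,
conceptually like the sketch below -- illustrative names, not the real
iommu-dma internals. The saving only materialises if the driver can
actually defer its invalidations; if it syncs on every unmap anyway,
the queueing is pure overhead.)

	#define FQ_SIZE 128

	struct flush_queue {
		unsigned long iovas[FQ_SIZE];
		int n;
	};

	static void fq_free_iova(struct flush_queue *fq, unsigned long iova)
	{
		fq->iovas[fq->n++] = iova;
		if (fq->n == FQ_SIZE) {
			iommu_flush_all();	/* hypothetical batched inval */
			fq->n = 0;		/* IOVAs now safe to reuse */
		}
	}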


Those IOMMUs in mobile and embedded SoCs are also mostly used for media 
devices, where the buffers are relatively large and change relatively 
infrequently, so they are less likely to gain significantly from 
supporting non-strict mode. It's primarily the Arm SMMUs which get used 
in the more "x86-like" paradigm (especially in larger systems) of being 
stuck in front of everything including networking/storage/PCIe/etc. 
where the workloads are far more varied.


Robin.


Or document for the users that want performance how to
change the setting, so that they can decide.

Regards,

Joerg



[PATCH 4/4] drm/i915/fbc: Allow FBC + VT-d on SKL/BXT

2021-07-09 Thread Ville Syrjala
From: Ville Syrjälä 

With the iommu driver disabling VT-d superpage it should be
safe to use FBC on SKL/BXT with VT-d otherwise enabled.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ville Syrjälä 
---
 drivers/gpu/drm/i915/display/intel_fbc.c | 16 
 1 file changed, 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_fbc.c b/drivers/gpu/drm/i915/display/intel_fbc.c
index 82effb64a3b9..de44f93a33d0 100644
--- a/drivers/gpu/drm/i915/display/intel_fbc.c
+++ b/drivers/gpu/drm/i915/display/intel_fbc.c
@@ -1448,19 +1448,6 @@ static int intel_sanitize_fbc_option(struct drm_i915_private *dev_priv)
return 0;
 }
 
-static bool need_fbc_vtd_wa(struct drm_i915_private *dev_priv)
-{
-   /* WaFbcTurnOffFbcWhenHyperVisorIsUsed:skl,bxt */
-   if (intel_vtd_active() &&
-   (IS_SKYLAKE(dev_priv) || IS_BROXTON(dev_priv))) {
-   drm_info(&dev_priv->drm,
-"Disabling framebuffer compression (FBC) to prevent 
screen flicker with VT-d enabled\n");
-   return true;
-   }
-
-   return false;
-}
-
 /**
  * intel_fbc_init - Initialize FBC
  * @dev_priv: the i915 device
@@ -1478,9 +1465,6 @@ void intel_fbc_init(struct drm_i915_private *dev_priv)
if (!drm_mm_initialized(&dev_priv->mm.stolen))
mkwrite_device_info(dev_priv)->display.has_fbc = false;
 
-   if (need_fbc_vtd_wa(dev_priv))
-   mkwrite_device_info(dev_priv)->display.has_fbc = false;
-
dev_priv->params.enable_fbc = intel_sanitize_fbc_option(dev_priv);
drm_dbg_kms(&dev_priv->drm, "Sanitized enable_fbc value: %d\n",
dev_priv->params.enable_fbc);
-- 
2.31.1

[PATCH 3/4] iommu/vt-d: Disable superpage for Skylake igfx

2021-07-09 Thread Ville Syrjala
From: Ville Syrjälä 

Skylake has known issues with VT-d superpage. Namely frame buffer
compression (FBC) can't be safely used when superpage is enabled.
Currently we're disabling FBC entirely when VT-d is active, but
I think just disabling superpage would be better since FBC can
save some power.

TODO: might be nice to disable superpage only for the igfx iommu
  instead of both iommus
TODO: would be nice to use the macros from include/drm/i915_pciids.h,
  but can't do that with DECLARE_PCI_FIXUP_HEADER()
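
(For context on the second TODO: the i915_pciids.h macros expand to
struct pci_device_id table entries -- from memory, roughly:

	#define INTEL_VGA_DEVICE(id, info) {		\
		0x8086, id,				\
		~0, ~0,					\
		0x030000, 0xff0000,			\
		(unsigned long) info }

whereas DECLARE_PCI_FIXUP_HEADER(vendor, device, hook) needs one
discrete vendor/device pair per invocation, so the ID lists cannot be
reused directly.)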

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ville Syrjälä 
---
 drivers/iommu/intel/iommu.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 40117f868761..14f951ca4799 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5623,6 +5623,33 @@ static void quirk_iommu_nosp(struct pci_dev *dev)
intel_iommu_superpage = 0;
 }
 
+/* Skylake igfx has issues with superpage */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1906, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1913, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x190E, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1915, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1902, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x190A, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x190B, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1917, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1916, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1921, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x191E, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1912, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x191A, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x191B, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x191D, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1923, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1926, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1927, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x192A, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x192B, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x192D, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1932, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x193A, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x193B, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x193D, quirk_iommu_nosp);
+
 /* Broxton igfx has issues with superpage */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0A84, quirk_iommu_nosp);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1A84, quirk_iommu_nosp);
-- 
2.31.1

[PATCH 2/4] iommu/vt-d: Disable superpage for Broxton igfx

2021-07-09 Thread Ville Syrjala
From: Ville Syrjälä 

Broxton has known issues with VT-d superpage. Namely frame buffer
compression (FBC) can't be safely used when superpage is enabled.
Currently we're disabling FBC entirely when VT-d is active, but
I think just disabling superpage would be better since FBC can
save some power.

TODO: might be nice to disable superpage only for the igfx iommu
  instead of both iommus
TODO: would be nice to use the macros from include/drm/i915_pciids.h,
  but can't do that with DECLARE_PCI_FIXUP_HEADER()

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ville Syrjälä 
---
 drivers/iommu/intel/iommu.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4fff2c9c86af..40117f868761 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5623,6 +5623,13 @@ static void quirk_iommu_nosp(struct pci_dev *dev)
intel_iommu_superpage = 0;
 }
 
+/* Broxton igfx has issues with superpage */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0A84, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1A84, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1A85, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x5A84, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x5A85, quirk_iommu_nosp);
+
 /* Geminilake igfx appears to have issues with superpage */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x3184, quirk_iommu_nosp);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x3185, quirk_iommu_nosp);
-- 
2.31.1

[PATCH 1/4] iommu/vt-d: Disable superpage for Geminilake igfx

2021-07-09 Thread Ville Syrjala
From: Ville Syrjälä 

While running "gem_exec_big --r single" from igt-gpu-tools on
Geminilake as soon as a 2M mapping is made I tend to get a DMAR
write fault. Strangely the faulting address is always a 4K page
and usually very far away from the 2M page that got mapped.
But if no 2M mappings get used I can't reproduce the fault.

I also tried to dump the PTE for the faulting address but it actually
looks correct to me (ie. definitely seems to have the write bit set):
 DMAR: DRHD: handling fault status reg 2
 DMAR: [DMA Write] Request device [00:02.0] PASID  fault addr 7fa8a78000 [fault reason 05] PTE Write access is not set
 DMAR: fault 7fa8a78000 (level=1) PTE = 149efc003

So not really sure what's going on and this might just be full on duct
tape, but it seems to work here. The machine has now survived a whole day
running that test whereas with superpage enabled it fails in less than
a minute usually.

TODO: might be nice to disable superpage only for the igfx iommu
  instead of both iommus
TODO: would be nice to use the macros from include/drm/i915_pciids.h,
  but can't do that with DECLARE_PCI_FIXUP_HEADER()

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ville Syrjälä 
---
 drivers/iommu/intel/iommu.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 19c7888cbb86..4fff2c9c86af 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5617,6 +5617,16 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1632, quirk_iommu_igfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163A, quirk_iommu_igfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163D, quirk_iommu_igfx);
 
+static void quirk_iommu_nosp(struct pci_dev *dev)
+{
+   pci_info(dev, "Disabling IOMMU superpage for graphics on this chipset\n");
+   intel_iommu_superpage = 0;
+}
+
+/* Geminilake igfx appears to have issues with superpage */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x3184, quirk_iommu_nosp);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x3185, quirk_iommu_nosp);
+
 static void quirk_iommu_rwbf(struct pci_dev *dev)
 {
if (risky_device(dev))
-- 
2.31.1

[PATCH 0/4] iommu/vt-d: Disable igfx iommu superpage on bxt/skl/glk

2021-07-09 Thread Ville Syrjala
From: Ville Syrjälä 

I ran into some kind of fail with VT-d superpage on Geminlake igfx,
so without any better ideas let's just disable it.

Additionally Skylake/Broxton igfx have known issues with VT-d
superpage as well, so let's disable it there as well. This should
let us re-enable frame buffer compression (FBC) for some extra
power savings when the display is on.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org

Ville Syrjälä (4):
  iommu/vt-d: Disable superpage for Geminilake igfx
  iommu/vt-d: Disable superpage for Broxton igfx
  iommu/vt-d: Disable superpage for Skylake igfx
  drm/i915/fbc: Allow FBC + VT-d on SKL/BXT

 drivers/gpu/drm/i915/display/intel_fbc.c | 16 -
 drivers/iommu/intel/iommu.c  | 44 
 2 files changed, 44 insertions(+), 16 deletions(-)

-- 
2.31.1

Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread Ming Lei
On Fri, Jul 09, 2021 at 11:26:53AM +0100, Robin Murphy wrote:
> On 2021-07-09 09:38, Ming Lei wrote:
> > Hello,
> > 
> > I observed that NVMe performance is very bad when running fio on one
> > CPU(aarch64) in remote numa node compared with the nvme pci numa node.
> > 
> > Please see the test result[1] 327K vs. 34.9K.
> > 
> > Latency trace shows that one big difference is in iommu_dma_unmap_sg(),
> >  nsecs vs 25437 nsecs.
> 
> Are you able to dig down further into that? iommu_dma_unmap_sg() itself
> doesn't do anything particularly special, so whatever makes a difference is
> probably happening at a lower level, and I suspect there's probably an SMMU
> involved. If for instance it turns out to go all the way down to
> __arm_smmu_cmdq_poll_until_consumed() because polling MMIO from the wrong
> node is slow, there's unlikely to be much you can do about that other than
> the global "go faster" knobs (iommu.strict and iommu.passthrough) with their
> associated compromises.

Follows the log of 'perf report'

1) good(run fio from cpus in the nvme's numa node)

-   34.86% 1.73%  fio   [nvme]  [k] nvme_process_cq
   - 33.13% nvme_process_cq
      - 32.93% nvme_pci_complete_rq
         - 24.92% nvme_unmap_data
            - 20.08% dma_unmap_sg_attrs
               - 19.79% iommu_dma_unmap_sg
                  - 19.55% __iommu_dma_unmap
                     - 16.86% arm_smmu_iotlb_sync
                        - 16.81% arm_smmu_tlb_inv_range_domain
                           - 14.73% __arm_smmu_tlb_inv_range
                                14.44% arm_smmu_cmdq_issue_cmdlist
                             0.89% __pi_memset
                             0.75% arm_smmu_atc_inv_domain
                     + 1.58% iommu_unmap_fast
                     + 0.71% iommu_dma_free_iova
            - 3.25% dma_unmap_page_attrs
               - 3.21% iommu_dma_unmap_page
                  - 3.14% __iommu_dma_unmap_swiotlb
                     - 2.86% __iommu_dma_unmap
                        - 2.48% arm_smmu_iotlb_sync
                           - 2.47% arm_smmu_tlb_inv_range_domain
                              - 2.19% __arm_smmu_tlb_inv_range
                                   2.16% arm_smmu_cmdq_issue_cmdlist
            + 1.34% mempool_free
         + 7.68% nvme_complete_rq
   + 1.73% _start


2) bad(run fio from cpus not in the nvme's numa node)
-   49.25% 3.03%  fio   [nvme]  [k] nvme_process_cq
   - 46.22% nvme_process_cq
      - 46.07% nvme_pci_complete_rq
         - 41.02% nvme_unmap_data
            - 34.92% dma_unmap_sg_attrs

Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread Ming Lei
On Fri, Jul 09, 2021 at 11:16:14AM +0100, Russell King (Oracle) wrote:
> On Fri, Jul 09, 2021 at 04:38:09PM +0800, Ming Lei wrote:
> > I observed that NVMe performance is very bad when running fio on one
> > CPU(aarch64) in remote numa node compared with the nvme pci numa node.
> 
> Have you checked the effect of running a memory-heavy process using
> memory from node 1 while being executed by CPUs in node 0?

1) aarch64
[root@ampere-mtjade-04 ~]# taskset -c 0 numactl -m 0  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

  11.511752 GB/sec
[root@ampere-mtjade-04 ~]# taskset -c 0 numactl -m 1  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

   3.084333 GB/sec


2) x86_64[1]
[root@hp-dl380g10-01 mingl]#  taskset -c 0 numactl -m 0  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

   4.193927 GB/sec
[root@hp-dl380g10-01 mingl]#  taskset -c 0 numactl -m 1  perf bench mem memcpy -s 4GB -f default
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 4GB bytes ...

   3.553392 GB/sec


[1] on this x86_64 machine, IOPS can reach 680K in same fio nvme test 



Thanks,
Ming


Re: [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions

2021-07-09 Thread Wei Liu
On Fri, Jul 09, 2021 at 10:17:25PM +0800, Lu Baolu wrote:
> On 2021/7/9 19:43, Wei Liu wrote:
> > When Microsoft Hypervisor runs on Intel platforms it needs to know the
> > reserved regions to program devices correctly. There is no reason to
> > duplicate intel_iommu_get_resv_regions. Export it.
> 
> Why not using iommu_get_resv_regions()?

That calls into ops->get_resv_regions.

In this patch series, get_resv_regions is hv_iommu_resv_regions, which
wants to use intel_iommu_get_resv_regions when it detects the underlying
hardware platform is from Intel.
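
Illustratively, that dispatch would look something like this (a
hypothetical sketch, not the actual series code):

	static void hv_iommu_resv_regions(struct device *dev,
					  struct list_head *head)
	{
		if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
			intel_iommu_get_resv_regions(dev, head);
		/* other vendors would hook in here similarly */
	}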

Wei.

Re: [RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions

2021-07-09 Thread Lu Baolu

On 2021/7/9 19:43, Wei Liu wrote:

When Microsoft Hypervisor runs on Intel platforms it needs to know the
reserved regions to program devices correctly. There is no reason to
duplicate intel_iommu_get_resv_regions. Export it.


Why not using iommu_get_resv_regions()?

Best regards,
baolu



Signed-off-by: Wei Liu 
---
  drivers/iommu/intel/iommu.c | 5 +++--
  include/linux/intel-iommu.h | 4 
  2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a4294d310b93..01973bc20080 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5176,8 +5176,8 @@ static void intel_iommu_probe_finalize(struct device *dev)
set_dma_ops(dev, NULL);
  }
  
-static void intel_iommu_get_resv_regions(struct device *device,
-					 struct list_head *head)
+void intel_iommu_get_resv_regions(struct device *device,
+				  struct list_head *head)
  {
int prot = DMA_PTE_READ | DMA_PTE_WRITE;
struct iommu_resv_region *reg;
@@ -5232,6 +5232,7 @@ static void intel_iommu_get_resv_regions(struct device *device,
return;
	list_add_tail(&reg->list, head);
  }
+EXPORT_SYMBOL_GPL(intel_iommu_get_resv_regions);
  
   int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
   {
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 03faf20a6817..f91869f765bc 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -814,6 +814,8 @@ extern int iommu_calculate_max_sagaw(struct intel_iommu *iommu);
  extern int dmar_disabled;
  extern int intel_iommu_enabled;
  extern int intel_iommu_gfx_mapped;
+extern void intel_iommu_get_resv_regions(struct device *device,
+					 struct list_head *head);
  #else
  static inline int iommu_calculate_agaw(struct intel_iommu *iommu)
  {
@@ -825,6 +827,8 @@ static inline int iommu_calculate_max_sagaw(struct intel_iommu *iommu)
  }
  #define dmar_disabled (1)
  #define intel_iommu_enabled (0)
+static inline void intel_iommu_get_resv_regions(struct device *device,
+						struct list_head *head) {}
  #endif
  
  #endif




Re: [PATCH v2 0/3] iommu: Enable non-strict DMA on QCom SD/MMC

2021-07-09 Thread Robin Murphy

On 2021-07-08 09:08, Joerg Roedel wrote:

On Wed, Jul 07, 2021 at 01:00:13PM -0700, Doug Anderson wrote:

a) Nothing is inherently broken with my current approach.

b) My current approach doesn't make anybody terribly upset even if
nobody is totally in love with it.


Well, no, sorry :)

I don't think it is a good idea to allow drivers to opt-out of the
strict-setting. This is a platform or user decision, and the driver
should accept whatever it gets.

So the real question is still why strict is the default setting and how
to change that. Or document for the users that want performance how to
change the setting, so that they can decide.


As I mentioned before, conceptually I think this very much belongs in 
sysfs as a user decision. We essentially have 4 levels of "strictness":


1: DMA domain with bounce pages
2: DMA domain
3: DMA domain with flush queue
4: Identity domain

The "make this device go faster because I trust it" use-case is why we 
have the sysfs interface to switch between 2 and 4, so it's entirely 
logical to have the intermediate option as well for when 3 is "faster" 
enough while still giving a bit more peace of mind than full-on bypass.


Making it a platform-specific decision that's hidden in a driver - 
arm-smmu-qcom can be considered a dumping ground of detailed platform 
knowledge ;) - happens to work as a reasonable compromise for this 
specific case, but I concur that it could be viewed as setting a 
precedent for other cases which definitely wouldn't be as reasonable.


I've been thinking about the sysfs thing some more, and since it's 
Friday afternoon and I can't concentrate on what I'm supposed to be 
doing anyway, let's see how far I can get by Monday...


Robin.


Re: [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible

2021-07-09 Thread Wei Liu
On Fri, Jul 09, 2021 at 01:56:46PM +0100, Robin Murphy wrote:
> On 2021-07-09 12:43, Wei Liu wrote:
> > Microsoft Hypervisor provides a set of hypercalls to manage device
> > domains. The root kernel should parse the DMAR so that it can program
> > the IOMMU (with hypercalls) correctly.
> > 
> > The DMAR code was designed to work with Intel IOMMU only. Add two more
> > parameters to make it useful to Microsoft Hypervisor. Microsoft
> > Hypervisor does not need the DMAR parsing code to allocate an Intel
> > IOMMU structure; it also wishes to always reparse the DMAR table even
> > after it has been parsed before.
> 
> We've recently defined the VIOT table for describing paravirtualised IOMMUs
> - would it make more sense to extend that to support the Microsoft
> implementation than to abuse a hardware-specific table? Am I right in

I searched for VIOT and believed I found the correct link
https://lwn.net/Articles/859291/. My understanding is based on the
reading of that series.

VIOT is useful. I think it solves the problem for guests.

It does not solve the problem we have though. The DMAR tables are not
conjured up by some backend software running on the host side. They are
the real tables provided by the firmware. The kernel here is part of the
host setup, dealing with physical hardware.

No matter how much I wish all vendors unified their tables, I don't see
how that's going to happen for readily available servers. :-(

> assuming said hypervisor isn't intended to only ever run on Intel hardware?

Yes, that's correct. We also plan to add support AMD and ARM64.

Wei.


Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU

2021-07-09 Thread Wei Liu
On Fri, Jul 09, 2021 at 01:46:19PM +0100, Robin Murphy wrote:
> On 2021-07-09 12:43, Wei Liu wrote:
> > Some devices may have been claimed by the hypervisor already. One such
> > example is a user can assign a NIC for debugging purpose.
> > 
> > Ideally Linux should be able to retrieve that information, but
> > there is no way to do that yet. And designing that new mechanism is
> > going to take time.
> > 
> > Provide a command line option for skipping devices. This is a stopgap
> > solution, so it is intentionally undocumented. Hopefully we can retire
> > it in the future.
> 
> Huh? If the host is using a device, why the heck is it exposing any
> knowledge of that device to the guest at all, let alone allowing the guest
> to do anything that could affect its operation!?

The host in this setup consists of the hypervisor, the root kernel and a
bunch of user space programs.

Root is not an ordinary guest. It does need to know all the hardware to
manage the platform. Hypervisor does not claim more devices than it
needs to, nor does it try to hide hardware details from the root.

The hypervisor can protect itself just fine. Any attempt to use the
already claimed devices will be blocked or rejected, so are the attempts
to attach them to device domains.

That, however, leads to some interesting interactions between the
hypervisor and Linux kernel.  When kernel initializes IOMMU during boot,
it will try to attach all devices in one go. Any failure there will
cause kernel to detach the already attached devices. That's not fatal to
kernel, and is only a minor annoyance to our current use case, because
the default domain is a passthrough domain anyway. It will become
problematic once we switch the default domain to a DMA domain to further
tighten security during Linux boot.

Wei.

> 
> Robin.
> 


Re: [RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible

2021-07-09 Thread Robin Murphy

On 2021-07-09 12:43, Wei Liu wrote:

Microsoft Hypervisor provides a set of hypercalls to manage device
domains. The root kernel should parse the DMAR so that it can program
the IOMMU (with hypercalls) correctly.

The DMAR code was designed to work with Intel IOMMU only. Add two more
parameters to make it useful to Microsoft Hypervisor. Microsoft
Hypervisor does not need the DMAR parsing code to allocate an Intel
IOMMU structure; it also wishes to always reparse the DMAR table even
after it has been parsed before.


We've recently defined the VIOT table for describing paravirtualised 
IOMMUs - would it make more sense to extend that to support the 
Microsoft implementation than to abuse a hardware-specific table? Am I 
right in assuming said hypervisor isn't intended to only ever run on 
Intel hardware?


Robin.


Adjust Intel IOMMU code to use the new dmar_table_init. There should be
no functional change to Intel IOMMU code.

Signed-off-by: Wei Liu 
---
We may be able to combine alloc and force_parse?
---
  drivers/iommu/intel/dmar.c  | 38 -
  drivers/iommu/intel/iommu.c |  2 +-
  drivers/iommu/intel/irq_remapping.c |  2 +-
  include/linux/dmar.h|  2 +-
  4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 84057cb9596c..bd72f47c728b 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -408,7 +408,8 @@ dmar_find_dmaru(struct acpi_dmar_hardware_unit *drhd)
   * structure which uniquely represent one DMA remapping hardware unit
   * present in the platform
   */
-static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
+static int dmar_parse_one_drhd_internal(struct acpi_dmar_header *header,
+					void *arg, bool alloc)
  {
struct acpi_dmar_hardware_unit *drhd;
struct dmar_drhd_unit *dmaru;
@@ -440,12 +441,14 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
return -ENOMEM;
}
  
-	ret = alloc_iommu(dmaru);
-	if (ret) {
-		dmar_free_dev_scope(&dmaru->devices,
-				    &dmaru->devices_cnt);
-		kfree(dmaru);
-		return ret;
+	if (alloc) {
+		ret = alloc_iommu(dmaru);
+		if (ret) {
+			dmar_free_dev_scope(&dmaru->devices,
+					    &dmaru->devices_cnt);
+			kfree(dmaru);
+			return ret;
+		}
}
dmar_register_drhd_unit(dmaru);
  
@@ -456,6 +459,16 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)

return 0;
  }
  
+static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
+{
+   return dmar_parse_one_drhd_internal(header, arg, true);
+}
+
+int dmar_parse_one_drhd_noalloc(struct acpi_dmar_header *header, void *arg)
+{
+   return dmar_parse_one_drhd_internal(header, arg, false);
+}
+
  static void dmar_free_drhd(struct dmar_drhd_unit *dmaru)
  {
if (dmaru->devices && dmaru->devices_cnt)
@@ -633,7 +646,7 @@ static inline int dmar_walk_dmar_table(struct acpi_table_dmar *dmar,
   * parse_dmar_table - parses the DMA reporting table
   */
  static int __init
-parse_dmar_table(void)
+parse_dmar_table(bool alloc)
  {
struct acpi_table_dmar *dmar;
int drhd_count = 0;
@@ -650,6 +663,9 @@ parse_dmar_table(void)
.cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc,
};
  
+	if (!alloc)
+		cb.cb[ACPI_DMAR_TYPE_HARDWARE_UNIT] = &dmar_parse_one_drhd_noalloc;
+
/*
 * Do it again, earlier dmar_tbl mapping could be mapped with
 * fixed map.
@@ -840,13 +856,13 @@ void __init dmar_register_bus_notifier(void)
  }
  
  
-int __init dmar_table_init(void)
+int __init dmar_table_init(bool alloc, bool force_parse)
  {
static int dmar_table_initialized;
int ret;
  
-	if (dmar_table_initialized == 0) {
-		ret = parse_dmar_table();
+	if (dmar_table_initialized == 0 || force_parse) {
+		ret = parse_dmar_table(alloc);
if (ret < 0) {
if (ret != -ENODEV)
pr_info("Parse DMAR table failure.\n");
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index be35284a2016..a4294d310b93 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4310,7 +4310,7 @@ int __init intel_iommu_init(void)
}
  
 	down_write(&dmar_global_lock);
-	if (dmar_table_init()) {
+	if (dmar_table_init(true, false)) {
if (force_on)
panic("tboot: Failed to initialize DMAR table\n");
goto out_free_dmar;
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index f912fe45bea2..0e8abef862e4 100644
--

Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU

2021-07-09 Thread Robin Murphy

On 2021-07-09 12:43, Wei Liu wrote:

Some devices may have been claimed by the hypervisor already. One such
example is a user can assign a NIC for debugging purpose.

Ideally Linux should be able to retrieve that information, but
there is no way to do that yet. And designing that new mechanism is
going to take time.

Provide a command line option for skipping devices. This is a stopgap
solution, so it is intentionally undocumented. Hopefully we can retire
it in the future.


Huh? If the host is using a device, why the heck is it exposing any 
knowledge of that device to the guest at all, let alone allowing the 
guest to do anything that could affect its operation!?


Robin.


Signed-off-by: Wei Liu 
---
  drivers/iommu/hyperv-iommu.c | 45 
  1 file changed, 45 insertions(+)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 043dcff06511..353da5036387 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
  
  #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
  
+/* The IOMMU will not claim these PCI devices. */
+static char *pci_devs_to_skip;
+static int __init mshv_iommu_setup_skip(char *str) {
+   pci_devs_to_skip = str;
+
+   return 0;
+}
+/* mshv_iommu_skip=(:BB:DD.F)(:BB:DD.F) */
+__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
+
  /* DMA remapping support */
  struct hv_iommu_domain {
struct iommu_domain domain;
@@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
if (!dev_is_pci(dev))
return ERR_PTR(-ENODEV);
  
+	/*
+	 * Skip the PCI device specified in `pci_devs_to_skip`. This is a
+	 * temporary solution until we figure out a way to extract information
+	 * from the hypervisor what devices it is already using.
+	 */
+   if (pci_devs_to_skip && *pci_devs_to_skip) {
+   int pos = 0;
+   int parsed;
+   int segment, bus, slot, func;
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   do {
+   parsed = 0;
+
+   sscanf(pci_devs_to_skip + pos,
+   " (%x:%x:%x.%x) %n",
+   &segment, &bus, &slot, &func, &parsed);
+
+   if (parsed <= 0)
+   break;
+
+   if (pci_domain_nr(pdev->bus) == segment &&
+   pdev->bus->number == bus &&
+   PCI_SLOT(pdev->devfn) == slot &&
+   PCI_FUNC(pdev->devfn) == func)
+   {
+   dev_info(dev, "skipped by MSHV IOMMU\n");
+   return ERR_PTR(-ENODEV);
+   }
+
+   pos += parsed;
+
+   } while (pci_devs_to_skip[pos]);
+   }
+
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev)
return ERR_PTR(-ENOMEM);
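
For reference, given the sscanf format above, the option would be used
like so (hypothetical device addresses):

	mshv_iommu_skip=(0000:00:1f.6)(0001:03:00.0)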



Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread Robin Murphy

On 2021-07-09 12:04, John Garry wrote:

On 09/07/2021 11:26, Robin Murphy wrote:

On 2021-07-09 09:38, Ming Lei wrote:

Hello,

I observed that NVMe performance is very bad when running fio on one
CPU (aarch64) in the remote NUMA node, compared with the NVMe PCI NUMA node.

Please see the test results[1]: 327K vs. 34.9K IOPS.

Latency trace shows that one big difference is in iommu_dma_unmap_sg(),
 nsecs vs 25437 nsecs.


Are you able to dig down further into that? iommu_dma_unmap_sg() 
itself doesn't do anything particularly special, so whatever makes a 
difference is probably happening at a lower level, and I suspect 
there's probably an SMMU involved. If for instance it turns out to go 
all the way down to __arm_smmu_cmdq_poll_until_consumed() because 
polling MMIO from the wrong node is slow, there's unlikely to be much 
you can do about that other than the global "go faster" knobs 
(iommu.strict and iommu.passthrough) with their associated compromises.


There was also the disable_msipolling option:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#n42 



But I am not sure if that platform even supports MSI polling (or has an
SMMUv3).


Hmm, I suppose in principle the MSI polling path could lead to a bit of 
cacheline ping-pong with the SMMU fetching and writing back to the sync 
command, but I'd rather find out more details of where exactly the extra 
time is being spent in this particular situation than speculate much 
further.


You could also try the iommu.forcedac=1 cmdline option. But I doubt it will
help, since the issue was mentioned to be NUMA related.


Plus that shouldn't make any difference to unmaps anyway.


[1] fio test & results

1) fio test result:

- run fio on local CPU
taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 
--runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting


IOPS: 327K
avg latency of iommu_dma_unmap_sg():  nsecs


- run fio on remote CPU
taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 
--iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 
--runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting


IOPS: 34.9K
avg latency of iommu_dma_unmap_sg(): 25437 nsecs

2) system info
[root@ampere-mtjade-04 ~]# lscpu | grep NUMA
NUMA node(s):    2
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

lspci | grep NVMe
0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co 
Ltd NVMe SSD Controller SM981/PM981/PM983


[root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node
0

Since it's Ampere, I guess it's SMMUv3.

BTW, if you remember, I did raise a performance issue of SMMUv3 with
NVMe before:
https://lore.kernel.org/linux-iommu/b2a6e26d-6d0d-7f0d-f222-589812f70...@huawei.com/ 


It doesn't seem like the best-case throughput is of concern in this case 
though, and my hunch is that a ~23x discrepancy in SMMU unmap 
performance depending on locality probably isn't specific to NVMe.


Robin.

I did have this series to improve performance for systems with lots of
CPUs, like the above, but it was not accepted:
https://lore.kernel.org/linux-iommu/1598018062-175608-1-git-send-email-john.ga...@huawei.com/ 



Thanks,
John


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU

2021-07-09 Thread Wei Liu
Some devices may have been claimed by the hypervisor already. One such
example is when a user assigns a NIC for debugging purposes.

Ideally Linux should be able to retrieve that information from the
hypervisor, but there is no way to do that yet, and designing that new
mechanism is going to take time.

Provide a command line option for skipping devices. This is a stopgap
solution, so it is intentionally undocumented. Hopefully we can retire
it in the future.

Signed-off-by: Wei Liu 
---
 drivers/iommu/hyperv-iommu.c | 45 
 1 file changed, 45 insertions(+)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 043dcff06511..353da5036387 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
 
 #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
 
+/* The IOMMU will not claim these PCI devices. */
+static char *pci_devs_to_skip;
+static int __init mshv_iommu_setup_skip(char *str)
+{
+   pci_devs_to_skip = str;
+
+   return 0;
+}
+/* mshv_iommu_skip=(:BB:DD.F)(:BB:DD.F) */
+__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
+
 /* DMA remapping support */
 struct hv_iommu_domain {
struct iommu_domain domain;
@@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
if (!dev_is_pci(dev))
return ERR_PTR(-ENODEV);
 
+   /*
+    * Skip the PCI devices specified in `pci_devs_to_skip`. This is a
+    * temporary solution until we figure out a way to extract information
+    * from the hypervisor about which devices it is already using.
+    */
+   if (pci_devs_to_skip && *pci_devs_to_skip) {
+   int pos = 0;
+   int parsed;
+   int segment, bus, slot, func;
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   do {
+   parsed = 0;
+
+   sscanf(pci_devs_to_skip + pos,
+   " (%x:%x:%x.%x) %n",
+   &segment, &bus, &slot, &func, &parsed);
+
+   if (parsed <= 0)
+   break;
+
+   if (pci_domain_nr(pdev->bus) == segment &&
+   pdev->bus->number == bus &&
+   PCI_SLOT(pdev->devfn) == slot &&
+   PCI_FUNC(pdev->devfn) == func)
+   {
+   dev_info(dev, "skipped by MSHV IOMMU\n");
+   return ERR_PTR(-ENODEV);
+   }
+
+   pos += parsed;
+
+   } while (pci_devs_to_skip[pos]);
+   }
+
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev)
return ERR_PTR(-ENOMEM);
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
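
For reference, the " (%x:%x:%x.%x) %n" walk in the hunk above can be
exercised on its own. A minimal userspace sketch of the same parsing
technique (the option string here is a made-up example):

#include <stdio.h>

/*
 * Scan one "(seg:bus:dev.fn)" tuple at a time; %n records how many
 * characters were consumed so the next iteration can resume there.
 */
int main(void)
{
	const char *opt = "(0003:01:00.0)(0000:02:1f.1)";
	int pos = 0, parsed, segment, bus, slot, func;

	do {
		parsed = 0;
		sscanf(opt + pos, " (%x:%x:%x.%x) %n",
		       &segment, &bus, &slot, &func, &parsed);
		if (parsed <= 0)
			break;
		printf("skip %04x:%02x:%02x.%x\n", segment, bus, slot, func);
		pos += parsed;
	} while (opt[pos]);

	return 0;
}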


[RFC v1 4/8] intel/vt-d: export intel_iommu_get_resv_regions

2021-07-09 Thread Wei Liu
When the Microsoft Hypervisor runs on Intel platforms, it needs to know the
reserved regions to program devices correctly. There is no reason to
duplicate intel_iommu_get_resv_regions(). Export it.

Signed-off-by: Wei Liu 
---
 drivers/iommu/intel/iommu.c | 5 +++--
 include/linux/intel-iommu.h | 4 
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a4294d310b93..01973bc20080 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5176,8 +5176,8 @@ static void intel_iommu_probe_finalize(struct device *dev)
set_dma_ops(dev, NULL);
 }
 
-static void intel_iommu_get_resv_regions(struct device *device,
-struct list_head *head)
+void intel_iommu_get_resv_regions(struct device *device,
+struct list_head *head)
 {
int prot = DMA_PTE_READ | DMA_PTE_WRITE;
struct iommu_resv_region *reg;
@@ -5232,6 +5232,7 @@ static void intel_iommu_get_resv_regions(struct device *device,
return;
	list_add_tail(&reg->list, head);
 }
+EXPORT_SYMBOL_GPL(intel_iommu_get_resv_regions);
 
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 {
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 03faf20a6817..f91869f765bc 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -814,6 +814,8 @@ extern int iommu_calculate_max_sagaw(struct intel_iommu *iommu);
 extern int dmar_disabled;
 extern int intel_iommu_enabled;
 extern int intel_iommu_gfx_mapped;
+extern void intel_iommu_get_resv_regions(struct device *device,
+struct list_head *head);
 #else
 static inline int iommu_calculate_agaw(struct intel_iommu *iommu)
 {
@@ -825,6 +827,8 @@ static inline int iommu_calculate_max_sagaw(struct intel_iommu *iommu)
 }
 #define dmar_disabled  (1)
 #define intel_iommu_enabled (0)
+static inline void intel_iommu_get_resv_regions(struct device *device,
+struct list_head *head) {}
 #endif
 
 #endif
-- 
2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
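
To illustrate the intended use, a consumer in the root kernel could walk
the exported regions roughly as follows. This is a sketch only;
hv_program_identity_mapping() is a hypothetical stand-in for the real
hypercall path:

#include <linux/iommu.h>
#include <linux/intel-iommu.h>

/* Hypothetical helper standing in for the hypercall sequence. */
static int hv_program_identity_mapping(phys_addr_t start, size_t len);

static int hv_map_resv_regions(struct device *dev)
{
	struct iommu_resv_region *region;
	LIST_HEAD(resv_regions);
	int ret = 0;

	intel_iommu_get_resv_regions(dev, &resv_regions);

	list_for_each_entry(region, &resv_regions, list) {
		/* Keep e.g. RMRRs and the MSI window working for this device. */
		ret = hv_program_identity_mapping(region->start,
						  region->length);
		if (ret)
			break;
	}

	iommu_put_resv_regions(dev, &resv_regions);
	return ret;
}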


[RFC v1 5/8] mshv: add paravirtualized IOMMU support

2021-07-09 Thread Wei Liu
The Microsoft Hypervisor provides a set of hypercalls to manage device
domains. Implement a type-1 IOMMU driver using those hypercalls.

Implement DMA remapping as the first step for this driver. Interrupt
remapping will come at a later stage.

Signed-off-by: Wei Liu 
---
 drivers/iommu/Kconfig|  14 +
 drivers/iommu/hyperv-iommu.c | 628 +++
 2 files changed, 642 insertions(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 1f111b399bca..e230c4536825 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -397,6 +397,20 @@ config HYPERV_IOMMU
  Stub IOMMU driver to handle IRQs as to allow Hyper-V Linux
  guests to run with x2APIC mode enabled.
 
+config HYPERV_ROOT_PVIOMMU
+   bool "Microsoft Hypervisor paravirtualized IOMMU support"
+   depends on HYPERV && X86
+   select IOMMU_API
+   select IOMMU_DMA
+   select DMA_OPS
+   select IOMMU_IOVA
+   select INTEL_IOMMU
+   select DMAR_TABLE
+   default HYPERV
+   help
+ A paravirtualized IOMMU for Microsoft Hypervisor. It supports DMA
+ remapping.
+
 config VIRTIO_IOMMU
tristate "Virtio IOMMU driver"
depends on VIRTIO
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..043dcff06511 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -12,7 +12,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -21,6 +26,9 @@
 #include 
 #include 
 #include 
+#include 
+
+#include 
 
 #include "irq_remapping.h"
 
@@ -338,3 +346,623 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
 };
 
 #endif
+
+#ifdef CONFIG_HYPERV_ROOT_PVIOMMU
+
+/* DMA remapping support */
+struct hv_iommu_domain {
+   struct iommu_domain domain;
+   struct hv_iommu_dev *hv_iommu;
+
+   struct hv_input_device_domain device_domain;
+
+   spinlock_t mappings_lock;
+   struct rb_root_cached mappings;
+
+   u32 map_flags;
+   u64 pgsize_bitmap;
+};
+
+/* What hardware IOMMU? */
+enum {
+   INVALID = 0,
+   INTEL, /* Only Intel for now */
+} hw_iommu_type = { INVALID };
+
+static struct hv_iommu_domain hv_identity_domain, hv_null_domain;
+
+#define to_hv_iommu_domain(d) \
+   container_of(d, struct hv_iommu_domain, domain)
+
+struct hv_iommu_mapping {
+   phys_addr_t paddr;
+   struct interval_tree_node iova;
+   u32 flags;
+};
+
+struct hv_iommu_dev {
+   struct iommu_device iommu;
+
+   struct ida domain_ids;
+
+   /* Device configuration */
+   struct iommu_domain_geometry geometry;
+   u64 first_domain;
+   u64 last_domain;
+
+   u32 map_flags;
+   u64 pgsize_bitmap;
+};
+
+static struct hv_iommu_dev *hv_iommu_device;
+
+struct hv_iommu_endpoint {
+   struct device *dev;
+   struct hv_iommu_dev *hv_iommu;
+   struct hv_iommu_domain *domain;
+};
+
+static void __init hv_initialize_special_domains(void)
+{
+   struct hv_iommu_domain *domain;
+
+   /* Default passthrough domain */
+   domain = &hv_identity_domain;
+
+   memset(domain, 0, sizeof(*domain));
+
+   domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+   domain->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+   domain->device_domain.domain_id.id = HV_DEVICE_DOMAIN_ID_S2_DEFAULT;
+
+   domain->domain.geometry = hv_iommu_device->geometry;
+
+   /* NULL domain that blocks all DMA transactions */
+   domain = &hv_null_domain;
+
+   memset(domain, 0, sizeof(*domain));
+
+   domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+   domain->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+   domain->device_domain.domain_id.id = HV_DEVICE_DOMAIN_ID_S2_NULL;
+
+   domain->domain.geometry = hv_iommu_device->geometry;
+}
+
+static bool is_identity_domain(struct hv_iommu_domain *d)
+{
+   return d->device_domain.domain_id.id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT;
+}
+
+static bool is_null_domain(struct hv_iommu_domain *d)
+{
+   return d->device_domain.domain_id.id == HV_DEVICE_DOMAIN_ID_S2_NULL;
+}
+
+static struct iommu_domain *hv_iommu_domain_alloc(unsigned type)
+{
+   struct hv_iommu_domain *domain;
+   int ret;
+   u64 status;
+   unsigned long flags;
+   struct hv_input_create_device_domain *input;
+
+   if (type == IOMMU_DOMAIN_IDENTITY)
+   return &hv_identity_domain.domain;
+
+   if (type == IOMMU_DOMAIN_BLOCKED)
+   return &hv_null_domain.domain;
+
+   domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+   if (!domain)
+   goto out;
+
+   spin_lock_init(&domain->mappings_lock);
+   domain->mappings = RB_ROOT_CACHED;
+
+   if (type == IOMMU_DOMAIN_DMA &&
+   iommu_get_dma_cookie(&domain->domain)) {
+   goto out_free;
+   }
+
+   ret = ida_alloc_
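
The message is truncated above, but the domain-ID allocation it starts is
a standard IDA pattern. A sketch of that pattern, assuming IDs are drawn
from the [first_domain, last_domain] range advertised by the device (this
is not the patch's actual tail):

#include <linux/idr.h>

static int hv_iommu_alloc_domain_id(struct hv_iommu_dev *hv_iommu, u64 *id)
{
	/* Draw an ID from the hypervisor-advertised range. */
	int ret = ida_alloc_range(&hv_iommu->domain_ids,
				  hv_iommu->first_domain,
				  hv_iommu->last_domain, GFP_KERNEL);
	if (ret < 0)
		return ret;

	*id = ret;
	return 0;
}

static void hv_iommu_free_domain_id(struct hv_iommu_dev *hv_iommu, u64 id)
{
	/* Return the ID on domain teardown. */
	ida_free(&hv_iommu->domain_ids, id);
}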

[RFC v1 3/8] intel/vt-d: make DMAR table parsing code more flexible

2021-07-09 Thread Wei Liu
The Microsoft Hypervisor provides a set of hypercalls to manage device
domains. The root kernel should parse the DMAR so that it can program
the IOMMU (with hypercalls) correctly.

The DMAR code was designed to work with the Intel IOMMU only. Add two more
parameters to make it useful to the Microsoft Hypervisor, which does not
need the DMAR parsing code to allocate an Intel IOMMU structure and also
wishes to always reparse the DMAR table, even after it has already been
parsed.

Adjust the Intel IOMMU code to use the new dmar_table_init(). There should be
no functional change to the Intel IOMMU code.

Signed-off-by: Wei Liu 
---
We may be able to combine alloc and force_parse?
---
 drivers/iommu/intel/dmar.c  | 38 -
 drivers/iommu/intel/iommu.c |  2 +-
 drivers/iommu/intel/irq_remapping.c |  2 +-
 include/linux/dmar.h|  2 +-
 4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 84057cb9596c..bd72f47c728b 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -408,7 +408,8 @@ dmar_find_dmaru(struct acpi_dmar_hardware_unit *drhd)
  * structure which uniquely represent one DMA remapping hardware unit
  * present in the platform
  */
-static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
+static int dmar_parse_one_drhd_internal(struct acpi_dmar_header *header,
+   void *arg, bool alloc)
 {
struct acpi_dmar_hardware_unit *drhd;
struct dmar_drhd_unit *dmaru;
@@ -440,12 +441,14 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
return -ENOMEM;
}
 
-   ret = alloc_iommu(dmaru);
-   if (ret) {
-   dmar_free_dev_scope(&dmaru->devices,
-   &dmaru->devices_cnt);
-   kfree(dmaru);
-   return ret;
+   if (alloc) {
+   ret = alloc_iommu(dmaru);
+   if (ret) {
+   dmar_free_dev_scope(&dmaru->devices,
+   &dmaru->devices_cnt);
+   kfree(dmaru);
+   return ret;
+   }
}
dmar_register_drhd_unit(dmaru);
 
@@ -456,6 +459,16 @@ static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
return 0;
 }
 
+static int dmar_parse_one_drhd(struct acpi_dmar_header *header, void *arg)
+{
+   return dmar_parse_one_drhd_internal(header, arg, true);
+}
+
+int dmar_parse_one_drhd_noalloc(struct acpi_dmar_header *header, void *arg)
+{
+   return dmar_parse_one_drhd_internal(header, arg, false);
+}
+
 static void dmar_free_drhd(struct dmar_drhd_unit *dmaru)
 {
if (dmaru->devices && dmaru->devices_cnt)
@@ -633,7 +646,7 @@ static inline int dmar_walk_dmar_table(struct acpi_table_dmar *dmar,
  * parse_dmar_table - parses the DMA reporting table
  */
 static int __init
-parse_dmar_table(void)
+parse_dmar_table(bool alloc)
 {
struct acpi_table_dmar *dmar;
int drhd_count = 0;
@@ -650,6 +663,9 @@ parse_dmar_table(void)
.cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc,
};
 
+   if (!alloc)
+   cb.cb[ACPI_DMAR_TYPE_HARDWARE_UNIT] = &dmar_parse_one_drhd_noalloc;
+
/*
 * Do it again, earlier dmar_tbl mapping could be mapped with
 * fixed map.
@@ -840,13 +856,13 @@ void __init dmar_register_bus_notifier(void)
 }
 
 
-int __init dmar_table_init(void)
+int __init dmar_table_init(bool alloc, bool force_parse)
 {
static int dmar_table_initialized;
int ret;
 
-   if (dmar_table_initialized == 0) {
-   ret = parse_dmar_table();
+   if (dmar_table_initialized == 0 || force_parse) {
+   ret = parse_dmar_table(alloc);
if (ret < 0) {
if (ret != -ENODEV)
pr_info("Parse DMAR table failure.\n");
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index be35284a2016..a4294d310b93 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4310,7 +4310,7 @@ int __init intel_iommu_init(void)
}
 
down_write(&dmar_global_lock);
-   if (dmar_table_init()) {
+   if (dmar_table_init(true, false)) {
if (force_on)
panic("tboot: Failed to initialize DMAR table\n");
goto out_free_dmar;
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index f912fe45bea2..0e8abef862e4 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -732,7 +732,7 @@ static int __init intel_prepare_irq_remapping(void)
return -ENODEV;
}
 
-   if (dmar_table_init() < 0)
+   if (dmar_table_init(true, false) < 0)
return -ENODEV;
 
if (intel_cap_audit(CAP_AUDIT_S
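
To make the new signature concrete: existing Intel callers keep today's
behaviour via dmar_table_init(true, false), while a consumer such as the
Microsoft Hypervisor driver could do something along these lines (a
hypothetical caller, not part of this patch):

#include <linux/dmar.h>
#include <linux/rwsem.h>

/*
 * Reparse the DMAR table on demand (force_parse = true) without
 * allocating struct intel_iommu instances (alloc = false): the
 * hypervisor owns the hardware, so the root kernel only needs the
 * topology reported by the table.
 */
static int __init hv_iommu_parse_dmar(void)
{
	int ret;

	down_write(&dmar_global_lock);
	ret = dmar_table_init(false, true);
	up_write(&dmar_global_lock);

	return ret;
}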

Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread John Garry

On 09/07/2021 11:26, Robin Murphy wrote:

On 2021-07-09 09:38, Ming Lei wrote:

Hello,

I observed that NVMe performance is very bad when running fio on one
CPU (aarch64) in the remote NUMA node, compared with the NVMe PCI NUMA node.

Please see the test results[1]: 327K vs. 34.9K IOPS.

Latency trace shows that one big difference is in iommu_dma_unmap_sg(),
 nsecs vs 25437 nsecs.


Are you able to dig down further into that? iommu_dma_unmap_sg() itself 
doesn't do anything particularly special, so whatever makes a difference 
is probably happening at a lower level, and I suspect there's probably 
an SMMU involved. If for instance it turns out to go all the way down to 
__arm_smmu_cmdq_poll_until_consumed() because polling MMIO from the 
wrong node is slow, there's unlikely to be much you can do about that 
other than the global "go faster" knobs (iommu.strict and 
iommu.passthrough) with their associated compromises.


There was also the disable_msipolling option:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#n42

But I am not sure if that platform even supports MSI polling (or has an
SMMUv3).


You could also try the iommu.forcedac=1 cmdline option. But I doubt it will
help, since the issue was mentioned to be NUMA related.




Robin.


[1] fio test & results

1) fio test result:

- run fio on local CPU
taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 
--rw=randread --name=test --group_reporting


IOPS: 327K
avg latency of iommu_dma_unmap_sg():  nsecs


- run fio on remote CPU
taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 
--rw=randread --name=test --group_reporting


IOPS: 34.9K
avg latency of iommu_dma_unmap_sg(): 25437 nsecs

2) system info
[root@ampere-mtjade-04 ~]# lscpu | grep NUMA
NUMA node(s):    2
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

lspci | grep NVMe
0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co 
Ltd NVMe SSD Controller SM981/PM981/PM983


[root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node
0

Since it's Ampere, I guess it's SMMUv3.

BTW, if you remember, I did raise a performance issue of SMMUv3 with
NVMe before:

https://lore.kernel.org/linux-iommu/b2a6e26d-6d0d-7f0d-f222-589812f70...@huawei.com/

I did have this series to improve performance for systems with lots of
CPUs, like the above, but it was not accepted:

https://lore.kernel.org/linux-iommu/1598018062-175608-1-git-send-email-john.ga...@huawei.com/

Thanks,
John

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread Robin Murphy

On 2021-07-09 09:38, Ming Lei wrote:

Hello,

I observed that NVMe performance is very bad when running fio on one
CPU (aarch64) in the remote NUMA node, compared with the NVMe PCI NUMA node.

Please see the test results[1]: 327K vs. 34.9K IOPS.

Latency trace shows that one big difference is in iommu_dma_unmap_sg(),
 nsecs vs 25437 nsecs.


Are you able to dig down further into that? iommu_dma_unmap_sg() itself 
doesn't do anything particularly special, so whatever makes a difference 
is probably happening at a lower level, and I suspect there's probably 
an SMMU involved. If for instance it turns out to go all the way down to 
__arm_smmu_cmdq_poll_until_consumed() because polling MMIO from the 
wrong node is slow, there's unlikely to be much you can do about that 
other than the global "go faster" knobs (iommu.strict and 
iommu.passthrough) with their associated compromises.


Robin.


[1] fio test & results

1) fio test result:

- run fio on local CPU
taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread 
--name=test --group_reporting

IOPS: 327K
avg latency of iommu_dma_unmap_sg():  nsecs


- run fio on remote CPU
taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread 
--name=test --group_reporting

IOPS: 34.9K
avg latency of iommu_dma_unmap_sg(): 25437 nsecs

2) system info
[root@ampere-mtjade-04 ~]# lscpu | grep NUMA
NUMA node(s):    2
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

lspci | grep NVMe
0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe 
SSD Controller SM981/PM981/PM983

[root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node
0



Thanks,
Ming


___
linux-arm-kernel mailing list
linux-arm-ker...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread Russell King (Oracle)
On Fri, Jul 09, 2021 at 04:38:09PM +0800, Ming Lei wrote:
> I observed that NVMe performance is very bad when running fio on one
> CPU (aarch64) in the remote NUMA node, compared with the NVMe PCI NUMA node.

Have you checked the effect of running a memory-heavy process using
memory from node 1 while being executed by CPUs in node 0?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
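
One way to test that is a small libnuma program that pins execution to
node 0 while touching memory allocated on node 1 (a sketch; build with
-lnuma):

#include <numa.h>
#include <string.h>

int main(void)
{
	const size_t sz = 1UL << 30;		/* 1 GiB */
	char *buf;
	int i;

	if (numa_available() < 0)
		return 1;

	numa_run_on_node(0);			/* execute on node 0 CPUs */
	buf = numa_alloc_onnode(sz, 1);		/* back memory with node 1 */
	if (!buf)
		return 1;

	for (i = 0; i < 64; i++)
		memset(buf, i, sz);		/* generate remote traffic */

	numa_free(buf, sz);
	return 0;
}

If plain cross-node memory traffic shows a comparable penalty, the SMMU
is likely not the culprit.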


[bug report] iommu_dma_unmap_sg() is very slow when running IO from remote numa node

2021-07-09 Thread Ming Lei
Hello,

I observed that NVMe performance is very bad when running fio on one
CPU (aarch64) in the remote NUMA node, compared with the NVMe PCI NUMA node.

Please see the test results[1]: 327K vs. 34.9K IOPS.

Latency trace shows that one big difference is in iommu_dma_unmap_sg(),
 nsecs vs 25437 nsecs.


[1] fio test & results

1) fio test result:

- run fio on local CPU
taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread 
--name=test --group_reporting

IOPS: 327K
avg latency of iommu_dma_unmap_sg():  nsecs


- run fio on remote CPU
taskset -c 80 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri 
--iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 
--filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread 
--name=test --group_reporting

IOPS: 34.9K
avg latency of iommu_dma_unmap_sg(): 25437 nsecs

2) system info
[root@ampere-mtjade-04 ~]# lscpu | grep NUMA
NUMA node(s):    2
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

lspci | grep NVMe
0003:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe 
SSD Controller SM981/PM981/PM983

[root@ampere-mtjade-04 ~]# cat /sys/block/nvme1n1/device/device/numa_node
0



Thanks, 
Ming

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC v2] /dev/iommu uAPI proposal

2021-07-09 Thread Tian, Kevin
/dev/iommu provides a unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.

This proposal describes the uAPI of /dev/iommu and also sample sequences
with VFIO as an example of typical usage. The driver-facing kernel API provided
by the iommu layer is still TBD, which can be discussed after consensus is
reached on this uAPI.

It's based on a lengthy discussion starting from here:

https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 

v1 can be found here:

https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/

This doc is also tracked on github, though it's not very useful for v1->v2
given the dramatic refactoring:
https://github.com/luxis1999/dev_iommu_uapi 

Changelog (v1->v2):
- Rename /dev/ioasid to /dev/iommu (Jason);
- Add a section for device-centric vs. group-centric design (many);
- Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
- Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
- Allow one device bound to multiple iommu fd's (Jason);
- No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
- Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
- Add capability/format query interface per device cookie (Jason);
- Specify format/attribute when creating an IOASID, leading to several v1
  uAPI commands removed (Jason);
- Explain the value of software nesting (Jean);
- Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
- Cover software mdev usage (Jason);
- No restriction on map/unmap vs. bind/invalidate (Jason/David);
- Report permitted IOVA range instead of reserved range (David);
- Refine the sample structures and helper functions (Jason);
- Add definition of default and non-default I/O address spaces;
- Expand and clarify the design for PASID virtualization;
- and lots of subtle refinement according to above changes;

TOC

1. Terminologies and Concepts
1.1. Manage I/O address space
1.2. Attach device to I/O address space
1.3. Group isolation
1.4. PASID virtualization
1.4.1. Devices which don't support DMWr
1.4.2. Devices which support DMWr
1.4.3. Mix different types together
1.4.4. User sequence
1.5. No-snoop DMA
2. uAPI Proposal
2.1. /dev/iommu uAPI
2.2. /dev/vfio device uAPI
2.3. /dev/kvm uAPI
3. Sample Structures and Helper Functions
4. Use Cases and Flows
4.1. A simple example
4.2. Multiple IOASIDs (no nesting)
4.3. IOASID nesting (software)
4.4. IOASID nesting (hardware)
4.5. Guest SVA (vSVA)
4.6. I/O page fault


1. Terminologies and Concepts
-

IOMMU fd is the container holding multiple I/O address spaces. The user
manages those address spaces through fd operations. Multiple fds are
allowed per process, but with this proposal one fd should be sufficient for
all intended usages.

IOASID is the fd-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying the output address from one I/O page table 
(represented by child IOASID) must be further translated by another I/O 
page table (represented by parent IOASID).

An I/O address space takes effect only after a device is attached to it.
One device is allowed to attach to multiple I/O address spaces. One I/O
address space can be attached by multiple devices.

A device must be bound to an IOMMU fd before the attach operation can be
conducted. Though not necessary, the user may bind one device to multiple
IOMMU fds. But no cross-fd IOASID nesting is allowed.

The format of an I/O page table must be compatible with the attached
devices (or more specifically with the IOMMU which serves the DMA from
the attached devices). The user is responsible for specifying the format
when allocating an IOASID, according to the one or multiple devices which
will be attached right after. Attaching a device to an IOASID with an
incompatible format is simply rejected.

Relationship between IOMMU fd, VFIO fd and KVM fd:

-   IOMMU fd provides uAPI for managing IOASIDs and I/O page tables.
It also provides a unified capability/format reporting interface for
each bound device.

-   VFIO fd provides uAPI for device binding and attaching. In this proposal 
VFIO is used as the example of device passthrough frameworks. The
routing information that identifies an I/O address space in the wire is 
per-device and registered to IOMMU fd via VFIO uAPI.

-   KVM fd provides uAPI for handling no-snoop DMA and PASID virtualization
in CPU (when PASID is carried in instruction payload).
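
Putting the three fds together, a rough user-side flow looks like this.
Every command name below is a hypothetical placeholder (the real commands
are defined in section 2), so this only illustrates the object
relationships described above:

#include <fcntl.h>
#include <sys/ioctl.h>

/* Hypothetical placeholders; see section 2 for the real commands. */
#define VFIO_DEVICE_BIND_IOMMU_FD	_IO('V', 100)
#define IOMMU_IOASID_ALLOC		_IO('I', 0)
#define VFIO_DEVICE_ATTACH_IOASID	_IO('V', 101)
#define IOMMU_MAP_DMA			_IO('I', 1)

int example_flow(int vfio_device_fd)
{
	int iommu_fd = open("/dev/iommu", O_RDWR);
	int ioasid;

	/* 1. Bind the device to the IOMMU fd (bind must precede attach). */
	ioctl(vfio_device_fd, VFIO_DEVICE_BIND_IOMMU_FD, &iommu_fd);

	/* 2. Allocate an IOASID with a format compatible with the
	 *    device(s) about to be attached. */
	ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, 0);

	/* 3. Attach the device; the I/O address space takes effect. */
	ioctl(vfio_device_fd, VFIO_DEVICE_ATTACH_IOASID, &ioasid);

	/* 4. Populate the kernel-managed I/O page table. */
	ioctl(iommu_fd, IOMMU_MAP_DMA, 0);

	return iommu_fd;
}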

1.1. Manage I/O address space
+++

Re: [PATCH 0/4] Add dynamic iommu backed bounce buffers

2021-07-09 Thread David Stevens
On Fri, Jul 9, 2021 at 2:14 AM Robin Murphy  wrote:
>
> On 2021-07-08 10:29, Joerg Roedel wrote:
> > Adding Robin too.
> >
> > On Wed, Jul 07, 2021 at 04:55:01PM +0900, David Stevens wrote:
> >> Add support for per-domain dynamic pools of iommu bounce buffers to the
> >> dma-iommu API. This allows iommu mappings to be reused while still
> >> maintaining strict iommu protection. Allocating buffers dynamically
> >> instead of using swiotlb carveouts makes per-domain pools more amenable
> >> on systems with large numbers of devices or where devices are unknown.
>
> But isn't that just as true for the currently-supported case? All you
> need is a large enough Thunderbolt enclosure and you could suddenly plug
> in a dozen untrusted GPUs all wanting to map hundreds of megabytes of
> memory. If there's a real concern worth addressing, surely it's worth
> addressing properly for everyone.

Bounce buffers consume memory, so there is always going to be some
limitation on how many devices are supported. This patch series limits
the memory consumption at a given point in time to approximately the
amount of active DMA transactions. There's really no way to improve
significantly on that. The 'approximately' qualification could be
removed by adding a shrinker, but that doesn't change things
materially.

This is compared to reusing swiotlb, where the amount of memory
consumed would be the largest amount of active DMA transactions you
want bounce buffers to handle. I see two concrete shortcomings here.
First, most of the time you're not doing heavy IO, especially for
consumer workloads. Second, it raises the problem of per-device
tuning, since you don't want to waste performance by having too few
bounce buffers but you also don't want to waste memory by
preallocating too many bounce buffers. This tuning becomes more
problematic once you start dealing with external devices.

Also, although this doesn't directly address the raised concern, the
bounce buffers are only used for relatively small DMA transactions. So
large allocations like framebuffers won't actually consume extra
memory via bounce buffers.

> >> When enabled, all non-direct streaming mappings below a configurable
> >> size will go through bounce buffers. Note that this means drivers which
> >> don't properly use the DMA API (e.g. i915) cannot use an iommu when this
> >> feature is enabled. However, all drivers which work with swiotlb=force
> >> should work.
> >>
> >> Bounce buffers serve as an optimization in situations where interactions
> >> with the iommu are very costly. For example, virtio-iommu operations in
> >> a guest on a Linux host require a vmexit, involvement of the VMM, and a
> >> VFIO syscall. For relatively small DMA operations, memcpy can be
> >> significantly faster.
>
> Yup, back when the bounce-buffering stuff first came up I know
> networking folks were interested in terms of latency for small packets -
> virtualised IOMMUs are indeed another interesting case I hadn't thought
> of. It's definitely been on the radar as another use-case we'd like to
> accommodate with the bounce-buffering scheme. However, that's the thing:
> bouncing is bouncing and however you look at it it still overlaps so
> much with the untrusted case - there's no reason that couldn't use
> pre-mapped bounce buffers too, for instance - that the only necessary
> difference is really the policy decision of when to bounce. iommu-dma
> has already grown complicated enough, and having *three* different ways
> of doing things internally just seems bonkers and untenable. Pre-map the
> bounce buffers? Absolutely. Dynamically grow them on demand? Yes please!
> Do it all as a special thing in its own NIH module and leave the
> existing mess to rot? Sorry, but no.

I do agree that iommu-dma is getting fairly complicated. Since a
virtualized IOMMU uses bounce buffers much more heavily than
sub-granule untrusted DMA, and for the reasons stated earlier in this
email, I don't think pre-allocated bounce buffers are viable for the
virtualized IOMMU case. I can look at migrating the sub-granule
untrusted DMA case to dynamic bounce buffers, if that's an acceptable
approach.

-David

> Thanks,
> Robin.
>
> >> As a performance comparison, on a device with an i5-10210U, I ran fio
> >> with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
> >> --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
> >> 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
> >> spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
> >> 94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
> >> by >99%, as bounce buffers don't require syncing here in the read case.
> >> Running with multiple jobs doesn't serve as a useful performance
> >> comparison because virtio-iommu and vfio_iommu_type1 both have big
> >> locks that significantly limit mulithreaded DMA performance.
> >>
> >> This patch set is based on v5.13-rc7 plus the patches at [1].
> >>
> >> Davi