Re: [PATCH 4/5] iommu/arm-smmu: Make way to add Qcom's smmu-500 errata handling

2018-08-14 Thread Robin Murphy

On 14/08/18 11:55, Vivek Gautam wrote:

Cleanup to allow re-use of the TLB sync and invalidate code.

Signed-off-by: Vivek Gautam 
---
  drivers/iommu/arm-smmu.c | 32 +---
  1 file changed, 25 insertions(+), 7 deletions(-)


I think the overall diffstat would be an awful lot smaller if the 
erratum workaround just has its own readl_poll_timeout() as it does in 
the vendor kernel. The burst-polling loop is for minimising latency in 
high-throughput situations, and if you're in a workaround which has to 
lock *every* register write and issue two firmware calls around each 
sync I think you're already well out of that game.
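
For reference, a minimal sketch of that suggestion (an illustration, not
the vendor code; the helper name is made up, and it assumes the
workaround path runs under a spinlock, hence the atomic poll variant):

#include <linux/iopoll.h>

/*
 * Dedicated poll for the erratum path: busy-wait on sTLBGSTATUS (from
 * arm-smmu-regs.h) until GSACTIVE clears, polling every 10us with a 1s
 * timeout to mirror TLB_LOOP_TIMEOUT. No burst-polling is needed once
 * every sync already takes a lock and two firmware calls.
 */
static int qcom_errata_tlb_sync_poll(void __iomem *status)
{
	u32 reg;

	return readl_poll_timeout_atomic(status, reg,
					 !(reg & sTLBGSTATUS_GSACTIVE),
					 10, 1000000);
}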



diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 32e86df80428..75c146751c87 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -391,21 +391,31 @@ static void __arm_smmu_free_bitmap(unsigned long *map, 
int idx)
clear_bit(idx, map);
  }
  
-/* Wait for any pending TLB invalidations to complete */

-static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
-   void __iomem *sync, void __iomem *status)
+static int __arm_smmu_tlb_sync_wait(void __iomem *status)
  {
unsigned int spin_cnt, delay;
  
-	writel_relaxed(0, sync);

for (delay = 1; delay < TLB_LOOP_TIMEOUT; delay *= 2) {
for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) {
if (!(readl_relaxed(status) & sTLBGSTATUS_GSACTIVE))
-   return;
+   return 0;
cpu_relax();
}
udelay(delay);
}
+
+   return -EBUSY;
+}
+
+/* Wait for any pending TLB invalidations to complete */
+static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
+   void __iomem *sync, void __iomem *status)
+{
+   writel_relaxed(0, sync);
+
+   if (!__arm_smmu_tlb_sync_wait(status))
+   return;
+
dev_err_ratelimited(smmu->dev,
"TLB sync timed out -- SMMU may be deadlocked\n");
  }
@@ -461,8 +471,9 @@ static void arm_smmu_tlb_inv_context_s2(void *cookie)
arm_smmu_tlb_sync_global(smmu);
  }
  
-static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
- size_t granule, bool leaf, void *cookie)
+static void __arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
+   size_t granule, bool leaf,
+   void *cookie)
  {
struct arm_smmu_domain *smmu_domain = cookie;
struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
@@ -498,6 +509,13 @@ static void arm_smmu_tlb_inv_range_nosync(unsigned long 
iova, size_t size,
}
  }
  
+static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
+ size_t granule, bool leaf,
+ void *cookie)
+{
+   __arm_smmu_tlb_inv_range_nosync(iova, size, granule, leaf, cookie);
+}
+


AFAICS even after patch #5 this does absolutely nothing except make the 
code needlessly harder to read :(


Robin.


  /*
   * On MMU-401 at least, the cost of firing off multiple TLBIVMIDs appears
   * almost negligible, but the benefit of getting the first one in as far ahead




Re: [PATCH v2 3/3] dts: arm64/sdm845: Add node for qcom,smmu-v2

2018-08-14 Thread Vivek Gautam
Adding Jordan here.

On Tue, Aug 14, 2018 at 4:19 PM, Robin Murphy  wrote:
> Hi Vivek,
>
> On 14/08/18 11:27, Vivek Gautam wrote:
>>
>> Add device node for qcom,smmu-v2 available on sdm845.
>> This smmu is available only to GPU device.
>>
>> Signed-off-by: Vivek Gautam 
>> ---
>>   arch/arm64/boot/dts/qcom/sdm845.dtsi | 23 +++
>>   1 file changed, 23 insertions(+)
>>
>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> index 1c2be2082f33..bd1ec5fa5146 100644
>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> @@ -6,6 +6,7 @@
>>*/
>> #include 
>> +#include 
>>   #include 
>>   #include 
>>   #include 
>> @@ -989,6 +990,28 @@
>> cell-index = <0>;
>> };
>>   + gpu_smmu: iommu@504 {
>> +   compatible = "qcom,sdm845-smmu-v2",
>> "qcom,smmu-v2";
>
>
> Which of "sdm845" or "msm8996"[1] is the actual SoC name here?

Well, the bindings use the SoC prefix with smmu-v2, so it should be
sdm845 for this SoC. This is the same as what I posted in v1 of the series [2].
Using an 8996-based string on sdm845 makes things look awful.

Thanks
Vivek

[2] https://patchwork.kernel.org/patch/10534989/

>
> Robin.
>
> [1]
> https://www.mail-archive.com/freedreno@lists.freedesktop.org/msg02659.html
>
>> +   reg = <0x504 0x1>;
>> +   #iommu-cells = <1>;
>> +   #global-interrupts = <2>;
>> +   interrupts = ,
>> +,
>> +,
>> +,
>> +,
>> +,
>> +,
>> +,
>> +,
>> +;
>> +   clock-names = "bus", "iface";
>> +   clocks = < GCC_GPU_MEMNOC_GFX_CLK>,
>> +< GCC_GPU_CFG_AHB_CLK>;
>> +
>> +   /*power-domains = < GPU_CX_GDSC>;*/
>> +   };
>> +
>> apps_smmu: iommu@1500 {
>> compatible = "qcom,sdm845-smmu-500",
>> "arm,mmu-500";
>> reg = <0x1500 0x8>;
>>



-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH v3 4/6] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-14 Thread Leizhen (ThunderTown)



On 2018/8/6 9:32, Yang, Shunyong wrote:
> Hi, Robin,
> 
> On 2018/7/26 22:37, Robin Murphy wrote:
>> On 2018-07-26 8:20 AM, Leizhen (ThunderTown) wrote:
>>> On 2018/7/25 6:25, Robin Murphy wrote:
 On 2018-07-12 7:18 AM, Zhen Lei wrote:
> To support non-strict mode, we now only tlbi and sync in strict
> mode. But for the non-leaf case, always follow strict mode.
>
> Use the lowest bit of the iova parameter to pass the strict mode:
> 0, IOMMU_STRICT;
> 1, IOMMU_NON_STRICT;
> Treat 0 as IOMMU_STRICT, so that the unmap operation can stay compatible
> with other IOMMUs which still use strict mode.
>
> Signed-off-by: Zhen Lei 
> ---
>drivers/iommu/io-pgtable-arm.c | 23 ++-
>1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/iommu/io-pgtable-arm.c 
> b/drivers/iommu/io-pgtable-arm.c
> index 010a254..9234db3 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -292,7 +292,7 @@ static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, 
> arm_lpae_iopte pte,
>  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   unsigned long iova, size_t size, int lvl,
> -   arm_lpae_iopte *ptep);
> +   arm_lpae_iopte *ptep, int strict);
>  static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>phys_addr_t paddr, arm_lpae_iopte prot,
> @@ -334,7 +334,7 @@ static int arm_lpae_init_pte(struct 
> arm_lpae_io_pgtable *data,
>size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
>  tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
> -if (WARN_ON(__arm_lpae_unmap(data, iova, sz, lvl, tblp) != sz))
> +if (WARN_ON(__arm_lpae_unmap(data, iova, sz, lvl, tblp, 
> IOMMU_STRICT) != sz))
>return -EINVAL;
>}
>@@ -531,7 +531,7 @@ static void arm_lpae_free_pgtable(struct 
> io_pgtable *iop)
>static size_t arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable 
> *data,
>   unsigned long iova, size_t size,
>   arm_lpae_iopte blk_pte, int lvl,
> -   arm_lpae_iopte *ptep)
> +   arm_lpae_iopte *ptep, int strict)

 DMA code should never ever be splitting blocks anyway, and frankly the TLB 
 maintenance here is dodgy enough (since we can't reasonably do 
 break-before-make as VMSA says we should) that I *really* don't want to 
 introduce any possibility of making it more asynchronous. I'd much rather 
 just hard-code the expectation of strict == true for this.
>>>
>>> OK, I will hard-code strict=true for it.
>>>
>>> But since it can never happen, why not give a warning at the 
>>> beginning?
>>
>> Because DMA code is not the only caller of iommu_map/unmap. It's 
>> perfectly legal in the IOMMU API to partially unmap a previous mapping 
>> such that a block entry needs to be split. The DMA API, however, is a 
>> lot more constrained, and thus by construction the iommu-dma layer will 
>> never generate a block-splitting iommu_unmap() except as a result of 
>> illegal DMA API usage, and we obviously do not need to optimise for that 
>> (you will get a warning about mismatched unmaps under dma-debug, but 
>> it's a bit too expensive to police in the general case).
>>
> 
> When I was reading the code around arm_lpae_split_blk_unmap(), I was
> curious in which scenario a block will be split. Now with your comments
> "Because DMA code is not the only caller of iommu_map/unmap", it seems
> depending on the user.
> 
> Would you please explain this further? I mean besides DMA, which user
> will use iommu_map/umap and how it split a block.

I also think the arm_lpae_split_blk_unmap() scenario does not exist; maybe
we should remove it and give a warning for this wrong usage.

> 
> Thanks.
> Shunyong.
> 
>>
>{
struct io_pgtable_cfg *cfg = &data->iop.cfg;
>arm_lpae_iopte pte, *tablep;
> @@ -576,15 +576,18 @@ static size_t arm_lpae_split_blk_unmap(struct 
> arm_lpae_io_pgtable *data,
>}
>  if (unmap_idx < 0)
> -return __arm_lpae_unmap(data, iova, size, lvl, tablep);
> +return __arm_lpae_unmap(data, iova, size, lvl, tablep, strict);
io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
> +if (!strict)
> +io_pgtable_tlb_sync(&data->iop);
> +
>return size;
>}
>  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>   unsigned long iova, size_t size, int lvl,
> -   arm_lpae_iopte *ptep)
> +   arm_lpae_iopte *ptep, int strict)
>{
>arm_lpae_iopte pte;
struct io_pgtable *iop = &data->iop;

Re: Question on iommu_get_domain_for_dev() for DMA map/unmap

2018-08-14 Thread John Garry

On 14/08/2018 11:45, Robin Murphy wrote:

Hi John,



Hi Robin,


On 14/08/18 11:09, John Garry wrote:

Hi All,

I have a question on function iommu_get_domain_for_dev() in DMA
mapping path, and why we need to get+put a reference to the iommu group.

The background is that we have been testing iperf throughput
performance for a PCIe NIC card behind an SMMUv3, with small packets
and many threads (128) on a system with a lot of cores (>64).

We find that the reference counting on the iommu group in
iommu_dma_map_sg()->iommu_get_domain_for_dev()->kobject_get()/put() is
a big bottleneck. There is much contention on getting and putting the
reference to iommu_group.device_kobj.

The question is do we actually need to get+put this reference for DMA
map/unmap, since the device already takes a reference to the
iommu_group.device_kobj when attached to the group? Is it to protect
against the device being removed and the group being freed as it is
dereferenced? If so, I would say it's not safe to reference the
associated iommu domain thereafter in the DMA map/unmap, as it would
be freed along with the iommu group.


Thanks for getting back to me.



Yes, I agree that the reference is something we don't strictly need - if
the group disappears while the device is in the middle of DMA things are
beyond broken. However, the problem for iommu-dma as "generic" code was
that since dev->iommu_group is opaque, there was no other way to
actually retrieve the domain pointer.

Now that hardware exists to show a real and measurable impact it
probably is time to revisit this. I have an idea for what I think is a
reasonable patch - let me throw that together...



So we had some patches floating around internally to work around this 
ref counting bottleneck by using RCU or per-cpu refcount, and they did 
improve performance considerably. I did like them, as they kept 
the same fundamental functionality.


Anyway, we would appreciate any patch and will test it.


Robin.


All the best,
John



.






Re: [PATCH v3 4/6] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-14 Thread Robin Murphy

On 14/08/18 09:35, Will Deacon wrote:

On Tue, Aug 14, 2018 at 04:33:41PM +0800, Leizhen (ThunderTown) wrote:

On 2018/8/6 9:32, Yang, Shunyong wrote:

On 2018/7/26 22:37, Robin Murphy wrote:

Because DMA code is not the only caller of iommu_map/unmap. It's
perfectly legal in the IOMMU API to partially unmap a previous mapping
such that a block entry needs to be split. The DMA API, however, is a
lot more constrained, and thus by construction the iommu-dma layer will
never generate a block-splitting iommu_unmap() except as a result of
illegal DMA API usage, and we obviously do not need to optimise for that
(you will get a warning about mismatched unmaps under dma-debug, but
it's a bit too expensive to police in the general case).



When I was reading the code around arm_lpae_split_blk_unmap(), I was
curious in which scenario a block will be split. Now with your comments
"Because DMA code is not the only caller of iommu_map/unmap", it seems
depending on the user.

Would you please explain this further? I mean besides DMA, which user
will use iommu_map/umap and how it split a block.


I also think the arm_lpae_split_blk_unmap() scenario does not exist; maybe
we should remove it and give a warning for this wrong usage.


Can't it happen with VFIO?


...or GPU drivers, or anyone else managing their own IOMMU domain 
directly. A sequence like this is perfectly legal:


iommu_map(domain, iova, paddr, SZ_8M, prot);
...
iommu_unmap(domain, iova + SZ_1M * 5, SZ_1M * 3);

where if iova and paddr happen to be suitably aligned, the map will lay 
down blocks, and the unmap will then have to split one of them into 
pages to remove half of it. We don't tear our hair out maintaining 
split_blk_unmap() for the fun of it :(


Robin.


[PATCH 2/5] firmware/qcom_scm: Add atomic version of io read/write APIs

2018-08-14 Thread Vivek Gautam
Add atomic versions of qcom_scm_io_readl/writel to enable
reading/writing secure registers from atomic context.

Signed-off-by: Vivek Gautam 
---
 drivers/firmware/qcom_scm-32.c | 12 
 drivers/firmware/qcom_scm-64.c | 32 
 drivers/firmware/qcom_scm.c| 12 
 drivers/firmware/qcom_scm.h|  4 
 include/linux/qcom_scm.h   |  4 
 5 files changed, 64 insertions(+)

diff --git a/drivers/firmware/qcom_scm-32.c b/drivers/firmware/qcom_scm-32.c
index 4e24e591ae74..7293e5efad69 100644
--- a/drivers/firmware/qcom_scm-32.c
+++ b/drivers/firmware/qcom_scm-32.c
@@ -627,3 +627,15 @@ int __qcom_scm_io_writel(struct device *dev, phys_addr_t 
addr, unsigned int val)
return qcom_scm_call_atomic2(QCOM_SCM_SVC_IO, QCOM_SCM_IO_WRITE,
 addr, val);
 }
+
+int __qcom_scm_io_readl_atomic(struct device *dev, phys_addr_t addr,
+  unsigned int *val)
+{
+   return -ENODEV;
+}
+
+int __qcom_scm_io_writel_atomic(struct device *dev, phys_addr_t addr,
+   unsigned int val)
+{
+   return -ENODEV;
+}
diff --git a/drivers/firmware/qcom_scm-64.c b/drivers/firmware/qcom_scm-64.c
index 3a8c867cdf51..6bf55403f6e3 100644
--- a/drivers/firmware/qcom_scm-64.c
+++ b/drivers/firmware/qcom_scm-64.c
@@ -558,3 +558,35 @@ int __qcom_scm_io_writel(struct device *dev, phys_addr_t 
addr, unsigned int val)
return qcom_scm_call(dev, QCOM_SCM_SVC_IO, QCOM_SCM_IO_WRITE,
 &desc, &res);
 }
+
+int __qcom_scm_io_readl_atomic(struct device *dev, phys_addr_t addr,
+  unsigned int *val)
+{
+   struct qcom_scm_desc desc = {0};
+   struct arm_smccc_res res;
+   int ret;
+
+   desc.args[0] = addr;
+   desc.arginfo = QCOM_SCM_ARGS(1);
+
+   ret = qcom_scm_call_atomic(dev, QCOM_SCM_SVC_IO, QCOM_SCM_IO_READ,
+  &desc, &res);
+   if (ret >= 0)
+   *val = res.a1;
+
+   return ret < 0 ? ret : 0;
+}
+
+int __qcom_scm_io_writel_atomic(struct device *dev, phys_addr_t addr,
+   unsigned int val)
+{
+   struct qcom_scm_desc desc = {0};
+   struct arm_smccc_res res;
+
+   desc.args[0] = addr;
+   desc.args[1] = val;
+   desc.arginfo = QCOM_SCM_ARGS(2);
+
+   return qcom_scm_call_atomic(dev, QCOM_SCM_SVC_IO, QCOM_SCM_IO_WRITE,
+   &desc, &res);
+}
diff --git a/drivers/firmware/qcom_scm.c b/drivers/firmware/qcom_scm.c
index e778af766fae..36dab37f 100644
--- a/drivers/firmware/qcom_scm.c
+++ b/drivers/firmware/qcom_scm.c
@@ -365,6 +365,18 @@ int qcom_scm_io_writel(phys_addr_t addr, unsigned int val)
 }
 EXPORT_SYMBOL(qcom_scm_io_writel);
 
+int qcom_scm_io_readl_atomic(phys_addr_t addr, unsigned int *val)
+{
+   return __qcom_scm_io_readl_atomic(__scm->dev, addr, val);
+}
+EXPORT_SYMBOL(qcom_scm_io_readl_atomic);
+
+int qcom_scm_io_writel_atomic(phys_addr_t addr, unsigned int val)
+{
+   return __qcom_scm_io_writel_atomic(__scm->dev, addr, val);
+}
+EXPORT_SYMBOL(qcom_scm_io_writel_atomic);
+
 static void qcom_scm_set_download_mode(bool enable)
 {
bool avail;
diff --git a/drivers/firmware/qcom_scm.h b/drivers/firmware/qcom_scm.h
index dcd7f7917fc7..bb176107f51e 100644
--- a/drivers/firmware/qcom_scm.h
+++ b/drivers/firmware/qcom_scm.h
@@ -37,6 +37,10 @@ extern void __qcom_scm_cpu_power_down(u32 flags);
 #define QCOM_SCM_IO_WRITE  0x2
 extern int __qcom_scm_io_readl(struct device *dev, phys_addr_t addr, unsigned 
int *val);
 extern int __qcom_scm_io_writel(struct device *dev, phys_addr_t addr, unsigned 
int val);
+extern int __qcom_scm_io_readl_atomic(struct device *dev, phys_addr_t addr,
+ unsigned int *val);
+extern int __qcom_scm_io_writel_atomic(struct device *dev, phys_addr_t addr,
+  unsigned int val);
 
 #define QCOM_SCM_SVC_INFO  0x6
 #define QCOM_IS_CALL_AVAIL_CMD 0x1
diff --git a/include/linux/qcom_scm.h b/include/linux/qcom_scm.h
index 5d65521260b3..6a5d0c98b328 100644
--- a/include/linux/qcom_scm.h
+++ b/include/linux/qcom_scm.h
@@ -64,6 +64,8 @@ extern int qcom_scm_iommu_secure_ptbl_size(u32 spare, size_t 
*size);
 extern int qcom_scm_iommu_secure_ptbl_init(u64 addr, u32 size, u32 spare);
 extern int qcom_scm_io_readl(phys_addr_t addr, unsigned int *val);
 extern int qcom_scm_io_writel(phys_addr_t addr, unsigned int val);
+extern int qcom_scm_io_readl_atomic(phys_addr_t addr, unsigned int *val);
+extern int qcom_scm_io_writel_atomic(phys_addr_t addr, unsigned int val);
 #else
 static inline
 int qcom_scm_set_cold_boot_addr(void *entry, const cpumask_t *cpus)
@@ -100,5 +102,7 @@ static inline int qcom_scm_iommu_secure_ptbl_size(u32 
spare, size_t *size) { ret
 static inline int qcom_scm_iommu_secure_ptbl_init(u64 addr, u32 size, u32 
spare) { return -ENODEV; }
 static inline int 

[PATCH 0/5] Qcom smmu-500 TLB invalidation errata for sdm845

2018-08-14 Thread Vivek Gautam
Qcom's implementation of arm,mmu-500 on sdm845 has a functional/performance
erratum [1] because of which TCU cache lookups are stalled during the
invalidation cycle. This is mitigated by serializing all the invalidation
requests coming to the smmu.

This patch series addresses this errata by adding new tlb_ops for
qcom,sdm845-smmu-500 [2]. These ops take context bank locks for all the
tlb_ops that queue and sync the TLB invalidation requests.

Besides adding locks, there's a way to expedite these TLB invalidations
for display and camera devices by turning off the 'wait-for-safe' logic
in hardware that holds the TLB invalidations until a safe level.
This 'wait-for-safe' logic is controlled by toggling a chicken bit
through a secure register. This secure register is accessed by making an
explicit SCM call into the EL3 firmware.
There are two ways of handling this logic -
 * Firmware, such as the tz present on sdm845-mtp devices, has a handler to do
   all the register access and bit set/clear. This matches the handling in the
   downstream arm-smmu driver [3].
 * Other firmwares can have handlers to just read/write this secure
   register. In such cases the kernel makes io_read/writel SCM calls to
   modify the register.
This patch series adds APIs in qcom-scm driver to handle both of these
cases.

Lastly, since these TLB invalidations can happen in atomic contexts,
there's a need to add atomic versions of the qcom_scm_io_readl/writel() and
qcom_scm_call() APIs. The traditional SCM calls take a mutex and we therefore
can't use them in atomic contexts.
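
Patch 1 itself is not quoted in this digest. As a rough sketch of the
idea (an illustration with made-up names, not the actual implementation),
an atomic SCM call boils down to issuing the SMC under a spinlock instead
of the sleeping, mutex-protected slow path:

#include <linux/arm-smccc.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(scm_atomic_lock);

/* Issue a fast SMCCC call without sleeping, so it is usable from
 * atomic context, unlike the mutex-taking qcom_scm_call(). */
static int qcom_scm_call_atomic_sketch(u32 fn_id, u64 arg0, u64 arg1,
				       struct arm_smccc_res *res)
{
	unsigned long flags;

	spin_lock_irqsave(&scm_atomic_lock, flags);
	arm_smccc_smc(fn_id, arg0, arg1, 0, 0, 0, 0, 0, res);
	spin_unlock_irqrestore(&scm_atomic_lock, flags);

	return (long)res->a0 ? -EINVAL : 0;
}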

This patch series is an adapted version of how the erratum is handled in
downstream [1].

[1] 
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/iommu/arm-smmu.c?h=msm-4.9#n4842
[2] https://lore.kernel.org/patchwork/patch/974114/
[3] 
https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/iommu/arm-smmu.c?h=msm-4.9#n4864

Vivek Gautam (5):
  firmware: qcom_scm-64: Add atomic version of qcom_scm_call
  firmware/qcom_scm: Add atomic version of io read/write APIs
  firmware/qcom_scm: Add scm call to handle smmu errata
  iommu/arm-smmu: Make way to add Qcom's smmu-500 errata handling
  iommu/arm-smmu: Add support to handle Qcom's TLBI serialization errata

 drivers/firmware/qcom_scm-32.c |  17 
 drivers/firmware/qcom_scm-64.c | 181 +++--
 drivers/firmware/qcom_scm.c|  18 
 drivers/firmware/qcom_scm.h|   9 ++
 drivers/iommu/arm-smmu-regs.h  |   2 +
 drivers/iommu/arm-smmu.c   | 168 --
 include/linux/qcom_scm.h   |   6 ++
 7 files changed, 348 insertions(+), 53 deletions(-)

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



[PATCH 3/5] firmware/qcom_scm: Add scm call to handle smmu errata

2018-08-14 Thread Vivek Gautam
Qcom's smmu-500 needs to toggle the wait-for-safe sequence to
handle TLB invalidation syncs.
A few firmwares allow doing that through the SCM interface.
Add an API to toggle wait-for-safe from firmware through an
SCM call.

Signed-off-by: Vivek Gautam 
---
 drivers/firmware/qcom_scm-32.c |  5 +
 drivers/firmware/qcom_scm-64.c | 13 +
 drivers/firmware/qcom_scm.c|  6 ++
 drivers/firmware/qcom_scm.h|  5 +
 include/linux/qcom_scm.h   |  2 ++
 5 files changed, 31 insertions(+)

diff --git a/drivers/firmware/qcom_scm-32.c b/drivers/firmware/qcom_scm-32.c
index 7293e5efad69..2d301ad053f8 100644
--- a/drivers/firmware/qcom_scm-32.c
+++ b/drivers/firmware/qcom_scm-32.c
@@ -639,3 +639,8 @@ int __qcom_scm_io_writel_atomic(struct device *dev, 
phys_addr_t addr,
 {
return -ENODEV;
 }
+
+int __qcom_scm_qsmmu500_wait_safe_toggle(struct device *dev, bool enable)
+{
+   return -ENODEV;
+}
diff --git a/drivers/firmware/qcom_scm-64.c b/drivers/firmware/qcom_scm-64.c
index 6bf55403f6e3..f13bcabc5d78 100644
--- a/drivers/firmware/qcom_scm-64.c
+++ b/drivers/firmware/qcom_scm-64.c
@@ -590,3 +590,16 @@ int __qcom_scm_io_writel_atomic(struct device *dev, 
phys_addr_t addr,
return qcom_scm_call_atomic(dev, QCOM_SCM_SVC_IO, QCOM_SCM_IO_WRITE,
    &desc, &res);
 }
+
+int __qcom_scm_qsmmu500_wait_safe_toggle(struct device *dev, bool en)
+{
+   struct qcom_scm_desc desc = {0};
+   struct arm_smccc_res res;
+
+   desc.args[0] = QCOM_SCM_CONFIG_ERRATA1_CLIENT_ALL;
+   desc.args[1] = en;
+   desc.arginfo = QCOM_SCM_ARGS(2);
+
+   return qcom_scm_call_atomic(dev, QCOM_SCM_SVC_SMMU_PROGRAM,
QCOM_SCM_CONFIG_ERRATA1, &desc, &res);
+}
diff --git a/drivers/firmware/qcom_scm.c b/drivers/firmware/qcom_scm.c
index 36dab37f..5f15cc2e9f69 100644
--- a/drivers/firmware/qcom_scm.c
+++ b/drivers/firmware/qcom_scm.c
@@ -353,6 +353,12 @@ int qcom_scm_iommu_secure_ptbl_init(u64 addr, u32 size, 
u32 spare)
 }
 EXPORT_SYMBOL(qcom_scm_iommu_secure_ptbl_init);
 
+int qcom_scm_qsmmu500_wait_safe_toggle(bool en)
+{
+   return __qcom_scm_qsmmu500_wait_safe_toggle(__scm->dev, en);
+}
+EXPORT_SYMBOL(qcom_scm_qsmmu500_wait_safe_toggle);
+
 int qcom_scm_io_readl(phys_addr_t addr, unsigned int *val)
 {
return __qcom_scm_io_readl(__scm->dev, addr, val);
diff --git a/drivers/firmware/qcom_scm.h b/drivers/firmware/qcom_scm.h
index bb176107f51e..89a822c23e33 100644
--- a/drivers/firmware/qcom_scm.h
+++ b/drivers/firmware/qcom_scm.h
@@ -103,10 +103,15 @@ extern int __qcom_scm_restore_sec_cfg(struct device *dev, 
u32 device_id,
  u32 spare);
 #define QCOM_SCM_IOMMU_SECURE_PTBL_SIZE3
 #define QCOM_SCM_IOMMU_SECURE_PTBL_INIT4
+#define QCOM_SCM_SVC_SMMU_PROGRAM  0x15
+#define QCOM_SCM_CONFIG_ERRATA10x3
+#define QCOM_SCM_CONFIG_ERRATA1_CLIENT_ALL 0x2
 extern int __qcom_scm_iommu_secure_ptbl_size(struct device *dev, u32 spare,
 size_t *size);
 extern int __qcom_scm_iommu_secure_ptbl_init(struct device *dev, u64 addr,
 u32 size, u32 spare);
+extern int __qcom_scm_qsmmu500_wait_safe_toggle(struct device *dev,
+   bool enable);
 #define QCOM_MEM_PROT_ASSIGN_ID0x16
 extern int  __qcom_scm_assign_mem(struct device *dev,
  phys_addr_t mem_region, size_t mem_sz,
diff --git a/include/linux/qcom_scm.h b/include/linux/qcom_scm.h
index 6a5d0c98b328..46e6b1692998 100644
--- a/include/linux/qcom_scm.h
+++ b/include/linux/qcom_scm.h
@@ -62,6 +62,7 @@ extern int qcom_scm_set_remote_state(u32 state, u32 id);
 extern int qcom_scm_restore_sec_cfg(u32 device_id, u32 spare);
 extern int qcom_scm_iommu_secure_ptbl_size(u32 spare, size_t *size);
 extern int qcom_scm_iommu_secure_ptbl_init(u64 addr, u32 size, u32 spare);
+extern int qcom_scm_qsmmu500_wait_safe_toggle(bool en);
 extern int qcom_scm_io_readl(phys_addr_t addr, unsigned int *val);
 extern int qcom_scm_io_writel(phys_addr_t addr, unsigned int val);
 extern int qcom_scm_io_readl_atomic(phys_addr_t addr, unsigned int *val);
@@ -100,6 +101,7 @@ qcom_scm_set_remote_state(u32 state,u32 id) { return 
-ENODEV; }
 static inline int qcom_scm_restore_sec_cfg(u32 device_id, u32 spare) { return 
-ENODEV; }
 static inline int qcom_scm_iommu_secure_ptbl_size(u32 spare, size_t *size) { 
return -ENODEV; }
 static inline int qcom_scm_iommu_secure_ptbl_init(u64 addr, u32 size, u32 
spare) { return -ENODEV; }
+static inline int qcom_scm_qsmmu500_wait_safe_toggle(bool en) { return 
-ENODEV; }
 static inline int qcom_scm_io_readl(phys_addr_t addr, unsigned int *val) { 
return -ENODEV; }
 static inline int qcom_scm_io_writel(phys_addr_t addr, unsigned int val) { 
return -ENODEV; }
 static inline int qcom_scm_io_readl_atomic(phys_addr_t addr, unsigned int 
*val) { return -ENODEV; }
-- 

[PATCH v2 3/3] dts: arm64/sdm845: Add node for qcom,smmu-v2

2018-08-14 Thread Vivek Gautam
Add device node for qcom,smmu-v2 available on sdm845.
This smmu is available only to GPU device.

Signed-off-by: Vivek Gautam 
---
 arch/arm64/boot/dts/qcom/sdm845.dtsi | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi 
b/arch/arm64/boot/dts/qcom/sdm845.dtsi
index 1c2be2082f33..bd1ec5fa5146 100644
--- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
@@ -6,6 +6,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -989,6 +990,28 @@
cell-index = <0>;
};
 
+   gpu_smmu: iommu@504 {
+   compatible = "qcom,sdm845-smmu-v2", "qcom,smmu-v2";
+   reg = <0x504 0x1>;
+   #iommu-cells = <1>;
+   #global-interrupts = <2>;
+   interrupts = ,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   clock-names = "bus", "iface";
+   clocks = < GCC_GPU_MEMNOC_GFX_CLK>,
+< GCC_GPU_CFG_AHB_CLK>;
+
+   /*power-domains = < GPU_CX_GDSC>;*/
+   };
+
apps_smmu: iommu@1500 {
compatible = "qcom,sdm845-smmu-500", "arm,mmu-500";
reg = <0x1500 0x8>;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



Re: [PATCH v3 4/6] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-14 Thread Will Deacon
On Tue, Aug 14, 2018 at 04:33:41PM +0800, Leizhen (ThunderTown) wrote:
> On 2018/8/6 9:32, Yang, Shunyong wrote:
> > On 2018/7/26 22:37, Robin Murphy wrote:
> >> Because DMA code is not the only caller of iommu_map/unmap. It's 
> >> perfectly legal in the IOMMU API to partially unmap a previous mapping 
> >> such that a block entry needs to be split. The DMA API, however, is a 
> >> lot more constrained, and thus by construction the iommu-dma layer will 
> >> never generate a block-splitting iommu_unmap() except as a result of 
> >> illegal DMA API usage, and we obviously do not need to optimise for that 
> >> (you will get a warning about mismatched unmaps under dma-debug, but 
> >> it's a bit too expensive to police in the general case).
> >>
> > 
> > When I was reading the code around arm_lpae_split_blk_unmap(), I was
> > curious in which scenario a block will be split. Now with your comments
> > "Because DMA code is not the only caller of iommu_map/unmap", it seems
> > depending on the user.
> > 
> > Would you please explain this further? I mean besides DMA, which user
> > will use iommu_map/umap and how it split a block.
> 
> I also think the arm_lpae_split_blk_unmap() scenario does not exist; maybe
> we should remove it and give a warning for this wrong usage.

Can't it happen with VFIO?

Will


[PATCH v2 0/3] Enable smmu support on sdm845

2018-08-14 Thread Vivek Gautam
This series enables apps-smmu (arm,mmu-500) and gpu-smmu (qcom,smmu-v2)
on sdm845. gpu-smmu needs one power domain from the gpu clock controller,
whose driver was sent by Amit [1].

Changes since v1:
 - Addressed Rob's review comments by adding an SoC-specific compatible.
   Have added a new dt-bindings patch for this.
 - Updated node name to 'iommu'.
 - Addressed Doug's review comment about removing status property from
   smmu's nodes, as smmu is either present on the soc or not. Enabling
   it is not a board-level decision.

[1] https://lore.kernel.org/patchwork/patch/973839/

Vivek Gautam (3):
  dt-bindings: arm-smmu: Add binding doc for Qcom smmu-500
  dts: arm64/sdm845: Add node for arm,mmu-500
  dts: arm64/sdm845: Add node for qcom,smmu-v2

 .../devicetree/bindings/iommu/arm,smmu.txt |  5 ++
 arch/arm64/boot/dts/qcom/sdm845.dtsi   | 95 ++
 2 files changed, 100 insertions(+)

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



[PATCH v2 1/3] dt-bindings: arm-smmu: Add binding doc for Qcom smmu-500

2018-08-14 Thread Vivek Gautam
Qcom's implementation of arm,mmu-500 works well with the current
arm-smmu driver implementation. Adding an SoC-specific compatible
along with arm,mmu-500 makes the bindings future-proof.

Signed-off-by: Vivek Gautam 
---
 Documentation/devicetree/bindings/iommu/arm,smmu.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.txt 
b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
index 7c71a6ed465a..7d73b2a259fc 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.txt
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
@@ -18,6 +18,7 @@ conditions.
 "arm,mmu-500"
 "cavium,smmu-v2"
 "qcom,-smmu-v2", "qcom,smmu-v2"
+"qcom,-smmu-500", "arm,mmu-500"
 
   depending on the particular implementation and/or the
   version of the architecture implemented.
@@ -30,6 +31,10 @@ conditions.
   An example string would be -
   "qcom,msm8996-smmu-v2", "qcom,smmu-v2".
 
+ "qcom,-smmu-500" compatible string represents qcom's soc
+ specific implementation of arm,mmu-500, and should be present
+ along with "arm,mmu-500".
+
 - reg   : Base address and size of the SMMU.
 
 - #global-interrupts : The number of global interrupts exposed by the
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



[PATCH v2 2/3] dts: arm64/sdm845: Add node for arm,mmu-500

2018-08-14 Thread Vivek Gautam
Add device node for arm,mmu-500 available on sdm845.
This MMU-500, with a single-TCU, multiple-TBU architecture,
is shared among all the peripherals except the GPU on sdm845.

Signed-off-by: Vivek Gautam 
---
 arch/arm64/boot/dts/qcom/sdm845.dtsi | 72 
 1 file changed, 72 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi 
b/arch/arm64/boot/dts/qcom/sdm845.dtsi
index eb4ab33bf6f4..1c2be2082f33 100644
--- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
@@ -989,6 +989,78 @@
cell-index = <0>;
};
 
+   apps_smmu: iommu@1500 {
+   compatible = "qcom,sdm845-smmu-500", "arm,mmu-500";
+   reg = <0x1500 0x8>;
+   #iommu-cells = <2>;
+   #global-interrupts = <1>;
+   interrupts = ,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   };
+
apss_shared: mailbox@1799 {
compatible = "qcom,sdm845-apss-shared";
reg = <0x1799 0x1000>;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



[PATCH 5/5] iommu/arm-smmu: Add support to handle Qcom's TLBI serialization errata

2018-08-14 Thread Vivek Gautam
Qcom's implementation of arm,mmu-500 requires serializing all
TLB invalidations for context banks.
In case the TLB invalidation requests don't go through the first
time, there's a way to disable/enable the wait-for-safe logic.
Disabling this logic expedites the TLBIs.

Different bootloaders with their access control policies allow this
register access differently. With one, we should be able to directly
make a qcom-scm call to do io read/write, while with the other we should
use the specific SCM command to send a request to do the complete
register configuration.
A separate device tree flag for arm-smmu will allow us to identify
which of the two firmware configurations mentioned above we use.

Signed-off-by: Vivek Gautam 
---
 drivers/iommu/arm-smmu-regs.h |   2 +
 drivers/iommu/arm-smmu.c  | 136 +-
 2 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-regs.h b/drivers/iommu/arm-smmu-regs.h
index a1226e4ab5f8..71662cae9806 100644
--- a/drivers/iommu/arm-smmu-regs.h
+++ b/drivers/iommu/arm-smmu-regs.h
@@ -177,6 +177,8 @@ enum arm_smmu_s2cr_privcfg {
 #define ARM_SMMU_CB_ATS1PR 0x800
 #define ARM_SMMU_CB_ATSR   0x8f0
 
+#define ARM_SMMU_GID_QCOM_CUSTOM_CFG   0x300
+
 #define SCTLR_S1_ASIDPNE   (1 << 12)
 #define SCTLR_CFCFG(1 << 7)
 #define SCTLR_CFIE (1 << 6)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 75c146751c87..fafdaeb4d097 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -179,7 +180,8 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_EXIDS(1 << 12)
u32 features;
 
-#define ARM_SMMU_OPT_SECURE_CFG_ACCESS (1 << 0)
+#define ARM_SMMU_OPT_SECURE_CFG_ACCESS  (1 << 0)
+#define ARM_SMMU_OPT_QCOM_FW_IMPL_ERRATA (1 << 1)
u32 options;
enum arm_smmu_arch_version  version;
enum arm_smmu_implementationmodel;
@@ -262,6 +264,7 @@ static bool using_legacy_binding, using_generic_binding;
 
 static struct arm_smmu_option_prop arm_smmu_options[] = {
{ ARM_SMMU_OPT_SECURE_CFG_ACCESS, "calxeda,smmu-secure-config-access" },
+   { ARM_SMMU_OPT_QCOM_FW_IMPL_ERRATA, "qcom,smmu-500-fw-impl-errata" },
{ 0, NULL},
 };
 
@@ -531,12 +534,137 @@ static void arm_smmu_tlb_inv_vmid_nosync(unsigned long 
iova, size_t size,
writel_relaxed(smmu_domain->cfg.vmid, base + ARM_SMMU_GR0_TLBIVMID);
 }
 
+#define CUSTOM_CFG_MDP_SAFE_ENABLE BIT(15)
+#define CUSTOM_CFG_IFE1_SAFE_ENABLEBIT(14)
+#define CUSTOM_CFG_IFE0_SAFE_ENABLEBIT(13)
+
+static int __qsmmu500_wait_safe_toggle(struct arm_smmu_device *smmu, int en)
+{
+   int ret;
+   u32 val, gid_phys_base;
+   phys_addr_t reg;
+   struct vm_struct *vm;
+
+   /* We want physical address of SMMU, so the vm_area */
+   vm = find_vm_area(smmu->base);
+
+   /*
+* GID (implementation defined address space) is located at
+* SMMU_BASE + (2 × PAGESIZE).
+*/
+   gid_phys_base = vm->phys_addr + (2 << (smmu)->pgshift);
+   reg = gid_phys_base + ARM_SMMU_GID_QCOM_CUSTOM_CFG;
+
+   ret = qcom_scm_io_readl_atomic(reg, &val);
+   if (ret)
+   return ret;
+
+   if (en)
+   val |= CUSTOM_CFG_MDP_SAFE_ENABLE |
+  CUSTOM_CFG_IFE0_SAFE_ENABLE |
+  CUSTOM_CFG_IFE1_SAFE_ENABLE;
+   else
+   val &= ~(CUSTOM_CFG_MDP_SAFE_ENABLE |
+CUSTOM_CFG_IFE0_SAFE_ENABLE |
+CUSTOM_CFG_IFE1_SAFE_ENABLE);
+
+   ret = qcom_scm_io_writel_atomic(reg, val);
+
+   return ret;
+}
+
+static int qsmmu500_wait_safe_toggle(struct arm_smmu_device *smmu,
+int en, bool is_fw_impl)
+{
+   if (is_fw_impl)
+   return qcom_scm_qsmmu500_wait_safe_toggle(en);
+   else
+   return __qsmmu500_wait_safe_toggle(smmu, en);
+}
+
+static void qcom_errata_tlb_sync(struct arm_smmu_domain *smmu_domain)
+{
+   struct arm_smmu_device *smmu = smmu_domain->smmu;
+   void __iomem *base = ARM_SMMU_CB(smmu, smmu_domain->cfg.cbndx);
+   void __iomem *status = base + ARM_SMMU_CB_TLBSTATUS;
+   bool is_fw_impl;
+
+   writel_relaxed(0, base + ARM_SMMU_CB_TLBSYNC);
+
+   if (!__arm_smmu_tlb_sync_wait(status))
+   return;
+
+   is_fw_impl = smmu->options & ARM_SMMU_OPT_QCOM_FW_IMPL_ERRATA ?
+   true : false;
+
+   /* SCM call here to disable the wait-for-safe logic. */
+   if (WARN(qsmmu500_wait_safe_toggle(smmu, false, is_fw_impl),
+"Failed to disable wait-safe logic, bad hw state\n"))
+   return;
+
+   if (!__arm_smmu_tlb_sync_wait(status))
+   return;
+
+ 

[PATCH 4/5] iommu/arm-smmu: Make way to add Qcom's smmu-500 errata handling

2018-08-14 Thread Vivek Gautam
Cleanup to allow re-use of the TLB sync and invalidate code.

Signed-off-by: Vivek Gautam 
---
 drivers/iommu/arm-smmu.c | 32 +---
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 32e86df80428..75c146751c87 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -391,21 +391,31 @@ static void __arm_smmu_free_bitmap(unsigned long *map, 
int idx)
clear_bit(idx, map);
 }
 
-/* Wait for any pending TLB invalidations to complete */
-static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
-   void __iomem *sync, void __iomem *status)
+static int __arm_smmu_tlb_sync_wait(void __iomem *status)
 {
unsigned int spin_cnt, delay;
 
-   writel_relaxed(0, sync);
for (delay = 1; delay < TLB_LOOP_TIMEOUT; delay *= 2) {
for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) {
if (!(readl_relaxed(status) & sTLBGSTATUS_GSACTIVE))
-   return;
+   return 0;
cpu_relax();
}
udelay(delay);
}
+
+   return -EBUSY;
+}
+
+/* Wait for any pending TLB invalidations to complete */
+static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
+   void __iomem *sync, void __iomem *status)
+{
+   writel_relaxed(0, sync);
+
+   if (!__arm_smmu_tlb_sync_wait(status))
+   return;
+
dev_err_ratelimited(smmu->dev,
"TLB sync timed out -- SMMU may be deadlocked\n");
 }
@@ -461,8 +471,9 @@ static void arm_smmu_tlb_inv_context_s2(void *cookie)
arm_smmu_tlb_sync_global(smmu);
 }
 
-static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
- size_t granule, bool leaf, void *cookie)
+static void __arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
+   size_t granule, bool leaf,
+   void *cookie)
 {
struct arm_smmu_domain *smmu_domain = cookie;
struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
@@ -498,6 +509,13 @@ static void arm_smmu_tlb_inv_range_nosync(unsigned long 
iova, size_t size,
}
 }
 
+static void arm_smmu_tlb_inv_range_nosync(unsigned long iova, size_t size,
+ size_t granule, bool leaf,
+ void *cookie)
+{
+   __arm_smmu_tlb_inv_range_nosync(iova, size, granule, leaf, cookie);
+}
+
 /*
  * On MMU-401 at least, the cost of firing off multiple TLBIVMIDs appears
  * almost negligible, but the benefit of getting the first one in as far ahead
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



Question on iommu_get_domain_for_dev() for DMA map/unmap

2018-08-14 Thread John Garry

Hi All,

I have a question on function iommu_get_domain_for_dev() in DMA mapping 
path, and why we need to get+put a reference to the iommu group.


The background is that we have been testing iperf throughput performance 
for a PCIe NIC card behind an SMMUv3, with small packets and many 
threads (128) on a system with a lot of cores (>64).


We find that the reference counting on the iommu group in 
iommu_dma_map_sg()->iommu_get_domain_for_dev()->kobject_get()/put() is a 
big bottleneck. There is much contention on getting and putting the 
reference to iommu_group.device_kobj.


The question is do we actually need to get+put this reference for DMA 
map/unmap, since the device already takes a reference to the 
iommu_group.device_kobj when attached to the group? Is it to protect 
against the device being removed and the group being freed as it is 
dereferenced? If so, I would say it's not safe to reference the 
associated iommu domain thereafter in the DMA map/unmap, as it would be 
freed along with the iommu group.
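
For reference, the lookup in question is essentially the following
(paraphrased from drivers/iommu/iommu.c of this vintage; iommu_group_get()
and iommu_group_put() boil down to kobject_get()/kobject_put() on the
group's devices kobject):

struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
{
	struct iommu_domain *domain;
	struct iommu_group *group;

	group = iommu_group_get(dev);	/* kobject_get() - the contended atomic */
	if (!group)
		return NULL;

	domain = group->domain;
	iommu_group_put(group);		/* kobject_put() */

	return domain;
}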


Thanks in advance,
John



Re: Question on iommu_get_domain_for_dev() for DMA map/unmap

2018-08-14 Thread Robin Murphy

Hi John,

On 14/08/18 11:09, John Garry wrote:

Hi All,

I have a question on function iommu_get_domain_for_dev() in DMA mapping 
path, and why we need to get+put a reference to the iommu group.


The background is that we have been testing iperf throughput performance 
for a PCIe NIC card behind an SMMUv3, with small packets and many 
threads (128) on a system with a lot of cores (>64).


We find that the reference counting on the iommu group in 
iommu_dma_map_sg()->iommu_get_domain_for_dev()->kobject_get()/put() is a 
big bottleneck. There is much contention on getting and putting the 
reference to iommu_group.device_kobj.


The question is do we actually need to get+put this reference for DMA 
map/unmap, since the device already takes a reference to the 
iommu_group.device_kobj when attached to the group? Is it to protect 
against the device being removed and the group being freed as it is 
dereferenced? If so, I would say it's not safe to reference the 
associated iommu domain thereafter in the DMA map/unmap, as it would be 
freed along with the iommu group.


Yes, I agree that the reference is something we don't strictly need - if 
the group disappears while the device is in the middle of DMA things are 
beyond broken. However, the problem for iommu-dma as "generic" code was 
that since dev->iommu_group is opaque, there was no other way to 
actually retrieve the domain pointer.


Now that hardware exists to show a real and measurable impact it 
probably is time to revisit this. I have an idea for what I think is a 
reasonable patch - let me throw that together...
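
For context, one plausible shape for such a patch, sketched under the
assumption that a device actively doing DMA cannot lose its group, so
the refcount can simply be skipped; the helper name is illustrative, and
it would have to live in the IOMMU core, where struct iommu_group is
visible:

static struct iommu_domain *iommu_get_dma_domain(struct device *dev)
{
	/* No kobject_get()/put(): the caller's in-flight DMA keeps the
	 * group alive, as argued above. */
	return dev->iommu_group->domain;
}

iommu-dma could then call a helper like this instead of
iommu_get_domain_for_dev() in its map/unmap fast paths.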


Robin.


Re: [PATCH v2 3/3] dts: arm64/sdm845: Add node for qcom,smmu-v2

2018-08-14 Thread Robin Murphy

Hi Vivek,

On 14/08/18 11:27, Vivek Gautam wrote:

Add device node for qcom,smmu-v2 available on sdm845.
This smmu is available only to GPU device.

Signed-off-by: Vivek Gautam 
---
  arch/arm64/boot/dts/qcom/sdm845.dtsi | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi 
b/arch/arm64/boot/dts/qcom/sdm845.dtsi
index 1c2be2082f33..bd1ec5fa5146 100644
--- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
@@ -6,6 +6,7 @@
   */
  
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -989,6 +990,28 @@
cell-index = <0>;
};
  
+		gpu_smmu: iommu@504 {

+   compatible = "qcom,sdm845-smmu-v2", "qcom,smmu-v2";


Which of "sdm845" or "msm8996"[1] is the actual SoC name here?

Robin.

[1] 
https://www.mail-archive.com/freedreno@lists.freedesktop.org/msg02659.html



+   reg = <0x504 0x1>;
+   #iommu-cells = <1>;
+   #global-interrupts = <2>;
+   interrupts = ,
+,
+,
+,
+,
+,
+,
+,
+,
+;
+   clock-names = "bus", "iface";
+   clocks = < GCC_GPU_MEMNOC_GFX_CLK>,
+< GCC_GPU_CFG_AHB_CLK>;
+
+   /*power-domains = < GPU_CX_GDSC>;*/
+   };
+
apps_smmu: iommu@1500 {
compatible = "qcom,sdm845-smmu-500", "arm,mmu-500";
reg = <0x1500 0x8>;




Re: [PATCH v2 3/3] dts: arm64/sdm845: Add node for qcom,smmu-v2

2018-08-14 Thread Rob Herring
On Wed, Aug 15, 2018 at 01:09:43AM +0530, Vivek Gautam wrote:
> Adding Jordan here.
> 
> On Tue, Aug 14, 2018 at 4:19 PM, Robin Murphy  wrote:
> > Hi Vivek,
> >
> > On 14/08/18 11:27, Vivek Gautam wrote:
> >>
> >> Add device node for qcom,smmu-v2 available on sdm845.
> >> This smmu is available only to GPU device.
> >>
> >> Signed-off-by: Vivek Gautam 
> >> ---
> >>   arch/arm64/boot/dts/qcom/sdm845.dtsi | 23 +++
> >>   1 file changed, 23 insertions(+)
> >>
> >> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >> b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >> index 1c2be2082f33..bd1ec5fa5146 100644
> >> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >> @@ -6,6 +6,7 @@
> >>*/
> >> #include 
> >> +#include 
> >>   #include 
> >>   #include 
> >>   #include 
> >> @@ -989,6 +990,28 @@
> >> cell-index = <0>;
> >> };
> >>   + gpu_smmu: iommu@504 {
> >> +   compatible = "qcom,sdm845-smmu-v2",
> >> "qcom,smmu-v2";
> >
> >
> > Which of "sdm845" or "msm8996"[1] is the actual SoC name here?
> 
> Well, the bindings use the SoC prefix with smmu-v2, so it should be
> sdm845 for this SoC. This is the same as what I posted in v1 of the series [2].
> Using an 8996-based string on sdm845 makes things look awful.

You need to list valid values of '<soc>' in the binding. Otherwise we 
get this confusion.

Rob


Re: [PATCH v3 4/6] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-14 Thread Yang, Shunyong
Hi, Robin,

On Tue, 2018-08-14 at 11:02 +0100, Robin Murphy wrote:
> On 14/08/18 09:35, Will Deacon wrote:
> > On Tue, Aug 14, 2018 at 04:33:41PM +0800, Leizhen (ThunderTown)
> > wrote:
> > > On 2018/8/6 9:32, Yang, Shunyong wrote:
> > > > On 2018/7/26 22:37, Robin Murphy wrote:
> > > > > Because DMA code is not the only caller of iommu_map/unmap.
> > > > > It's
> > > > > perfectly legal in the IOMMU API to partially unmap a
> > > > > previous mapping
> > > > > such that a block entry needs to be split. The DMA API,
> > > > > however, is a
> > > > > lot more constrained, and thus by construction the iommu-dma
> > > > > layer will
> > > > > never generate a block-splitting iommu_unmap() except as a
> > > > > result of
> > > > > illegal DMA API usage, and we obviously do not need to
> > > > > optimise for that
> > > > > (you will get a warning about mismatched unmaps under dma-
> > > > > debug, but
> > > > > it's a bit too expensive to police in the general case).
> > > > > 
> > > > 
> > > > When I was reading the code around arm_lpae_split_blk_unmap(),
> > > > I was
> > > > curious in which scenario a block will be split. Now with your
> > > > comments
> > > > "Because DMA code is not the only caller of iommu_map/unmap",
> > > > it seems
> > > > depending on the user.
> > > > 
> > > > Would you please explain this further? I mean besides DMA,
> > > > which user
> > > > will use iommu_map/umap and how it split a block.
> > > 
> > > I also think the arm_lpae_split_blk_unmap() scenario does not
> > > exist; maybe
> > > we should remove it and give a warning for this wrong usage.
> > 
> > Can't it happen with VFIO?
> 
> ...or GPU drivers, or anyone else managing their own IOMMU domain 
> directly. A sequence like this is perfectly legal:
> 
>   iommu_map(domain, iova, paddr, SZ_8M, prot);
>   ...
>   iommu_unmap(domain, iova + SZ_1M * 5, SZ_1M * 3);
> 
> where if iova and paddr happen to be suitably aligned, the map will
> lay 
> down blocks, and the unmap will then have to split one of them into 
> pages to remove half of it. We don't tear our hair out maintaining 
> split_blk_unmap() for the fun of it :(

Thank you for the GPU example. But for VFIO, I remember all memory will
be pinned in the early stage of emulator (such as qemu) startup. So at
which operation will the split occur? Maybe virtio balloon inflate?

Thanks.
Shunyong.

> 
> Robin.


[PATCH v5 3/5] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-14 Thread Zhen Lei
To support non-strict mode, we now only tlbi and sync in strict
mode. But for the non-leaf case, always follow strict mode.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/io-pgtable-arm.c | 20 ++--
 drivers/iommu/io-pgtable.h |  3 +++
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 010a254..20d3e98 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -538,6 +538,7 @@ static size_t arm_lpae_split_blk_unmap(struct 
arm_lpae_io_pgtable *data,
phys_addr_t blk_paddr;
size_t tablesz = ARM_LPAE_GRANULE(data);
size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+   size_t unmapped = size;
int i, unmap_idx = -1;

if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
@@ -575,11 +576,16 @@ static size_t arm_lpae_split_blk_unmap(struct 
arm_lpae_io_pgtable *data,
tablep = iopte_deref(pte, data);
}

-   if (unmap_idx < 0)
-   return __arm_lpae_unmap(data, iova, size, lvl, tablep);
+   if (unmap_idx < 0) {
+   unmapped = __arm_lpae_unmap(data, iova, size, lvl, tablep);
+   if (!(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT))
+   return unmapped;
+   }

io_pgtable_tlb_add_flush(&data->iop, iova, size, size, true);
-   return size;
+   io_pgtable_tlb_sync(&data->iop);
+
+   return unmapped;
 }

 static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -609,7 +615,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable 
*data,
io_pgtable_tlb_sync(iop);
ptep = iopte_deref(pte, data);
__arm_lpae_free_pgtable(data, lvl + 1, ptep);
-   } else {
+   } else if (!(iop->cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT)) {
io_pgtable_tlb_add_flush(iop, iova, size, size, true);
}

@@ -771,7 +777,8 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg 
*cfg)
u64 reg;
struct arm_lpae_io_pgtable *data;

-   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA))
+   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA |
+   IO_PGTABLE_QUIRK_NON_STRICT))
return NULL;

data = arm_lpae_alloc_pgtable(cfg);
@@ -863,7 +870,8 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg 
*cfg)
struct arm_lpae_io_pgtable *data;

/* The NS quirk doesn't apply at stage 2 */
-   if (cfg->quirks & ~IO_PGTABLE_QUIRK_NO_DMA)
+   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_NO_DMA |
+   IO_PGTABLE_QUIRK_NON_STRICT))
return NULL;

data = arm_lpae_alloc_pgtable(cfg);
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 2df7909..beb14a3 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -71,12 +71,15 @@ struct io_pgtable_cfg {
 *  be accessed by a fully cache-coherent IOMMU or CPU (e.g. for a
 *  software-emulated IOMMU), such that pagetable updates need not
 *  be treated as explicit DMA data.
+* IO_PGTABLE_QUIRK_NON_STRICT: Put off TLBs invalidation and release
+*  memory first.
 */
#define IO_PGTABLE_QUIRK_ARM_NS BIT(0)
#define IO_PGTABLE_QUIRK_NO_PERMS   BIT(1)
#define IO_PGTABLE_QUIRK_TLBI_ON_MAPBIT(2)
#define IO_PGTABLE_QUIRK_ARM_MTK_4GBBIT(3)
#define IO_PGTABLE_QUIRK_NO_DMA BIT(4)
+   #define IO_PGTABLE_QUIRK_NON_STRICT BIT(5)
unsigned long   quirks;
unsigned long   pgsize_bitmap;
unsigned intias;
--
1.8.3




[PATCH v5 1/5] iommu/arm-smmu-v3: fix the implementation of flush_iotlb_all hook

2018-08-14 Thread Zhen Lei
.flush_iotlb_all cannot just wait for previous tlbi operations to be
completed, but should also invalidate all TLBs of the related domain.

Signed-off-by: Zhen Lei 
Reviewed-by: Robin Murphy 
---
 drivers/iommu/arm-smmu-v3.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 1d64710..4402187 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1770,6 +1770,14 @@ static int arm_smmu_map(struct iommu_domain *domain, 
unsigned long iova,
return ops->unmap(ops, iova, size);
 }

+static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+   if (smmu_domain->smmu)
+   arm_smmu_tlb_inv_context(smmu_domain);
+}
+
 static void arm_smmu_iotlb_sync(struct iommu_domain *domain)
 {
struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
@@ -1998,7 +2006,7 @@ static void arm_smmu_put_resv_regions(struct device *dev,
.map= arm_smmu_map,
.unmap  = arm_smmu_unmap,
.map_sg = default_iommu_map_sg,
-   .flush_iotlb_all= arm_smmu_iotlb_sync,
+   .flush_iotlb_all= arm_smmu_flush_iotlb_all,
.iotlb_sync = arm_smmu_iotlb_sync,
.iova_to_phys   = arm_smmu_iova_to_phys,
.add_device = arm_smmu_add_device,
--
1.8.3




[PATCH v5 5/5] iommu/arm-smmu-v3: add bootup option "iommu.non_strict"

2018-08-14 Thread Zhen Lei
Add a bootup option so that the system manager can choose which mode to
use. The default mode is strict.

Signed-off-by: Zhen Lei 
---
 Documentation/admin-guide/kernel-parameters.txt | 13 +
 drivers/iommu/arm-smmu-v3.c | 22 +-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 5cde1ff..cb9d043e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1720,6 +1720,19 @@
nobypass[PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.

+   iommu.non_strict=   [ARM64]
+   Format: { "0" | "1" }
+   0 - strict mode, default.
+   Release IOVAs after the related TLBs are completely
+   invalidated.
+   1 - non-strict mode.
+   Put off TLB invalidation and release memory first.
+   It's good for scatter-gather performance but lacks
+   full isolation: an untrusted device can access the
+   reused memory because the TLBs may still be valid.
+   Please take full consideration before choosing this
+   mode. Note that VFIO will always use strict mode.
+
iommu.passthrough=
[ARM64] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 61eb7ec..0eda90e 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -631,6 +631,26 @@ struct arm_smmu_option_prop {
{ 0, NULL},
 };

+static bool smmu_non_strict __read_mostly;
+
+static int __init arm_smmu_setup(char *str)
+{
+   int ret;
+
+   ret = kstrtobool(str, &smmu_non_strict);
+   if (ret)
+   return ret;
+
+   if (smmu_non_strict) {
+   pr_warn("WARNING: iommu non-strict mode is chosen.\n"
+   "It's good for scatter-gather performance but lacks 
full isolation\n");
+   add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+   }
+
+   return 0;
+}
+early_param("iommu.non_strict", arm_smmu_setup);
+
 static inline void __iomem *arm_smmu_page1_fixup(unsigned long offset,
 struct arm_smmu_device *smmu)
 {
@@ -1622,7 +1642,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain)
if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;

-   if (domain->type == IOMMU_DOMAIN_DMA) {
+   if ((domain->type == IOMMU_DOMAIN_DMA) && smmu_non_strict) {
domain->non_strict = true;
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
}
--
1.8.3




[PATCH v5 4/5] iommu/arm-smmu-v3: add support for non-strict mode

2018-08-14 Thread Zhen Lei
Dynamically choose strict or non-strict mode for page table config based
on the iommu domain type.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm-smmu-v3.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4402187..61eb7ec 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1622,6 +1622,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain)
if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;

+   if (domain->type == IOMMU_DOMAIN_DMA) {
+   domain->non_strict = true;
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+   }
+
	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
--
1.8.3




[PATCH v5 2/5] iommu/dma: add support for non-strict mode

2018-08-14 Thread Zhen Lei
1. Save the related domain pointer in struct iommu_dma_cookie, so that the
   iovad can call domain->ops->flush_iotlb_all to flush the TLB.
2. During the iommu domain initialization phase, check the domain->non_strict
   field to determine whether non-strict mode is in use. If so, call
   init_iova_flush_queue to register the iovad->flush_cb callback.
3. All unmap APIs (which include the iova-free step) eventually invoke
   __iommu_dma_unmap --> iommu_dma_free_iova. If the domain is non-strict,
   call queue_iova to defer the iova freeing, and omit the iommu_tlb_sync
   operation.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/dma-iommu.c | 29 -
 drivers/iommu/iommu.c |  1 +
 include/linux/iommu.h |  1 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index ddcbbdb..f0257e9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -55,6 +55,9 @@ struct iommu_dma_cookie {
};
	struct list_head	msi_page_list;
spinlock_t  msi_lock;
+
+   /* Only be assigned in non-strict mode, otherwise it's NULL */
+   struct iommu_domain *domain;
 };

 static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
@@ -257,6 +260,17 @@ static int iova_reserve_iommu_regions(struct device *dev,
return ret;
 }

+static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
+{
+   struct iommu_dma_cookie *cookie;
+   struct iommu_domain *domain;
+
+   cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
+   domain = cookie->domain;
+
+   domain->ops->flush_iotlb_all(domain);
+}
+
 /**
  * iommu_dma_init_domain - Initialise a DMA mapping domain
  * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
@@ -308,6 +322,14 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
}

init_iova_domain(iovad, 1UL << order, base_pfn);
+
+   if (domain->non_strict) {
+   BUG_ON(!domain->ops->flush_iotlb_all);
+
+   cookie->domain = domain;
+   init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, NULL);
+   }
+
if (!dev)
return 0;

@@ -390,6 +412,9 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
/* The MSI case is only ever cleaning up its most recent allocation */
if (cookie->type == IOMMU_DMA_MSI_COOKIE)
cookie->msi_iova -= size;
+   else if (cookie->domain)/* non-strict mode */
+   queue_iova(iovad, iova_pfn(iovad, iova),
+   size >> iova_shift(iovad), 0);
else
free_iova_fast(iovad, iova_pfn(iovad, iova),
size >> iova_shift(iovad));
@@ -405,7 +430,9 @@ static void __iommu_dma_unmap(struct iommu_domain *domain, dma_addr_t dma_addr,
dma_addr -= iova_off;
size = iova_align(iovad, size + iova_off);

-   WARN_ON(iommu_unmap(domain, dma_addr, size) != size);
+   WARN_ON(iommu_unmap_fast(domain, dma_addr, size) != size);
+   if (!domain->non_strict)
+   iommu_tlb_sync(domain);
iommu_dma_free_iova(cookie, dma_addr, size);
 }

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 63b3756..6255a69 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1263,6 +1263,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,

domain->ops  = bus->iommu_ops;
domain->type = type;
+   domain->non_strict = false;
/* Assume all sizes by default; the driver may override this later */
domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 19938ee..4bbcf39 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -88,6 +88,7 @@ struct iommu_domain_geometry {

 struct iommu_domain {
unsigned type;
+   bool non_strict;
const struct iommu_ops *ops;
unsigned long pgsize_bitmap;/* Bitmap of page sizes in use */
iommu_fault_handler_t handler;
--
1.8.3




[PATCH v5 0/5] add non-strict mode support for arm-smmu-v3

2018-08-14 Thread Zhen Lei
v4 -> v5:
1. change the type of the global variable and struct member named "non_strict"
   from "int" to "bool".
2. drop the unnecessary parameter "strict" of __arm_lpae_unmap which was added
   in v4.
3. change the boot option "arm_iommu" to "iommu.non_strict".
4. convert __iommu_dma_unmap to use iommu_unmap_fast()/iommu_tlb_sync(), because
   non-leaf unmaps still need to be synchronous.

Thanks to Robin for the review comments.

v3 -> v4:
1. Add a new member "non_strict" in struct iommu_domain to mark whether
   that domain uses non-strict mode or not. This helps us remove the
   capability which was added in the prior version.
2. Add a new quirk IO_PGTABLE_QUIRK_NON_STRICT, so that we can determine
   "strict mode" in io-pgtable-arm.c according to data->iop.cfg.quirks.
3. Rename the new boot option to "arm_iommu".

v2 -> v3:
Add a boot option "iommu_strict_mode" so that the system manager can choose
which mode to use. The first 5 patches have not changed.
+   iommu_strict_mode=  [arm-smmu-v3]
+   0 - strict mode (default)
+   1 - non-strict mode

v1 -> v2:
Use the lowest bit of the io_pgtable_ops.unmap's iova parameter to pass the
strict mode:
0, IOMMU_STRICT;
1, IOMMU_NON_STRICT;
Treat 0 as IOMMU_STRICT, so that the unmap operation stays compatible with
other IOMMUs which still use strict mode. In other words, this patch series
will not impact other IOMMU drivers. I tried adding a new quirk
IO_PGTABLE_QUIRK_NON_STRICT in io_pgtable_cfg.quirks, but it could not pass
the strict mode of the domain from the SMMUv3 driver to the io-pgtable
module.

Add a new member domain_non_strict in struct iommu_dma_cookie; this member
will only be initialized when the related domain and IOMMU driver support
non-strict mode.

v1:
Commonly, an IOMMU unmap operation follows the steps below:
1. remove the mapping in the page table for the specified iova range
2. execute a tlbi command to invalidate the mapping cached in the TLB
3. wait for the above tlbi operation to finish
4. free the IOVA resource
5. free the physical memory resource

This can be a problem when unmaps are very frequent: the combination of tlbi
and wait operations consumes a lot of time. A feasible method is to defer the
tlbi and iova-free operations; once a certain number has accumulated, or a
specified time has elapsed, execute a single tlbi_all command to clean up the
TLB, then free the batched IOVAs. This is marked as non-strict mode.
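
To make the deferral concrete, below is a minimal C sketch of the flush-queue
idea. It is illustrative only: the names (flush_queue, fq_queue_pfn, FQ_SIZE)
are invented, and the real iova.c code additionally uses per-CPU queues and a
timer for the time-based flush.

#define FQ_SIZE	256

struct flush_queue {
	unsigned long	deferred_pfns[FQ_SIZE];
	unsigned int	count;
	void		(*flush_all)(void *cookie);	/* one tlbi_all-style op */
	void		*cookie;
};

static void fq_flush(struct flush_queue *fq, void (*free_pfn)(unsigned long))
{
	unsigned int i;

	/* A single global invalidation covers every deferred entry... */
	fq->flush_all(fq->cookie);

	/* ...after which the batched IOVAs are safe to hand out again. */
	for (i = 0; i < fq->count; i++)
		free_pfn(fq->deferred_pfns[i]);
	fq->count = 0;
}

static void fq_queue_pfn(struct flush_queue *fq, unsigned long pfn,
			 void (*free_pfn)(unsigned long))
{
	/* Threshold reached: amortise one flush over FQ_SIZE unmaps */
	if (fq->count == FQ_SIZE)
		fq_flush(fq, free_pfn);

	fq->deferred_pfns[fq->count++] = pfn;
}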

But it must be noted that, although the mapping has already been removed from
the page table, it may still exist in the TLB, and the freed physical memory
may be reused for other purposes. An attacker can therefore keep accessing
memory through the just-freed IOVA, to obtain sensitive data or corrupt
memory. So VFIO should always choose strict mode.

Some may consider deferring the physical memory free as well, which would
still follow strict mode. But for the map_sg cases, the memory allocation is
not controlled by the IOMMU APIs, so this is not enforceable.

Fortunately, Intel and AMD have already applied non-strict mode and put the
queue_iova() operation into the common file dma-iommu.c, and my work is based
on it. The difference is that the arm-smmu-v3 driver will call the common
IOMMU APIs to unmap, whereas the Intel and AMD IOMMU drivers do not.

Below is the performance data of strict vs non-strict for an NVMe device:
Random read  IOPS: 146K (strict) vs 573K (non-strict)
Random write IOPS: 143K (strict) vs 513K (non-strict)

Zhen Lei (5):
  iommu/arm-smmu-v3: fix the implementation of flush_iotlb_all hook
  iommu/dma: add support for non-strict mode
  iommu/io-pgtable-arm: add support for non-strict mode
  iommu/arm-smmu-v3: add support for non-strict mode
  iommu/arm-smmu-v3: add bootup option "iommu.non_strict"

 Documentation/admin-guide/kernel-parameters.txt | 13 +
 drivers/iommu/arm-smmu-v3.c | 35 -
 drivers/iommu/dma-iommu.c   | 29 +++-
 drivers/iommu/io-pgtable-arm.c  | 20 +-
 drivers/iommu/io-pgtable.h  |  3 +++
 drivers/iommu/iommu.c   |  1 +
 include/linux/iommu.h   |  1 +
 7 files changed, 94 insertions(+), 8 deletions(-)

-- 
1.8.3




Re: [PATCH 0/5] Qcom smmu-500 TLB invalidation errata for sdm845

2018-08-14 Thread Vivek Gautam

Hi Will,


On 8/14/2018 5:10 PM, Will Deacon wrote:

Hi Vivek,

On Tue, Aug 14, 2018 at 04:25:23PM +0530, Vivek Gautam wrote:

Qcom's implementation of arm,mmu-500 on sdm845 has a functional/performance
errata [1] because of which the TCU cache look ups are stalled during
invalidation cycle. This is mitigated by serializing all the invalidation
requests coming to the smmu.

How does this implementation differ from the one supported by qcom_iommu.c?
I notice you're adding firmware hooks here, which we avoided by having the
extra driver. Please help me understand which devices exist, how they
differ, and which drivers are intended to support them!


IIRC, the qcom_iommu driver was intended to support the static context
bank - SID mapping, and is very specific to the smmu-v2 version present
on the msm8916 soc.
However, this erratum is specific to Qcom's mmu-500 implementation;
qcom_iommu will not be able to support mmu-500 configurations.
Rob Clark can add more.
Let us know what you suggest.



Also -- you didn't CC all the maintainers for the firmware bits, so adding
Andy here for that, and Rob for the previous question.


I have added Andy to the series; would you want me to add Rob H as well?

Best regards
Vivek



Thanks,

Will




Re: [PATCH v3] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-14 Thread Robin Murphy

On 14/08/18 04:21, Ganapatrao Kulkarni wrote:

As an optimisation for PCI devices, the first attempt to allocate an iova
is always made from the SAC (32-bit) address range. This leads to
unnecessary attempts when no free ranges are available. Add a fix to track
the size of the most recently failed iova allocation and allow further
attempts only if the requested size is smaller than the failed size; for
example, once a 2MB allocation from the 32-bit space has failed, any
request of 2MB or more fails fast. The tracked size is reset whenever any
replenish happens.


Reviewed-by: Robin Murphy 


Signed-off-by: Ganapatrao Kulkarni 
---
v3:
 Update with comments [3] from Robin Murphy 

[3] https://lkml.org/lkml/2018/8/13/116

v2: update with comments [2] from Robin Murphy 

[2] https://lkml.org/lkml/2018/8/7/166

v1: Based on comments from Robin Murphy 
for patch [1]

[1] https://lkml.org/lkml/2018/4/19/780

  drivers/iommu/iova.c | 22 +++---
  include/linux/iova.h |  1 +
  2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..28ba8b6 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
iovad->granule = granule;
iovad->start_pfn = start_pfn;
iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
+   iovad->max32_alloc_size = iovad->dma_32bit_pfn;
iovad->flush_cb = NULL;
iovad->fq = NULL;
iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
@@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
  
  	cached_iova = rb_entry(iovad->cached32_node, struct iova, node);

if (free->pfn_hi < iovad->dma_32bit_pfn &&
-   free->pfn_lo >= cached_iova->pfn_lo)
+   free->pfn_lo >= cached_iova->pfn_lo) {
		iovad->cached32_node = rb_next(&free->node);
+   iovad->max32_alloc_size = iovad->dma_32bit_pfn;
+   }
  
  	cached_iova = rb_entry(iovad->cached_node, struct iova, node);

if (free->pfn_lo >= cached_iova->pfn_lo)
@@ -190,6 +193,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
  
  	/* Walk the tree backwards */

	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+   if (limit_pfn <= iovad->dma_32bit_pfn &&
+   size >= iovad->max32_alloc_size)
+   goto iova32_full;
+
curr = __get_cached_rbnode(iovad, limit_pfn);
curr_iova = rb_entry(curr, struct iova, node);
do {
@@ -200,10 +207,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
curr_iova = rb_entry(curr, struct iova, node);
} while (curr && new_pfn <= curr_iova->pfn_hi);
  
-	if (limit_pfn < size || new_pfn < iovad->start_pfn) {

-   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-   return -ENOMEM;
-   }
+   if (limit_pfn < size || new_pfn < iovad->start_pfn)
+   goto iova32_full;
  
  	/* pfn_lo will point to size aligned address if size_aligned is set */

new->pfn_lo = new_pfn;
@@ -214,9 +219,12 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
__cached_rbnode_insert_update(iovad, new);
  
	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);

-
-
return 0;
+
+iova32_full:
+   iovad->max32_alloc_size = size;
+   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+   return -ENOMEM;
  }
  
  static struct kmem_cache *iova_cache;

diff --git a/include/linux/iova.h b/include/linux/iova.h
index 928442d..a930411 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -75,6 +75,7 @@ struct iova_domain {
unsigned long   granule;/* pfn granularity for this domain */
unsigned long   start_pfn;  /* Lower limit for this domain */
unsigned long   dma_32bit_pfn;
+   unsigned long   max32_alloc_size; /* Size of last failed allocation */
struct iova anchor; /* rbtree lookup anchor */
	struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE];  /* IOVA range caches */
  




Re: [PATCH 0/3] iommu: Avoid DMA ops domain refcount contention

2018-08-14 Thread John Garry

On 14/08/2018 14:04, Robin Murphy wrote:

John raised the issue[1] that we have some unnecessary refcount contention
in the DMA ops path which shows scalability problems now that we have more
real high-performance hardware using iommu-dma. The x86 IOMMU drivers are
sidestepping this by stashing domain references in archdata, but since
that's not very nice for architecture-agnostic code, I think it's time to
look at a generic API-level solution.

These are a couple of quick patches based on the idea I had back when
first implementing this lot, but didn't have any way to justify at the
time. The third patch can be ignored for the sake of API discussion, but
is included for completeness.

Robin.


[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-August/029303.html

Robin Murphy (3):
  iommu: Add fast hook for getting DMA domains
  iommu/dma: Use fast DMA domain lookup
  arm64/dma-mapping: Mildly optimise non-coherent IOMMU ops

 arch/arm64/mm/dma-mapping.c | 10 +-
 drivers/iommu/dma-iommu.c   | 23 ---
 drivers/iommu/iommu.c   |  9 +
 include/linux/iommu.h   |  1 +
 4 files changed, 27 insertions(+), 16 deletions(-)



Thanks Robin. So we'll test this now. I anticipate a decent improvement 
across the board since we are now simply dereferencing the device's 
iommu group directly.


Cheers,
John



Re: IOAT DMA w/IOMMU

2018-08-14 Thread Kit Chow



On 08/13/2018 04:50 PM, Logan Gunthorpe wrote:


On 13/08/18 05:48 PM, Kit Chow wrote:

On 08/13/2018 04:39 PM, Logan Gunthorpe wrote:

On 13/08/18 05:30 PM, Kit Chow wrote:

In arch/x86/include/asm/page.h, there is the following comment in
regards to validating the virtual address.

/*
   * virt_to_page(kaddr) returns a valid pointer if and only if
   * virt_addr_valid(kaddr) returns true.
   */
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)

So it looks like the validation by virt_addr_valid was somehow dropped
from the virt_to_page code path. Does anyone have any ideas what
happened to it?

I don't think it was ever validated (though I haven't been around long
enough to say for certain). What the comment is saying is that you
shouldn't rely on virt_to_page() unless you know virt_addr_valid() is
true (which in most cases you can without the extra check). virt_to_page
is meant to be really fast so adding an extra validation would probably
be a significant performance regression for the entire kernel.

The fact that this can happen through dma_map_single() is non-ideal at
best. It assumes the caller is mapping regular memory and doesn't check
this at all. It may make sense to fix that but I think people expect
dma_map_single() to be as fast as possible as well...


Perhaps include the validation with some debug turned on?

The problem is how often do you develop code with any of the debug
config options turned on?

There's already a couple of BUG_ONs in dma_map_single so maybe another
one with virt_addr_valid wouldn't be so bad.
Had my very first Linux crash on the dma direction BUG_ON when I tried 
DMA_NONE :).
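
For illustration, the debug-gated check suggested above might look like the
following sketch; dma_check_single is an invented name and this is not
actual dma-debug code:

#ifdef CONFIG_DMA_API_DEBUG
static inline void dma_check_single(const void *vaddr)
{
	/*
	 * dma_map_single() only supports lowmem, so the virtual address
	 * must translate to a valid struct page.
	 */
	BUG_ON(!virt_addr_valid(vaddr));
}
#else
static inline void dma_check_single(const void *vaddr) { }
#endif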



Re: IOAT DMA w/IOMMU

2018-08-14 Thread Robin Murphy

On 14/08/18 00:50, Logan Gunthorpe wrote:

On 13/08/18 05:48 PM, Kit Chow wrote:

On 08/13/2018 04:39 PM, Logan Gunthorpe wrote:


On 13/08/18 05:30 PM, Kit Chow wrote:

In arch/x86/include/asm/page.h, there is the following comment in
regards to validating the virtual address.

/*
   * virt_to_page(kaddr) returns a valid pointer if and only if
   * virt_addr_valid(kaddr) returns true.
   */
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)

So it looks like the validation by virt_addr_valid was somehow dropped
from the virt_to_page code path. Does anyone have any ideas what
happened to it?

I don't think it was ever validated (though I haven't been around long
enough to say for certain). What the comment is saying is that you
shouldn't rely on virt_to_page() unless you know virt_addr_valid() is
true (which in most cases you can without the extra check). virt_to_page
is meant to be really fast so adding an extra validation would probably
be a significant performance regression for the entire kernel.

The fact that this can happen through dma_map_single() is non-ideal at
best. It assumes the caller is mapping regular memory and doesn't check
this at all. It may make sense to fix that but I think people expect
dma_map_single() to be as fast as possible as well...


dma_map_single() is already documented as only supporting lowmem (for 
which virt_to_page() can be assumed to be valid). You might get away 
with feeding it bogus addresses on x86, but on non-coherent 
architectures which convert the page back to a virtual address to 
perform cache maintenance you can expect that to crash and burn rapidly.


There may be some minimal-overhead sanity checking of fundamentals, but 
in general it's not really the DMA API's job to police its callers 
exhaustively; consider that the mm layer doesn't go out of its way to 
stop you from doing things like "kfree(kfree);" either.





Perhaps include the validation with some debug turned on?


The problem is how often do you develop code with any of the debug
config options turned on?

There's already a couple of BUG_ONs in dma_map_single so maybe another
one with virt_addr_valid wouldn't be so bad.


Note that virt_addr_valid() may be pretty heavyweight in itself. For 
example the arm64 implementation involves memblock_search(); that really 
isn't viable in a DMA mapping fastpath.


Robin.

Re: [PATCH 4/5] iommu/arm-smmu: Make way to add Qcom's smmu-500 errata handling

2018-08-14 Thread Will Deacon
On Tue, Aug 14, 2018 at 04:25:27PM +0530, Vivek Gautam wrote:
> Cleanup to re-use some of the stuff

Maybe we should factor a few of the other bits whilst we're here.

Or just write a proper commit message ;)

Will


Re: [PATCH 0/5] Qcom smmu-500 TLB invalidation errata for sdm845

2018-08-14 Thread Will Deacon
Hi Vivek,

On Tue, Aug 14, 2018 at 04:25:23PM +0530, Vivek Gautam wrote:
> Qcom's implementation of arm,mmu-500 on sdm845 has a functional/performance
> errata [1] because of which the TCU cache look ups are stalled during
> invalidation cycle. This is mitigated by serializing all the invalidation
> requests coming to the smmu.

How does this implementation differ from the one supported by qcom_iommu.c?
I notice you're adding firmware hooks here, which we avoided by having the
extra driver. Please help me understand which devices exist, how they
differ, and which drivers are intended to support them!

Also -- you didn't CC all the maintainers for the firmware bits, so adding
Andy here for that, and Rob for the previous question.

Thanks,

Will


Re: [PATCH 4/5] iommu/arm-smmu: Make way to add Qcom's smmu-500 errata handling

2018-08-14 Thread Vivek Gautam




On 8/14/2018 5:10 PM, Will Deacon wrote:

On Tue, Aug 14, 2018 at 04:25:27PM +0530, Vivek Gautam wrote:

Cleanup to re-use some of the stuff

Maybe we should factor a few of the other bits whilst we're here.


Sure, do you want me to refactor anything besides this change?



Or just write a proper commit message ;)


My bad. I should have written a more descriptive commit message. :|
Will change this.

Best regards
Vivek



Will




[PATCH 1/3] iommu: Add fast hook for getting DMA domains

2018-08-14 Thread Robin Murphy
While iommu_get_domain_for_dev() is the robust way for arbitrary IOMMU
API callers to retrieve the domain pointer, for DMA ops domains it
doesn't scale well for large systems and multi-queue devices, since the
momentary refcount adjustment will lead to exclusive cacheline contention
when multiple CPUs are operating in parallel on different mappings for
the same device.

In the case of DMA ops domains, however, this refcounting is actually
unnecessary, since they already imply that the group exists and is
managed by platform code and IOMMU internals (by virtue of
iommu_group_get_for_dev()) such that a reference will already be held
for the lifetime of the device. Thus we can avoid the bottleneck by
providing a fast lookup specifically for the DMA code to retrieve the
default domain it already knows it has set up - a simple read-only
dereference plays much nicer with cache-coherency protocols.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/iommu.c | 9 +
 include/linux/iommu.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 63b37563db7e..63c586875df5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1379,6 +1379,15 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
 
+/*
+ * For IOMMU_DOMAIN_DMA implementations which already provide their own
+ * guarantees that the group and its default domain are valid and correct.
+ */
+struct iommu_domain *iommu_get_dma_domain(struct device *dev)
+{
+   return dev->iommu_group->default_domain;
+}
+
 /*
  * IOMMU groups are really the natrual working unit of the IOMMU, but
  * the IOMMU API works on domains and devices.  Bridge that gap by
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 19938ee6eb31..16f2172698e5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -297,6 +297,7 @@ extern int iommu_attach_device(struct iommu_domain *domain,
 extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
 extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
+extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
-- 
2.17.1.dirty



[PATCH 2/3] iommu/dma: Use fast DMA domain lookup

2018-08-14 Thread Robin Murphy
Most parts of iommu-dma already assume they are operating on a default
domain set up by iommu_dma_init_domain(), and can be converted straight
over to avoid the refcounting bottleneck. MSI page mappings may be in
an unmanaged domain with an explicit MSI-only cookie, so retain the
non-specific lookup, but that's OK since they're far from a contended
fast path either way.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/dma-iommu.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8f549bdb866d..a27eedc16b32 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -510,7 +510,7 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
 void iommu_dma_free(struct device *dev, struct page **pages, size_t size,
dma_addr_t *handle)
 {
-   __iommu_dma_unmap(iommu_get_domain_for_dev(dev), *handle, size);
+   __iommu_dma_unmap(iommu_get_dma_domain(dev), *handle, size);
__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
*handle = IOMMU_MAPPING_ERROR;
 }
@@ -537,7 +537,7 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
unsigned long attrs, int prot, dma_addr_t *handle,
void (*flush_page)(struct device *, const void *, phys_addr_t))
 {
-   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
	struct iova_domain *iovad = &cookie->iovad;
struct page **pages;
@@ -625,9 +625,8 @@ int iommu_dma_mmap(struct page **pages, size_t size, struct vm_area_struct *vma)
 }
 
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
-   size_t size, int prot)
+   size_t size, int prot, struct iommu_domain *domain)
 {
-   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
size_t iova_off = 0;
dma_addr_t iova;
@@ -651,13 +650,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
 dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size, int prot)
 {
-   return __iommu_dma_map(dev, page_to_phys(page) + offset, size, prot);
+   return __iommu_dma_map(dev, page_to_phys(page) + offset, size, prot,
+   iommu_get_dma_domain(dev));
 }
 
 void iommu_dma_unmap_page(struct device *dev, dma_addr_t handle, size_t size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   __iommu_dma_unmap(iommu_get_domain_for_dev(dev), handle, size);
+   __iommu_dma_unmap(iommu_get_dma_domain(dev), handle, size);
 }
 
 /*
@@ -745,7 +745,7 @@ static void __invalidate_sg(struct scatterlist *sg, int nents)
 int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
int nents, int prot)
 {
-   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
	struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
@@ -830,20 +830,21 @@ void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
sg = tmp;
}
end = sg_dma_address(sg) + sg_dma_len(sg);
-   __iommu_dma_unmap(iommu_get_domain_for_dev(dev), start, end - start);
+   __iommu_dma_unmap(iommu_get_dma_domain(dev), start, end - start);
 }
 
 dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
return __iommu_dma_map(dev, phys, size,
-   dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO);
+   dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
+   iommu_get_dma_domain(dev));
 }
 
 void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   __iommu_dma_unmap(iommu_get_domain_for_dev(dev), handle, size);
+   __iommu_dma_unmap(iommu_get_dma_domain(dev), handle, size);
 }
 
 int iommu_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
@@ -869,7 +870,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
if (!msi_page)
return NULL;
 
-   iova = __iommu_dma_map(dev, msi_addr, size, prot);
+   iova = __iommu_dma_map(dev, msi_addr, size, prot, domain);
if (iommu_dma_mapping_error(dev, iova))
goto out_free_page;
 
-- 
2.17.1.dirty


[PATCH 3/3] arm64/dma-mapping: Mildly optimise non-coherent IOMMU ops

2018-08-14 Thread Robin Murphy
Whilst the symmetry of deferring to the existing sync callback in
__iommu_map_page() is nice, taking a round-trip through
iommu_iova_to_phys() is a pretty heavyweight way to get an address we
can trivially compute from the page we already have. Tweaking it to just
perform the cache maintenance directly when appropriate doesn't really
make the code any more complicated, and the runtime efficiency gain can
only be a benefit.

Furthermore, the sync operations themselves know they can only be
invoked on a managed DMA ops domain, so can use the fast specific domain
lookup to avoid excessive manipulation of the group refcount
(particularly in the scatterlist cases).

Signed-off-by: Robin Murphy 
---
 arch/arm64/mm/dma-mapping.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 61e93f0b5482..5d4144093c20 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -712,7 +712,7 @@ static void __iommu_sync_single_for_cpu(struct device *dev,
if (is_device_dma_coherent(dev))
return;
 
-   phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+   phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dev_addr);
__dma_unmap_area(phys_to_virt(phys), size, dir);
 }
 
@@ -725,7 +725,7 @@ static void __iommu_sync_single_for_device(struct device *dev,
if (is_device_dma_coherent(dev))
return;
 
-   phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+   phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dev_addr);
__dma_map_area(phys_to_virt(phys), size, dir);
 }
 
@@ -738,9 +738,9 @@ static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
int prot = dma_info_to_prot(dir, coherent, attrs);
dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot);
 
-   if (!iommu_dma_mapping_error(dev, dev_addr) &&
-   (attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
-   __iommu_sync_single_for_device(dev, dev_addr, size, dir);
+   if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+   !iommu_dma_mapping_error(dev, dev_addr))
+   __dma_map_area(page_address(page), size, dir);
 
return dev_addr;
 }
-- 
2.17.1.dirty



[PATCH 0/3] iommu: Avoid DMA ops domain refcount contention

2018-08-14 Thread Robin Murphy
John raised the issue[1] that we have some unnecessary refcount contention
in the DMA ops path which shows scalability problems now that we have more
real high-performance hardware using iommu-dma. The x86 IOMMU drivers are
sidestepping this by stashing domain references in archdata, but since
that's not very nice for architecture-agnostic code, I think it's time to
look at a generic API-level solution.

These are a couple of quick patches based on the idea I had back when
first implementing this lot, but didn't have any way to justify at the
time. The third patch can be ignored for the sake of API discussion, but
is included for completeness. 

Robin.


[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-August/029303.html

Robin Murphy (3):
  iommu: Add fast hook for getting DMA domains
  iommu/dma: Use fast DMA domain lookup
  arm64/dma-mapping: Mildly optimise non-coherent IOMMU ops

 arch/arm64/mm/dma-mapping.c | 10 +-
 drivers/iommu/dma-iommu.c   | 23 ---
 drivers/iommu/iommu.c   |  9 +
 include/linux/iommu.h   |  1 +
 4 files changed, 27 insertions(+), 16 deletions(-)

-- 
2.17.1.dirty
