Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Jason Gunthorpe
On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> +static void ims_mask_irq(struct irq_data *data)
> +{
> + struct msi_desc *desc = irq_data_get_msi_desc(data);
> + struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> + u32 __iomem *ctrl = &slot->ctrl;
> +
> + iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);

Just to be clear, this is exactly the sort of operation we can't do
with non-MSI interrupts. For a real PCI device to execute this it
would have to keep the data on die.

I saw the idxd driver was doing something like this, I assume it
avoids trouble because it is a fake PCI device integrated with the
CPU, not on a real PCI bus?

It is really nice to see irq_domain used properly in x86!

Thanks,
Jason


[PATCH AUTOSEL 5.4 15/48] iommu/iova: Don't BUG on invalid PFNs

2020-08-21 Thread Sasha Levin
From: Robin Murphy 

[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]

Unlike the other instances which represent a complete loss of
consistency within the rcache mechanism itself, or a fundamental
and obvious misconfiguration by an IOMMU driver, the BUG_ON() in
iova_magazine_free_pfns() can be provoked at more or less any time
in a "spooky action-at-a-distance" manner by any old device driver
passing nonsense to dma_unmap_*() which then propagates through to
queue_iova().

Not only is this well outside the IOVA layer's control, it's also
nowhere near fatal enough to justify panicking anyway - all that
really achieves is to make debugging the offending driver more
difficult. Let's simply WARN and otherwise ignore bogus PFNs.

Reported-by: Prakash Gupta 
Signed-off-by: Robin Murphy 
Reviewed-by: Prakash Gupta 
Link: 
https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.1591103298.git.robin.mur...@arm.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 0e6a9536eca62..612cbf668adf8 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -811,7 +811,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
for (i = 0 ; i < mag->size; ++i) {
struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
 
-   BUG_ON(!iova);
+   if (WARN_ON(!iova))
+   continue;
+
private_free_iova(iovad, iova);
}
 
-- 
2.25.1



[PATCH AUTOSEL 4.14 09/30] iommu/iova: Don't BUG on invalid PFNs

2020-08-21 Thread Sasha Levin
From: Robin Murphy 

[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]

Unlike the other instances which represent a complete loss of
consistency within the rcache mechanism itself, or a fundamental
and obvious misconfiguration by an IOMMU driver, the BUG_ON() in
iova_magazine_free_pfns() can be provoked at more or less any time
in a "spooky action-at-a-distance" manner by any old device driver
passing nonsense to dma_unmap_*() which then propagates through to
queue_iova().

Not only is this well outside the IOVA layer's control, it's also
nowhere near fatal enough to justify panicking anyway - all that
really achieves is to make debugging the offending driver more
difficult. Let's simply WARN and otherwise ignore bogus PFNs.

Reported-by: Prakash Gupta 
Signed-off-by: Robin Murphy 
Reviewed-by: Prakash Gupta 
Link: 
https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.1591103298.git.robin.mur...@arm.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 4edf65dbbcab5..2c97d2552c5bd 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -845,7 +845,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
for (i = 0 ; i < mag->size; ++i) {
struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
 
-   BUG_ON(!iova);
+   if (WARN_ON(!iova))
+   continue;
+
private_free_iova(iovad, iova);
}
 
-- 
2.25.1



[PATCH AUTOSEL 5.8 19/62] iommu/iova: Don't BUG on invalid PFNs

2020-08-21 Thread Sasha Levin
From: Robin Murphy 

[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]

Unlike the other instances which represent a complete loss of
consistency within the rcache mechanism itself, or a fundamental
and obvious misconfiguration by an IOMMU driver, the BUG_ON() in
iova_magazine_free_pfns() can be provoked at more or less any time
in a "spooky action-at-a-distance" manner by any old device driver
passing nonsense to dma_unmap_*() which then propagates through to
queue_iova().

Not only is this well outside the IOVA layer's control, it's also
nowhere near fatal enough to justify panicking anyway - all that
really achieves is to make debugging the offending driver more
difficult. Let's simply WARN and otherwise ignore bogus PFNs.

Reported-by: Prakash Gupta 
Signed-off-by: Robin Murphy 
Reviewed-by: Prakash Gupta 
Link: 
https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.1591103298.git.robin.mur...@arm.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 49fc01f2a28d4..45a251da54537 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -811,7 +811,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
for (i = 0 ; i < mag->size; ++i) {
struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
 
-   BUG_ON(!iova);
+   if (WARN_ON(!iova))
+   continue;
+
private_free_iova(iovad, iova);
}
 
-- 
2.25.1



[PATCH AUTOSEL 4.9 06/26] iommu/iova: Don't BUG on invalid PFNs

2020-08-21 Thread Sasha Levin
From: Robin Murphy 

[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]

Unlike the other instances which represent a complete loss of
consistency within the rcache mechanism itself, or a fundamental
and obvious misconfiguration by an IOMMU driver, the BUG_ON() in
iova_magazine_free_pfns() can be provoked at more or less any time
in a "spooky action-at-a-distance" manner by any old device driver
passing nonsense to dma_unmap_*() which then propagates through to
queue_iova().

Not only is this well outside the IOVA layer's control, it's also
nowhere near fatal enough to justify panicking anyway - all that
really achieves is to make debugging the offending driver more
difficult. Let's simply WARN and otherwise ignore bogus PFNs.

Reported-by: Prakash Gupta 
Signed-off-by: Robin Murphy 
Reviewed-by: Prakash Gupta 
Link: 
https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.1591103298.git.robin.mur...@arm.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f106fd9782bfb..99c36a5438a75 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -676,7 +676,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
for (i = 0 ; i < mag->size; ++i) {
struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
 
-   BUG_ON(!iova);
+   if (WARN_ON(!iova))
+   continue;
+
private_free_iova(iovad, iova);
}
 
-- 
2.25.1



Re: [PATCH 10/20] dt-bindings: arm-smmu: Add compatible string for Adreno GPU SMMU

2020-08-21 Thread Jordan Crouse
On Wed, Aug 19, 2020 at 10:36:38AM -0700, Rob Clark wrote:
> On Wed, Aug 19, 2020 at 10:03 AM Doug Anderson  wrote:
> >
> > Hi,
> >
> > On Mon, Aug 17, 2020 at 3:03 PM Rob Clark  wrote:
> > >
> > > From: Jordan Crouse 
> > >
> > > Every Qcom Adreno GPU has an embedded SMMU for its own use. These
> > > devices depend on unique features such as split pagetables,
> > > different stall/halt requirements and other settings. Identify them
> > > with a compatible string so that they can be identified in the
> > > arm-smmu implementation specific code.
> > >
> > > Signed-off-by: Jordan Crouse 
> > > Reviewed-by: Rob Herring 
> > > Signed-off-by: Rob Clark 
> > > ---
> > >  Documentation/devicetree/bindings/iommu/arm,smmu.yaml | 4 
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml 
> > > b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
> > > index 503160a7b9a0..5ec5d0d691f6 100644
> > > --- a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
> > > +++ b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
> > > @@ -40,6 +40,10 @@ properties:
> > >- qcom,sm8150-smmu-500
> > >- qcom,sm8250-smmu-500
> > >- const: arm,mmu-500
> > > +  - description: Qcom Adreno GPUs implementing "arm,smmu-v2"
> > > +items:
> > > +  - const: qcom,adreno-smmu
> > > +  - const: qcom,smmu-v2
> >
> > I know I'm kinda late to the game, but this seems weird to me,
> > especially given the later patches in the series like:
> >
> > https://lore.kernel.org/r/20200817220238.603465-19-robdcl...@gmail.com
> >
> > Specifically in that patch you can see that this IOMMU already had a
> > compatible string and we're changing it and throwing away the
> > model-specific string?  I'm guessing that you're just trying to make
> > it easier for code to identify the adreno iommu, but it seems like a
> > better way would have been to just add the adreno compatible in the
> > middle, like:
> >
> >   - description: Qcom Adreno GPUs implementing "arm,smmu-v2"
> > items:
> >   - enum:
> >   - qcom,msm8996-smmu-v2
> >   - qcom,msm8998-smmu-v2
> >   - qcom,sc7180-smmu-v2
> >   - qcom,sdm845-smmu-v2
> > - const: qcom,adreno-smmu
> > - const: qcom,smmu-v2
> >
> > Then we still have the SoC-specific compatible string in case we need
> > it but we also have the generic one?  It also means that we're not
> > deleting the old compatible string...
> 
> I did bring up the thing about removing the compat string in an
> earlier revision of the series.. but then we realized that
> qcom,sc7180-smmu-v2 was never actually used anywhere.
> 
> But I guess we could:  compatible = "qcom,sc7180-smmu-v2",
> "qcom,adreno-smmu", "qcom,smmu-v2";

I think the SoC specific string is intended for the "other" SMMU that everybody
else uses. Rarely would a workaround for that SMMU affect the GPU and vice
versa. Since these are the bindings it doesn't hurt to allow for the possibility
but I would be surprised if the occasion presented itself.

Jordan

> BR,
> -R
> 
> 
> 
> 
> >
> > -Doug
> >
> >
> > >- description: Marvell SoCs implementing "arm,mmu-500"
> > >  items:
> > >- const: marvell,ap806-smmu-500
> > > --
> > > 2.26.2
> > >

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH AUTOSEL 5.7 19/61] iommu/iova: Don't BUG on invalid PFNs

2020-08-21 Thread Sasha Levin
From: Robin Murphy 

[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]

Unlike the other instances which represent a complete loss of
consistency within the rcache mechanism itself, or a fundamental
and obvious misconfiguration by an IOMMU driver, the BUG_ON() in
iova_magazine_free_pfns() can be provoked at more or less any time
in a "spooky action-at-a-distance" manner by any old device driver
passing nonsense to dma_unmap_*() which then propagates through to
queue_iova().

Not only is this well outside the IOVA layer's control, it's also
nowhere near fatal enough to justify panicking anyway - all that
really achieves is to make debugging the offending driver more
difficult. Let's simply WARN and otherwise ignore bogus PFNs.

Reported-by: Prakash Gupta 
Signed-off-by: Robin Murphy 
Reviewed-by: Prakash Gupta 
Link: 
https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.1591103298.git.robin.mur...@arm.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 0e6a9536eca62..612cbf668adf8 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -811,7 +811,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
for (i = 0 ; i < mag->size; ++i) {
struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
 
-   BUG_ON(!iova);
+   if (WARN_ON(!iova))
+   continue;
+
private_free_iova(iovad, iova);
}
 
-- 
2.25.1



Re: [PATCH] MAINTAINERS: update QUALCOMM IOMMU after Arm SSMU drivers move

2020-08-21 Thread Will Deacon
On Sun, Aug 02, 2020 at 08:53:20AM +0200, Lukas Bulwahn wrote:
> Commit e86d1aa8b60f ("iommu/arm-smmu: Move Arm SMMU drivers into their own
> subdirectory") moved drivers/iommu/qcom_iommu.c to
> drivers/iommu/arm/arm-smmu/qcom_iommu.c amongst other moves, adjusted some
> sections in MAINTAINERS, but missed adjusting the QUALCOMM IOMMU section.
> 
> Hence, ./scripts/get_maintainer.pl --self-test=patterns complains:
> 
>   warning: no file matches    F:    drivers/iommu/qcom_iommu.c
> 
> Update the file entry in MAINTAINERS to the new location.
> 
> Signed-off-by: Lukas Bulwahn 
> ---
> Will, please ack.

Typo in subject: s/SSMU/SMMU/

With that:

Acked-by: Will Deacon 

> Joerg, please pick this minor non-urgent patch for your -next branch.

Joerg -- can you queue this as a fix for 5.9-rc, please?

Thanks,

Will


[PATCH AUTOSEL 4.19 11/38] iommu/iova: Don't BUG on invalid PFNs

2020-08-21 Thread Sasha Levin
From: Robin Murphy 

[ Upstream commit d3e3d2be688b4b5864538de61e750721a311e4fc ]

Unlike the other instances which represent a complete loss of
consistency within the rcache mechanism itself, or a fundamental
and obvious misconfiguration by an IOMMU driver, the BUG_ON() in
iova_magazine_free_pfns() can be provoked at more or less any time
in a "spooky action-at-a-distance" manner by any old device driver
passing nonsense to dma_unmap_*() which then propagates through to
queue_iova().

Not only is this well outside the IOVA layer's control, it's also
nowhere near fatal enough to justify panicking anyway - all that
really achieves is to make debugging the offending driver more
difficult. Let's simply WARN and otherwise ignore bogus PFNs.

Reported-by: Prakash Gupta 
Signed-off-by: Robin Murphy 
Reviewed-by: Prakash Gupta 
Link: 
https://lore.kernel.org/r/acbd2d092b42738a03a21b417ce64e27f8c91c86.1591103298.git.robin.mur...@arm.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/iova.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 34c058c24b9d2..ce5cd05253db9 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -814,7 +814,9 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct 
iova_domain *iovad)
for (i = 0 ; i < mag->size; ++i) {
struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
 
-   BUG_ON(!iova);
+   if (WARN_ON(!iova))
+   continue;
+
private_free_iova(iovad, iova);
}
 
-- 
2.25.1



Re: [Regression] "iommu/amd: Relax locking in dma_ops path" makes tg3 ethernet transmit queue timeout

2020-08-21 Thread Joerg Roedel
On Fri, Aug 21, 2020 at 09:50:33PM +0800, Kai-Heng Feng wrote:
> Of course, I still have the system at my side.
> 
> The offending commit is 92d420ec028d ("iommu/amd: Relax locking in
> dma_ops path"), however be62dbf554c5 ("iommu/amd: Convert AMD iommu
> driver to the dma-iommu api") removed .map_page entirely so I don't
> know where to start.

I guess you don't see any AMD-Vi page-faults reported in dmesg, right?
That makes things more difficult. As a first step, can you please send
me a complete dmesg after this happened? Also please boot with
amd_iommu_dump on the kernel command line.

Regards,

Joerg


Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Randy Dunlap
On 8/21/20 4:33 AM, Barry Song wrote:
> ---
>  -v7: with respect to Will's comments
>  * move to use for_each_online_node
>  * add description if users don't specify pernuma_cma
>  * provide default value for CONFIG_DMA_PERNUMA_CMA
> 
>  .../admin-guide/kernel-parameters.txt |  11 ++
>  include/linux/dma-contiguous.h|   6 ++
>  kernel/dma/Kconfig|  11 ++
>  kernel/dma/contiguous.c   | 100 --
>  4 files changed, 118 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index bdc1f33fd3d1..c609527fc35a 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -599,6 +599,17 @@
>   altogether. For more information, see
>   include/linux/dma-contiguous.h
>  
> + pernuma_cma=nn[MG]
> + [ARM64,KNL]
> + Sets the size of kernel per-numa memory area for
> + contiguous memory allocations. A value of 0 disables
> + per-numa CMA altogether. If this option is not
> + specified, the default value is 0.
> + With per-numa CMA enabled, DMA users on node nid will
> + first try to allocate buffer from the pernuma area
> + which is located in node nid, if the allocation fails,
> + they will fallback to the global default memory area.
> +

Entries in kernel-parameters.txt are supposed to be in alphabetical order
but this one is not.  If you want to keep it near the cma= entry, you can
rename it like Mike suggested.  Otherwise it needs to be moved.


>   cmo_free_hint=  [PPC] Format: { yes | no }
>   Specify whether pages are marked as being inactive
>   when they are freed.  This is used in CMO environments



-- 
~Randy



Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Mike Rapoport
On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
> coherent DMA buffers to save their command queues and page tables. As
> there is only one default CMA in the whole system, SMMUs on nodes other
> than node0 will get remote memory. This leads to significant latency.
> 
> This patch provides per-numa CMA so that drivers like SMMU can get local
> memory. Tests show localizing CMA can decrease dma_unmap latency much.
> For instance, before this patch, SMMU on node2  has to wait for more than
> 560ns for the completion of CMD_SYNC in an empty command queue; with this
> patch, it needs 240ns only.
> 
> A positive side effect of this patch would be improving performance even
> further for those users who are worried about performance more than DMA
> security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
> drivers can get local coherent DMA buffers.
> 
> Cc: Jonathan Cameron 
> Cc: Christoph Hellwig 
> Cc: Marek Szyprowski 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Cc: Ganapatrao Kulkarni 
> Cc: Catalin Marinas 
> Cc: Nicolas Saenz Julienne 
> Cc: Steve Capper 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Signed-off-by: Barry Song 
> ---
>  -v7: with respect to Will's comments
>  * move to use for_each_online_node
>  * add description if users don't specify pernuma_cma
>  * provide default value for CONFIG_DMA_PERNUMA_CMA
> 
>  .../admin-guide/kernel-parameters.txt |  11 ++
>  include/linux/dma-contiguous.h|   6 ++
>  kernel/dma/Kconfig|  11 ++
>  kernel/dma/contiguous.c   | 100 --
>  4 files changed, 118 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index bdc1f33fd3d1..c609527fc35a 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -599,6 +599,17 @@
>   altogether. For more information, see
>   include/linux/dma-contiguous.h
>  
> + pernuma_cma=nn[MG]

Maybe cma_pernuma or cma_pernode?

> + [ARM64,KNL]
> + Sets the size of kernel per-numa memory area for
> + contiguous memory allocations. A value of 0 disables
> + per-numa CMA altogether. If this option is not
> + specified, the default value is 0.
> + With per-numa CMA enabled, DMA users on node nid will
> + first try to allocate buffer from the pernuma area
> + which is located in node nid, if the allocation fails,
> + they will fallback to the global default memory area.
> +
>   cmo_free_hint=  [PPC] Format: { yes | no }
>   Specify whether pages are marked as being inactive
>   when they are freed.  This is used in CMO environments
> diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
> index 03f8e98e3bcc..fe55e004f1f4 100644
> --- a/include/linux/dma-contiguous.h
> +++ b/include/linux/dma-contiguous.h
> @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device 
> *dev, struct page *page,
>  
>  #endif
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +void dma_pernuma_cma_reserve(void);
> +#else
> +static inline void dma_pernuma_cma_reserve(void) { }
> +#endif
> +
>  #endif
>  
>  #endif
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 847a9d1fa634..c38979d45b13 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -118,6 +118,17 @@ config DMA_CMA
> If unsure, say "n".
>  
>  if  DMA_CMA
> +
> +config DMA_PERNUMA_CMA
> + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
> + default NUMA && ARM64
> + help
> +   Enable this option to get pernuma CMA areas so that devices like
> +   ARM64 SMMU can get local memory by DMA coherent APIs.
> +
> +   You can set the size of pernuma CMA by specifying "pernuma_cma=size"
> +   on the kernel's command line.
> +
>  comment "Default contiguous memory area size:"
>  
>  config CMA_SIZE_MBYTES
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index cff7e60968b9..0383c9b86715 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
>  }
>  early_param("cma", early_cma);
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +
> +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> +static phys_addr_t pernuma_size_bytes __initdata;
> +
> +static int __init early_pernuma_cma(char *p)
> +{
> + pernuma_size_bytes = memparse(p, &p);
> + return 0;
> +}
> +early_param("pernuma_cma", early_pernuma_cma);
> +#endif
> +
>  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
>  
>  static phys_addr_t __init __maybe_unused 

Re: [Regression] "iommu/amd: Relax locking in dma_ops path" makes tg3 ethernet transmit queue timeout

2020-08-21 Thread Joerg Roedel
Hi Kai,

On Mon, Jun 29, 2020 at 08:33:22PM +0800, Kai-Heng Feng wrote:
> I am still seeing the issue on v5.8-rc3. The issue goes away as soon
> as "iommu=off" is added.

Can you probably help with debugging this issue? I've had no luck so far
getting my hands on a machine with tg3 hardware and an AMD IOMMU.

Regards,

Joerg


[PATCH v2 2/2] iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

2020-08-21 Thread John Garry
It has been shown that the cmpxchg() for finding space in the cmdq can
be a bottleneck:
- for more CPUs contending the cmdq, the cmpxchg() will fail more often
- since the software-maintained cons pointer is updated on the same 64b
  memory region, the chance of cmpxchg() failure increases again

The cmpxchg() is removed as part of 2 related changes:

- Update prod and cmdq owner in a single atomic64 add operation.

  The prod value is updated in one 32b region, while the "owner" is updated
  in another 32b region.

  For the "owner", we now count this value, instead of setting a flag.
  Similar to before, once the owner has finished gathering, it will clear
  a mask. As such, a CPU declares itself as the "owner" when it reads zero
  for this region.

  The owner is now responsible for all cmdq locking to avoid possible
  deadlock. The owner will lock the cmdq for all non-owners it has gathered
  when they have space in the queue and have written their entries.

- Check for space in the cmdq after the prod pointer has been assigned.

  We don't bother checking for space in the cmdq before assigning the prod
  pointer, as this would be super racy.

  So since the prod pointer is updated unconditionally, it would be common
  for no space to be available in the cmdq when prod is assigned - that
  is, according to the software-maintained prod and cons pointers. So now
  it must be ensured that the entries are not written until there is space.

  How the prod pointer is maintained also leads to a strange condition
  where the prod pointer can wrap past the cons pointer. We can detect this
  condition, and report no space here. However, a prod pointer progressed
  twice past the cons pointer cannot be detected. But it can be ensured
  that this scenario does not occur, as we limit the number of commands
  any CPU can issue at any given time, such that the prod pointer cannot
  be progressed that far.
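
(For illustration only, here is a minimal user-space sketch of the idea using
C11 atomics rather than the kernel's atomic64 API; the field layout and the
names cmdq_reserve/llq_val are simplified assumptions, not the exact layout
used by this patch.)

#include <stdatomic.h>
#include <stdint.h>

/* Sketch: prod lives in the low 32 bits, the owner count in the high 32. */
#define PROD_MASK   0xffffffffull
#define OWNER_UNIT  (1ull << 32)

static _Atomic uint64_t llq_val;

/* Reserve n slots and join the owner arbitration in a single atomic add.
 * (Queue wrap handling is ignored in this sketch.) */
static uint64_t cmdq_reserve(uint32_t n, int *is_owner)
{
	uint64_t old = atomic_fetch_add(&llq_val, (uint64_t)n | OWNER_UNIT);

	/* The CPU that observed an owner count of zero owns this gathering. */
	*is_owner = (old >> 32) == 0;

	/* Private snapshot of prod before our increment was applied. */
	return old & PROD_MASK;
}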

Signed-off-by: John Garry 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 120 
 1 file changed, 72 insertions(+), 48 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a705fa3e18ea..d3c1400c644c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -521,16 +521,17 @@ struct arm_smmu_cmdq_ent {
 struct arm_smmu_ll_queue {
union {
u64 val;
+   atomic64_t  atomic;
struct {
+   struct {
+   u16 sync;
+   u16 count;
+   };
u32 prod;
-   u32 cons;
};
-   struct {
-   atomic_t    prod;
-   atomic_t    cons;
-   } atomic;
u8  __pad[SMP_CACHE_BYTES];
} cacheline_aligned_in_smp;
+   u32 cons;
u32 max_n_shift;
u32 max_cmd_per_batch;
 };
@@ -771,10 +772,19 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, 
u32 n)
prod = Q_IDX(q, q->prod);
cons = Q_IDX(q, q->cons);
 
-   if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons))
+   if (Q_WRP(q, q->prod) == Q_WRP(q, q->cons)) {
+   /* check if we have wrapped, meaning definitely no space */
+   if (cons > prod)
+   return false;
+
space = (1 << q->max_n_shift) - (prod - cons);
-   else
+   } else {
+   /* similar check to above */
+   if (prod > cons)
+   return false;
+
space = cons - prod;
+   }
 
return space >= n;
 }
@@ -1072,7 +1082,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device 
*smmu)
  *   fails if the caller appears to be the last lock holder (yes, this is
  *   racy). All successful UNLOCK routines have RELEASE semantics.
  */
-static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
+static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq, int count)
 {
int val;
 
@@ -1082,12 +1092,12 @@ static void arm_smmu_cmdq_shared_lock(struct 
arm_smmu_cmdq *cmdq)
 * to INT_MIN so these increments won't hurt as the value will remain
 * negative.
 */
-   if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
+   if (atomic_fetch_add_relaxed(count, &cmdq->lock) >= 0)
return;
 
do {
        val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-   } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+   } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + count) != val);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -1236,13 +1246,14 @@ static int 

[PATCH v2 1/2] iommu/arm-smmu-v3: Calculate max commands per batch

2020-08-21 Thread John Garry
Calculate the batch size limit such that all CPUs in the system cannot
issue so many commands as to fill the command queue.
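
As a purely hypothetical example (numbers not taken from the patch): with a
4096-entry command queue and 128 possible CPUs, each batch would be limited
to 4096 / 128 - 1 = 31 commands, the "- 1" leaving room for the CMD_SYNC that
terminates each batch, with the result further capped at CMDQ_BATCH_ENTRIES.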

Signed-off-by: John Garry 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 46 +++--
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207be7ea..a705fa3e18ea 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -532,6 +532,7 @@ struct arm_smmu_ll_queue {
u8  __pad[SMP_CACHE_BYTES];
} cacheline_aligned_in_smp;
u32 max_n_shift;
+   u32 max_cmd_per_batch;
 };
 
 struct arm_smmu_queue {
@@ -1515,7 +1516,10 @@ static void arm_smmu_cmdq_batch_add(struct 
arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmdq_ent *cmd)
 {
-   if (cmds->num == CMDQ_BATCH_ENTRIES) {
+   struct arm_smmu_cmdq *q = &smmu->cmdq;
+   struct arm_smmu_ll_queue *llq = &q->q.llq;
+
+   if (cmds->num == llq->max_cmd_per_batch) {
arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);
cmds->num = 0;
}
@@ -3177,6 +3181,41 @@ static int arm_smmu_init_one_queue(struct 
arm_smmu_device *smmu,
return 0;
 }
 
+static int arm_smmu_init_cmd_queue(struct arm_smmu_device *smmu,
+  struct arm_smmu_queue *q,
+  unsigned long prod_off,
+  unsigned long cons_off,
+  size_t dwords)
+{
+   u32 cpus = num_possible_cpus(), entries_for_prod;
+   int ret;
+
+   ret = arm_smmu_init_one_queue(smmu, q, prod_off, cons_off, dwords,
+ "cmdq");
+   if (ret)
+   return ret;
+
+   entries_for_prod = 1 << q->llq.max_n_shift;
+
+   /*
+* We need at least 2 commands in a batch (1 x CMD_SYNC and 1 x
+* whatever else).
+*/
+   if (entries_for_prod < 2 * cpus) {
+   dev_err(smmu->dev, "command queue size too small, suggest reduce #CPUs\n");
+   return -ENXIO;
+   }
+
+   /*
+* When finding max_cmd_per_batch, deduct 1 entry per batch to take
+* account of a CMD_SYNC being issued also.
+*/
+   q->llq.max_cmd_per_batch = min((entries_for_prod / cpus) - 1,
+  (u32)CMDQ_BATCH_ENTRIES);
+
+   return 0;
+}
+
 static void arm_smmu_cmdq_free_bitmap(void *data)
 {
unsigned long *bitmap = data;
@@ -3210,9 +3249,8 @@ static int arm_smmu_init_queues(struct arm_smmu_device 
*smmu)
int ret;
 
/* cmdq */
-   ret = arm_smmu_init_one_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
- ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS,
- "cmdq");
+   ret = arm_smmu_init_cmd_queue(smmu, &smmu->cmdq.q, ARM_SMMU_CMDQ_PROD,
+ ARM_SMMU_CMDQ_CONS, CMDQ_ENT_DWORDS);
if (ret)
return ret;
 
-- 
2.26.2



[PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency

2020-08-21 Thread John Garry
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
get space on the queue takes a lot of time once we start getting many
CPUs contending - from experiment, for 64 CPUs contending the cmdq,
success rate is ~ 1 in 12, which is poor, but not totally awful.

This series removes that cmpxchg() and replaces with an atomic_add,
same as how the actual cmdq deals with maintaining the prod pointer.
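
Purely as an illustration of the difference (a stand-alone C11 user-space
sketch with made-up function names, not the driver code, and ignoring queue
wrap and space checking):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t prod;

/* Retry-based reservation: every CPU that loses the race must re-read and
 * try again, so the failure rate climbs with the number of contenders. */
static uint32_t claim_cmpxchg(uint32_t n)
{
	uint32_t old = atomic_load(&prod);

	while (!atomic_compare_exchange_weak(&prod, &old, old + n))
		;	/* old is reloaded on failure; spin and retry */
	return old;
}

/* Unconditional reservation: a single atomic add always succeeds, no matter
 * how many CPUs are issuing commands at the same time. */
static uint32_t claim_add(uint32_t n)
{
	return atomic_fetch_add(&prod, n);
}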

For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
increase:
Before: 1250K IOPs
After: 1550K IOPs

I also have a test harness to check the rate of DMA map+unmaps we can
achieve:

CPU count   8       16      32      64
Before:     282K    115K    36K     11K
After:      302K    193K    80K     30K

(unit is map+unmaps per CPU per second)

[0] 
https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

Differences to v1:
- Simplify by dropping patch to always issue a CMD_SYNC
- Use 64b atomic add, keeping prod in a separate 32b field

John Garry (2):
  iommu/arm-smmu-v3: Calculate max commands per batch
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 ++--
 1 file changed, 114 insertions(+), 52 deletions(-)

-- 
2.26.2



Re: [PATCH v6 08/15] iommu: Pass domain to sva_unbind_gpasid()

2020-08-21 Thread Auger Eric
Hi,

On 8/20/20 11:06 PM, Alex Williamson wrote:
> On Mon, 27 Jul 2020 23:27:37 -0700
> Liu Yi L  wrote:
> 
>> From: Yi Sun 
>>
>> Current interface is good enough for SVA virtualization on an assigned
>> physical PCI device, but when it comes to mediated devices, a physical
>> device may attached with multiple aux-domains. Also, for guest unbind,
> 
> s/may/may be/
> 
>> the PASID to be unbind should be allocated to the VM. This check requires
>> to know the ioasid_set which is associated with the domain.
>>
>> So this interface needs to pass in domain info. Then the iommu driver is
>> able to know which domain will be used for the 2nd stage translation of
>> the nesting mode and also be able to do PASID ownership check. This patch
>> passes @domain per the above reason. Also, the prototype of  is
>> changed frnt" to "u32" as the below link.
> 
> s/frnt"/from an "int"/
>  
>> https://lore.kernel.org/kvm/27ac7880-bdd3-2891-139e-b4a7cd184...@redhat.com/
> 
> This is really confusing, the link is to Eric's comment asking that the
> conversion from (at the time) int to ioasid_t be included in the commit
> log.  The text here implies that it's pointing to some sort of
> justification for the change, which it isn't.  It just notes that it
> happened, not why it happened, with a mostly irrelevant link.
> 
>> Cc: Kevin Tian 
>> CC: Jacob Pan 
>> Cc: Alex Williamson 
>> Cc: Eric Auger 
>> Cc: Jean-Philippe Brucker 
>> Cc: Joerg Roedel 
>> Cc: Lu Baolu 
>> Reviewed-by: Eric Auger 
>> Signed-off-by: Yi Sun 
>> Signed-off-by: Liu Yi L 
>> ---
>> v5 -> v6:
>> *) use "u32" prototype for @pasid.
>> *) add review-by from Eric Auger.
> 
> I'd probably hold off on adding Eric's R-b given the additional change
> in this version FWIW.  Thanks,

Yep I did not notice that change given the R-b was applied ;-)

Thanks

Eric
> 
> Alex
>  
>> v2 -> v3:
>> *) pass in domain info only
>> *) use u32 for pasid instead of int type
>>
>> v1 -> v2:
>> *) added in v2.
>> ---
>>  drivers/iommu/intel/svm.c   | 3 ++-
>>  drivers/iommu/iommu.c   | 2 +-
>>  include/linux/intel-iommu.h | 3 ++-
>>  include/linux/iommu.h   | 3 ++-
>>  4 files changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
>> index c27d16a..c85b8d5 100644
>> --- a/drivers/iommu/intel/svm.c
>> +++ b/drivers/iommu/intel/svm.c
>> @@ -436,7 +436,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
>> struct device *dev,
>>  return ret;
>>  }
>>  
>> -int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>> +int intel_svm_unbind_gpasid(struct iommu_domain *domain,
>> +struct device *dev, u32 pasid)
>>  {
>>  struct intel_iommu *iommu = intel_svm_device_to_iommu(dev);
>>  struct intel_svm_dev *sdev;
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 1ce2a61..bee79d7 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -2145,7 +2145,7 @@ int iommu_sva_unbind_gpasid(struct iommu_domain 
>> *domain, struct device *dev,
>>  if (unlikely(!domain->ops->sva_unbind_gpasid))
>>  return -ENODEV;
>>  
>> -return domain->ops->sva_unbind_gpasid(dev, data->hpasid);
>> +return domain->ops->sva_unbind_gpasid(domain, dev, data->hpasid);
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
>>  
>> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
>> index 0d0ab32..f98146b 100644
>> --- a/include/linux/intel-iommu.h
>> +++ b/include/linux/intel-iommu.h
>> @@ -738,7 +738,8 @@ extern int intel_svm_enable_prq(struct intel_iommu 
>> *iommu);
>>  extern int intel_svm_finish_prq(struct intel_iommu *iommu);
>>  int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
>>struct iommu_gpasid_bind_data *data);
>> -int intel_svm_unbind_gpasid(struct device *dev, int pasid);
>> +int intel_svm_unbind_gpasid(struct iommu_domain *domain,
>> +struct device *dev, u32 pasid);
>>  struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
>>   void *drvdata);
>>  void intel_svm_unbind(struct iommu_sva *handle);
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index b1ff702..80467fc 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -303,7 +303,8 @@ struct iommu_ops {
>>  int (*sva_bind_gpasid)(struct iommu_domain *domain,
>>  struct device *dev, struct iommu_gpasid_bind_data 
>> *data);
>>  
>> -int (*sva_unbind_gpasid)(struct device *dev, int pasid);
>> +int (*sva_unbind_gpasid)(struct iommu_domain *domain,
>> + struct device *dev, u32 pasid);
>>  
>>  int (*def_domain_type)(struct device *dev);
>>  
> 



Re: [Regression] "iommu/amd: Relax locking in dma_ops path" makes tg3 ethernet transmit queue timeout

2020-08-21 Thread Kai-Heng Feng
Hi Joerg,

> On Aug 21, 2020, at 21:43, Joerg Roedel  wrote:
> 
> Hi Kai,
> 
> On Mon, Jun 29, 2020 at 08:33:22PM +0800, Kai-Heng Feng wrote:
>> I am still seeing the issue on v5.8-rc3. The issue goes away as soon
>> as "iommu=off" is added.
> 
> Can you probably help with debugging this issue? I've had no luck so far
> getting my hands on a machine with tg3 hardware and an AMD IOMMU.

Of course, I still have the system at my side.

The offending commit is 92d420ec028d ("iommu/amd: Relax locking in dma_ops 
path"), however be62dbf554c5 ("iommu/amd: Convert AMD iommu driver to the 
dma-iommu api") removed .map_page entirely so I don't know where to start.

Kai-Heng

> 
> Regards,
> 
>   Joerg



[PATCH v7 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array

2020-08-21 Thread Barry Song
CMA_MAX_NAME should be visible to CMA's users as they might need it to set
the name of CMA areas and avoid hardcoding the size locally.
So this patch moves CMA_MAX_NAME from local header file to include/linux
header file and removes the magic number in hugetlb.c and contiguous.c.

Cc: Mike Kravetz 
Cc: Roman Gushchin 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Andrew Morton 
Signed-off-by: Barry Song 
---
 this patch is fixing the magic number issue with respect to Will's comment 
here:
 
https://lore.kernel.org/linux-iommu/4ab78767553f48a584217063f6f24...@hisilicon.com/

 include/linux/cma.h | 2 ++
 kernel/dma/contiguous.c | 2 +-
 mm/cma.h| 2 --
 mm/hugetlb.c| 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 6ff79fefd01f..217999c8a762 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -18,6 +18,8 @@
 
 #endif
 
+#define CMA_MAX_NAME 64
+
 struct cma;
 
 extern unsigned long totalcma_pages;
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 0383c9b86715..d2d6b715c274 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -119,7 +119,7 @@ void __init dma_pernuma_cma_reserve(void)
 
for_each_online_node(nid) {
int ret;
-   char name[20];
+   char name[CMA_MAX_NAME];
                struct cma **cma = &dma_contiguous_pernuma_area[nid];
 
snprintf(name, sizeof(name), "pernuma%d", nid);
diff --git a/mm/cma.h b/mm/cma.h
index 20f6e24bc477..42ae082cb067 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -4,8 +4,6 @@
 
 #include 
 
-#define CMA_MAX_NAME 64
-
 struct cma {
unsigned long   base_pfn;
unsigned long   count;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..9eec0ea9ba68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5683,12 +5683,12 @@ void __init hugetlb_cma_reserve(int order)
reserved = 0;
for_each_node_state(nid, N_ONLINE) {
int res;
-   char name[20];
+   char name[CMA_MAX_NAME];
 
size = min(per_node, hugetlb_cma_size - reserved);
size = round_up(size, PAGE_SIZE << order);
 
-   snprintf(name, 20, "hugetlb%d", nid);
+   snprintf(name, sizeof(name), "hugetlb%d", nid);
res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
 0, false, name,
                                 &hugetlb_cma[nid], nid);
-- 
2.27.0




[PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Barry Song
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
coherent DMA buffers to save their command queues and page tables. As
there is only one default CMA in the whole system, SMMUs on nodes other
than node0 will get remote memory. This leads to significant latency.

This patch provides per-numa CMA so that drivers like SMMU can get local
memory. Tests show localizing CMA can decrease dma_unmap latency much.
For instance, before this patch, SMMU on node2  has to wait for more than
560ns for the completion of CMD_SYNC in an empty command queue; with this
patch, it needs 240ns only.

A positive side effect of this patch would be improving performance even
further for those users who are worried about performance more than DMA
security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
drivers can get local coherent DMA buffers.
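
As a usage sketch (the "pernuma_cma=" option comes from this patch and
dma_alloc_coherent() is the existing API; the 16M size is only an example
value and the helper name is made up), a driver's coherent allocation needs
no change to benefit:

/* Boot with "pernuma_cma=16M" (example size) on the kernel command line. */
#include <linux/dma-mapping.h>

static void *alloc_cmdq_mem(struct device *dev, size_t size, dma_addr_t *dma)
{
	/*
	 * Unchanged driver code: the buffer is served from the CMA area local
	 * to dev_to_node(dev) when one was reserved, and falls back to the
	 * global default CMA area otherwise.
	 */
	return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
}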

Cc: Jonathan Cameron 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 -v7: with respect to Will's comments
 * move to use for_each_online_node
 * add description if users don't specify pernuma_cma
 * provide default value for CONFIG_DMA_PERNUMA_CMA

 .../admin-guide/kernel-parameters.txt |  11 ++
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  11 ++
 kernel/dma/contiguous.c   | 100 --
 4 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..c609527fc35a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -599,6 +599,17 @@
altogether. For more information, see
include/linux/dma-contiguous.h
 
+   pernuma_cma=nn[MG]
+   [ARM64,KNL]
+   Sets the size of kernel per-numa memory area for
+   contiguous memory allocations. A value of 0 disables
+   per-numa CMA altogether. If this option is not
+   specified, the default value is 0.
+   With per-numa CMA enabled, DMA users on node nid will
+   first try to allocate buffer from the pernuma area
+   which is located in node nid, if the allocation fails,
+   they will fallback to the global default memory area.
+
cmo_free_hint=  [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed.  This is used in CMO environments
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
index 03f8e98e3bcc..fe55e004f1f4 100644
--- a/include/linux/dma-contiguous.h
+++ b/include/linux/dma-contiguous.h
@@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct device *dev, 
struct page *page,
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void dma_pernuma_cma_reserve(void);
+#else
+static inline void dma_pernuma_cma_reserve(void) { }
+#endif
+
 #endif
 
 #endif
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 847a9d1fa634..c38979d45b13 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -118,6 +118,17 @@ config DMA_CMA
  If unsure, say "n".
 
 if  DMA_CMA
+
+config DMA_PERNUMA_CMA
+   bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"
+   default NUMA && ARM64
+   help
+ Enable this option to get pernuma CMA areas so that devices like
+ ARM64 SMMU can get local memory by DMA coherent APIs.
+
+ You can set the size of pernuma CMA by specifying "pernuma_cma=size"
+ on the kernel's command line.
+
 comment "Default contiguous memory area size:"
 
 config CMA_SIZE_MBYTES
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index cff7e60968b9..0383c9b86715 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -69,6 +69,19 @@ static int __init early_cma(char *p)
 }
 early_param("cma", early_cma);
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+
+static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
+static phys_addr_t pernuma_size_bytes __initdata;
+
+static int __init early_pernuma_cma(char *p)
+{
+   pernuma_size_bytes = memparse(p, &p);
+   return 0;
+}
+early_param("pernuma_cma", early_pernuma_cma);
+#endif
+
 #ifdef CONFIG_CMA_SIZE_PERCENTAGE
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
@@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t 
cma_early_percent_memory(void)
 
 #endif
 
+#ifdef CONFIG_DMA_PERNUMA_CMA
+void __init dma_pernuma_cma_reserve(void)
+{
+   int nid;
+
+   if (!pernuma_size_bytes)
+   return;
+
+   

[PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Barry Song
Ganapatrao Kulkarni has put some effort into making arm-smmu-v3 use local
memory to save command queues[1]. I also did a similar job in the patch
"iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
[2] without realizing Ganapatrao had done that before.

But it seems much better to make dma_alloc_coherent() inherently
NUMA-aware on NUMA-capable systems.

Right now, smmu is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3 etc.
Saving queues and tables remotely will increase the latency of ARM SMMU
significantly. For example, when SMMU is at node2 and the default global
CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
have to wait more than 550ns for the completion of the command CMD_SYNC.
However, if we save them locally, we only need to wait for 240ns.

With per-numa CMA, SMMU will get memory from the local NUMA node to save
command queues and page tables. That means dma_unmap latency will shrink
significantly.

Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid traveling
between NUMA nodes.

[1] https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
[2] https://www.spinics.net/lists/iommu/msg44767.html


-v7:
 * add Will's acked-by for the change in arch/arm64
 * some cleanup with respect to Will's comments
 * add patch 3/3 to remove the hardcode of defining the size of cma name.
   this patch requires some header file change in include/linux

-v6:
 * rebase on top of 5.9-rc1
 * doc cleanup

-v5:
 refine code according to Christoph Hellwig's comments
 * remove Kconfig option for pernuma cma size;
 * add Kconfig option for pernuma cma enable;
 * code cleanup like line over 80 char

 I haven't removed the cma NULL check code in cma_alloc() as it requires
 a bundle of other changes. So I prefer to handle this issue separately.

-v4:
 * rebase on top of Christoph Hellwig's patch:
 [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous
 https://lore.kernel.org/linux-iommu/20200723120133.94105-1-...@lst.de/
 * cleanup according to Christoph's comment
 * rebase on top of linux-next to avoid arch/arm64 conflicts
 * reserve cma by checking N_MEMORY rather than N_ONLINE

-v3:
  * move to use page_to_nid() while freeing cma with respect to Robin's
  comment, but this will only work after applying my below patch:
  "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
  https://marc.info/?l=linux-mm&m=159333034726647&w=2

  * handle the case count <= 1 more properly according to Robin's
  comment;

  * add pernuma_cma parameter to support dynamic setting of per-numa
  cma size;
  ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
  "cma=" kernel parameter and avoid a new paramter separately for per-
  numa cma. Practically, it is really too complicated considering the
  below problems:
  (1) if we leverage the size of default numa for per-numa, we have to
  avoid creating two cma with same size in node0 since default cma is
  probably on node0.
  (2) default cma can consider the address limitation for old devices
  while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
  allocations with limitation flags will fallback to default one.
  (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
  decide if the percentage should apply to the whole memory size
  or only apply to the memory size of a specific numa node.
  (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
  makes things even more complicated to per-numa cma.

  I haven't figured out a good way to leverage the size of default cma
  for per-numa cma. it seems a separate parameter for per-numa could
  make life easier.

  * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
  reuse the comment before hugetlb_cma_reserve() with respect to
  Robin's comment

-v2: 
  * fix some issues reported by kernel test robot
  * fallback to default cma while allocation fails in per-numa cma
  * free memory properly

Barry Song (3):
  dma-contiguous: provide the ability to reserve per-numa CMA
  arm64: mm: reserve per-numa CMA to localize coherent dma buffers
  mm: cma: use CMA_MAX_NAME to define the length of cma name array

 .../admin-guide/kernel-parameters.txt |  11 ++
 arch/arm64/mm/init.c  |   2 +
 include/linux/cma.h   |   2 +
 include/linux/dma-contiguous.h|   6 ++
 kernel/dma/Kconfig|  11 ++
 kernel/dma/contiguous.c   | 100 --
 mm/cma.h  |   2 -
 mm/hugetlb.c  |   4 +-
 8 files changed, 124 insertions(+), 14 deletions(-)

-- 
2.27.0



RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 21, 2020 8:47 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..3f33b89aeab5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,15 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> 
> What is the default behaviour if this option is not specified? Seems like
> that should be mentioned here.
> 
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..db7a37ed35eb 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,16 @@ config DMA_CMA
> >   If unsure, say "n".
> >
> >  if  DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +   bool "Enable separate DMA Contiguous Memory Area for each NUMA
> Node"
> 
> I don't understand the need for this config option. If you have DMA_DMA and
> you have NUMA, why wouldn't you want this enabled?

Christoph preferred this in the previous patchset so that all of the code
can be removed from the kernel text if users don't use pernuma CMA.

> 
> > +   help
> > + Enable this option to get pernuma CMA areas so that devices like
> > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > + You can set the size of pernuma CMA by specifying
> "pernuma_cma=size"
> > + on the kernel's command line.
> > +
> >  comment "Default contiguous memory area size:"
> >
> >  config CMA_SIZE_MBYTES
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..89b95f10e56d 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> > +
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +   pernuma_size_bytes = memparse(p, &p);
> > +   return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +#endif
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused
> cma_early_percent_memory(void)
> > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t
> cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +   int nid;
> > +
> > +   if (!pernuma_size_bytes)
> > +   return;
> 
> If this is useful (I assume it is), then I think we should have a non-zero
> default value, a bit like normal CMA does via CMA_SIZE_MBYTES.

The patchset used to have a CONFIG_PERNUMA_CMA_SIZE in kernel/dma/Kconfig,
but Christoph was not comfortable with it:
https://lore.kernel.org/linux-iommu/20200728115231.ga...@lst.de/

Would you mind hardcoding the value in CONFIG_CMDLINE in arch/arm64/Kconfig,
as Christoph mentioned:
config CMDLINE
default "pernuma_cma=16M"

If you also don't like the change to CMDLINE in arch/arm64/Kconfig, I guess
I have to depend on users setting it on the command line, just like
hugetlb_cma.

> 
> > +   for_each_node_state(nid, N_ONLINE) {
> 
> for_each_online_node() {
> 
> > +   int ret;
> > +   char name[20];
> 
> 20?
> 
> Ah, wait, this is copy-pasta from hugetlb_cma_reserve(). Can you factor out
> the common parts at all?

Actually I have a "#define CMA_MAX_NAME 64" in this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18e98e56f440

The 20 in hugetlb_cma_reserve() was also made by me. If you are not
comfortable with it, I can move to CMA_MAX_NAME. Do you think it really
matters here? 20 seems to be long

[PATCH v3 2/6] iommu/virtio: Add topology helpers

2020-08-21 Thread Jean-Philippe Brucker
To support topology description from ACPI and from the builtin
description, add helpers to keep track of I/O topology descriptors.

To ease re-use of the helpers by other drivers and future ACPI
extensions, use the "virt_" prefix rather than "virtio_" when naming
structs and functions.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig   |   3 +
 drivers/iommu/virtio/Makefile   |   1 +
 drivers/iommu/virtio/topology-helpers.h |  50 ++
 include/linux/virt_iommu.h  |  15 ++
 drivers/iommu/virtio/topology-helpers.c | 196 
 drivers/iommu/virtio/virtio-iommu.c |   4 +
 MAINTAINERS |   1 +
 7 files changed, 270 insertions(+)
 create mode 100644 drivers/iommu/virtio/topology-helpers.h
 create mode 100644 include/linux/virt_iommu.h
 create mode 100644 drivers/iommu/virtio/topology-helpers.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index bef5d75e306b..e29ae50f7100 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -391,4 +391,7 @@ config VIRTIO_IOMMU
 
  Say Y here if you intend to run this kernel as a guest.
 
+config VIRTIO_IOMMU_TOPOLOGY_HELPERS
+   bool
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile
index 279368fcc074..b42ad47eac7e 100644
--- a/drivers/iommu/virtio/Makefile
+++ b/drivers/iommu/virtio/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS) += topology-helpers.o
diff --git a/drivers/iommu/virtio/topology-helpers.h 
b/drivers/iommu/virtio/topology-helpers.h
new file mode 100644
index ..436ca6a900c5
--- /dev/null
+++ b/drivers/iommu/virtio/topology-helpers.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef TOPOLOGY_HELPERS_H_
+#define TOPOLOGY_HELPERS_H_
+
+#ifdef CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS
+
+/* Identify a device node in the topology */
+struct virt_topo_dev_id {
+   unsigned inttype;
+#define VIRT_TOPO_DEV_TYPE_PCI 1
+#define VIRT_TOPO_DEV_TYPE_MMIO2
+   union {
+   /* PCI endpoint or range */
+   struct {
+   u16 segment;
+   u16 bdf_start;
+   u16 bdf_end;
+   };
+   /* MMIO region */
+   u64 base;
+   };
+};
+
+/* Specification of an IOMMU */
+struct virt_topo_iommu {
+   struct virt_topo_dev_id dev_id;
+   struct device   *dev; /* transport device */
+   struct fwnode_handle*fwnode;
+   struct iommu_ops*ops;
+   struct list_headlist;
+};
+
+/* Specification of an endpoint */
+struct virt_topo_endpoint {
+   struct virt_topo_dev_id dev_id;
+   u32 endpoint_id;
+   struct virt_topo_iommu  *viommu;
+   struct list_headlist;
+};
+
+void virt_topo_add_endpoint(struct virt_topo_endpoint *ep);
+void virt_topo_add_iommu(struct virt_topo_iommu *viommu);
+
+void virt_topo_set_iommu_ops(struct device *dev, struct iommu_ops *ops);
+
+#else /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+static inline void virt_topo_set_iommu_ops(struct device *dev, struct 
iommu_ops *ops)
+{ }
+#endif /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+#endif /* TOPOLOGY_HELPERS_H_ */
diff --git a/include/linux/virt_iommu.h b/include/linux/virt_iommu.h
new file mode 100644
index ..17d2bd4732e0
--- /dev/null
+++ b/include/linux/virt_iommu.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef VIRT_IOMMU_H_
+#define VIRT_IOMMU_H_
+
+#ifdef CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS
+int virt_dma_configure(struct device *dev);
+
+#else /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+static inline int virt_dma_configure(struct device *dev)
+{
+   /* Don't disturb the normal DMA configuration methods */
+   return 0;
+}
+#endif /* !CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS */
+#endif /* VIRT_IOMMU_H_ */
diff --git a/drivers/iommu/virtio/topology-helpers.c 
b/drivers/iommu/virtio/topology-helpers.c
new file mode 100644
index ..8815e3a5d431
--- /dev/null
+++ b/drivers/iommu/virtio/topology-helpers.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "topology-helpers.h"
+
+static LIST_HEAD(viommus);
+static LIST_HEAD(pci_endpoints);
+static LIST_HEAD(mmio_endpoints);
+static DEFINE_MUTEX(viommus_lock);
+
+static bool virt_topo_device_match(struct device *dev,
+  struct virt_topo_dev_id *id)
+{
+   if (id->type == VIRT_TOPO_DEV_TYPE_PCI && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   

[PATCH v3 1/6] iommu/virtio: Move to drivers/iommu/virtio/

2020-08-21 Thread Jean-Philippe Brucker
Before adding new files to the virtio-iommu driver, move it to its own
subfolder, similarly to other IOMMU drivers.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Makefile| 3 +--
 drivers/iommu/virtio/Makefile | 2 ++
 drivers/iommu/{ => virtio}/virtio-iommu.c | 0
 MAINTAINERS   | 2 +-
 4 files changed, 4 insertions(+), 3 deletions(-)
 create mode 100644 drivers/iommu/virtio/Makefile
 rename drivers/iommu/{ => virtio}/virtio-iommu.c (100%)

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 11f1771104f3..fc7523042512 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/
+obj-y += amd/ intel/ arm/ virtio/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
@@ -26,4 +26,3 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
-obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile
new file mode 100644
index ..279368fcc074
--- /dev/null
+++ b/drivers/iommu/virtio/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio/virtio-iommu.c
similarity index 100%
rename from drivers/iommu/virtio-iommu.c
rename to drivers/iommu/virtio/virtio-iommu.c
diff --git a/MAINTAINERS b/MAINTAINERS
index deaafb617361..3602b223c9b2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18451,7 +18451,7 @@ VIRTIO IOMMU DRIVER
 M: Jean-Philippe Brucker 
 L: virtualizat...@lists.linux-foundation.org
 S: Maintained
-F: drivers/iommu/virtio-iommu.c
+F: drivers/iommu/virtio/
 F: include/uapi/linux/virtio_iommu.h
 
 VIRTIO MEM DRIVER
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 5/6] iommu/virtio: Support topology description in config space

2020-08-21 Thread Jean-Philippe Brucker
Platforms without device-tree nor ACPI can provide a topology
description embedded into the virtio config space. Parse it.

Use PCI FIXUP to probe the config space early, because we need to
discover the topology before any DMA configuration takes place, and the
virtio driver may be loaded much later. Since we discover the topology
description when probing the PCI hierarchy, the virtual IOMMU cannot
manage other platform devices discovered earlier.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig   |  12 ++
 drivers/iommu/virtio/Makefile   |   1 +
 drivers/iommu/virtio/topology.c | 259 
 3 files changed, 272 insertions(+)
 create mode 100644 drivers/iommu/virtio/topology.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e29ae50f7100..98d28fdbc19a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -394,4 +394,16 @@ config VIRTIO_IOMMU
 config VIRTIO_IOMMU_TOPOLOGY_HELPERS
bool
 
+config VIRTIO_IOMMU_TOPOLOGY
+   bool "Handle topology properties from the virtio-iommu"
+   depends on VIRTIO_IOMMU
+   depends on PCI
+   default y
+   select VIRTIO_IOMMU_TOPOLOGY_HELPERS
+   help
+ Enable early probing of virtio-iommu devices to detect the built-in
+ topology description.
+
+ Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/virtio/Makefile b/drivers/iommu/virtio/Makefile
index b42ad47eac7e..1eda8ca1cbbf 100644
--- a/drivers/iommu/virtio/Makefile
+++ b/drivers/iommu/virtio/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY) += topology.o
 obj-$(CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS) += topology-helpers.o
diff --git a/drivers/iommu/virtio/topology.c b/drivers/iommu/virtio/topology.c
new file mode 100644
index ..4923eec618b9
--- /dev/null
+++ b/drivers/iommu/virtio/topology.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "topology-helpers.h"
+
+struct viommu_cap_config {
+   u8 bar;
+   u32 length; /* structure size */
+   u32 offset; /* structure offset within the bar */
+};
+
+struct viommu_topo_header {
+   u8 type;
+   u8 reserved;
+   u16 length;
+};
+
+static struct virt_topo_endpoint *
+viommu_parse_node(void __iomem *buf, size_t len)
+{
+   int ret = -EINVAL;
+   union {
+   struct viommu_topo_header hdr;
+   struct virtio_iommu_topo_pci_range pci;
+   struct virtio_iommu_topo_mmio mmio;
+   } __iomem *cfg = buf;
+   struct virt_topo_endpoint *spec;
+
+   spec = kzalloc(sizeof(*spec), GFP_KERNEL);
+   if (!spec)
+   return ERR_PTR(-ENOMEM);
+
+   switch (ioread8(&cfg->hdr.type)) {
+   case VIRTIO_IOMMU_TOPO_PCI_RANGE:
+   if (len < sizeof(cfg->pci))
+   goto err_free;
+
+   spec->dev_id.type = VIRT_TOPO_DEV_TYPE_PCI;
+   spec->dev_id.segment = ioread16(&cfg->pci.segment);
+   spec->dev_id.bdf_start = ioread16(&cfg->pci.bdf_start);
+   spec->dev_id.bdf_end = ioread16(&cfg->pci.bdf_end);
+   spec->endpoint_id = ioread32(&cfg->pci.endpoint_start);
+   break;
+   case VIRTIO_IOMMU_TOPO_MMIO:
+   if (len < sizeof(cfg->mmio))
+   goto err_free;
+
+   spec->dev_id.type = VIRT_TOPO_DEV_TYPE_MMIO;
+   spec->dev_id.base = ioread64(&cfg->mmio.address);
+   spec->endpoint_id = ioread32(&cfg->mmio.endpoint);
+   break;
+   default:
+   pr_warn("unhandled format 0x%x\n", ioread8(&cfg->hdr.type));
+   ret = 0;
+   goto err_free;
+   }
+   return spec;
+
+err_free:
+   kfree(spec);
+   return ERR_PTR(ret);
+}
+
+static int viommu_parse_topology(struct device *dev,
+struct virtio_iommu_config __iomem *cfg,
+size_t max_len)
+{
+   int ret;
+   u16 len;
+   size_t i;
+   LIST_HEAD(endpoints);
+   size_t offset, count;
+   struct virt_topo_iommu *viommu;
+   struct virt_topo_endpoint *ep, *next;
+   struct viommu_topo_header __iomem *cur;
+
+   offset = ioread16(&cfg->topo_config.offset);
+   count = ioread16(&cfg->topo_config.count);
+   if (!offset || !count)
+   return 0;
+
+   viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
+   if (!viommu)
+   return -ENOMEM;
+
+   viommu->dev = dev;
+
+   for (i = 0; i < count; i++, offset += len) {
+   if (offset + sizeof(*cur) > max_len) {
+   ret = -EOVERFLOW;
+   goto err_free;
+   }
+
+   cur = (void __iomem *)cfg + 

[PATCH v3 6/6] iommu/virtio: Enable x86 support

2020-08-21 Thread Jean-Philippe Brucker
With the built-in topology description in place, x86 platforms can now
use the virtio-iommu.

Architectures that use the generic iommu_dma_ops should normally select
CONFIG_IOMMU_DMA themselves (from arch/*/Kconfig). Since not all x86
drivers have been converted yet, it's currently up to the IOMMU Kconfig
to select it.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 98d28fdbc19a..d7cf158745eb 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -383,8 +383,9 @@ config HYPERV_IOMMU
 config VIRTIO_IOMMU
tristate "Virtio IOMMU driver"
depends on VIRTIO
-   depends on ARM64
+   depends on (ARM64 || X86)
select IOMMU_API
+   select IOMMU_DMA if X86
select INTERVAL_TREE
help
  Para-virtualised IOMMU driver with virtio.
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 0/6] Add virtio-iommu built-in topology

2020-08-21 Thread Jean-Philippe Brucker
Add a topology description to the virtio-iommu driver and enable x86
platforms.

Since [v2] we have made some progress on adding ACPI support for
virtio-iommu, which is the preferred boot method on x86. It will be a
new vendor-agnostic table describing para-virtual topologies in a
minimal format. However some platforms don't use either ACPI or DT for
booting (for example microvm), and will need the alternative topology
description method proposed here. In addition, since the process to get
a new ACPI table will take a long time, this provides a boot method even
to ACPI-based platforms, if only temporarily for testing and
development.

v3:
* Add patch 1 that moves virtio-iommu to a subfolder.
* Split the rest:
  * Patch 2 adds topology-helper.c, which will be shared with the ACPI
support.
  * Patch 4 adds definitions.
  * Patch 5 adds parser in topology.c.
* Address other comments.

Linux and QEMU patches available at:
https://jpbrucker.net/git/linux virtio-iommu/devel
https://jpbrucker.net/git/qemu virtio-iommu/devel

[spec] https://lists.oasis-open.org/archives/virtio-dev/202008/msg00067.html
[v2] 
https://lore.kernel.org/linux-iommu/20200228172537.377327-1-jean-phili...@linaro.org/
[v1] 
https://lore.kernel.org/linux-iommu/20200214160413.1475396-1-jean-phili...@linaro.org/
[rfc] 
https://lore.kernel.org/linux-iommu/20191122105000.800410-1-jean-phili...@linaro.org/

Jean-Philippe Brucker (6):
  iommu/virtio: Move to drivers/iommu/virtio/
  iommu/virtio: Add topology helpers
  PCI: Add DMA configuration for virtual platforms
  iommu/virtio: Add topology definitions
  iommu/virtio: Support topology description in config space
  iommu/virtio: Enable x86 support

 drivers/iommu/Kconfig |  18 +-
 drivers/iommu/Makefile|   3 +-
 drivers/iommu/virtio/Makefile |   4 +
 drivers/iommu/virtio/topology-helpers.h   |  50 +
 include/linux/virt_iommu.h|  15 ++
 include/uapi/linux/virtio_iommu.h |  44 
 drivers/iommu/virtio/topology-helpers.c   | 196 
 drivers/iommu/virtio/topology.c   | 259 ++
 drivers/iommu/{ => virtio}/virtio-iommu.c |   4 +
 drivers/pci/pci-driver.c  |   5 +
 MAINTAINERS   |   3 +-
 11 files changed, 597 insertions(+), 4 deletions(-)
 create mode 100644 drivers/iommu/virtio/Makefile
 create mode 100644 drivers/iommu/virtio/topology-helpers.h
 create mode 100644 include/linux/virt_iommu.h
 create mode 100644 drivers/iommu/virtio/topology-helpers.c
 create mode 100644 drivers/iommu/virtio/topology.c
 rename drivers/iommu/{ => virtio}/virtio-iommu.c (99%)

-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 3/6] PCI: Add DMA configuration for virtual platforms

2020-08-21 Thread Jean-Philippe Brucker
Hardware platforms usually describe the IOMMU topology using either
device-tree pointers or vendor-specific ACPI tables.  For virtual
platforms that don't provide a device-tree, the virtio-iommu device
contains a description of the endpoints it manages.  That information
allows us to probe endpoints after the IOMMU is probed (possibly as late
as userspace modprobe), provided it is discovered early enough.

Add a hook to pci_dma_configure(), which returns -EPROBE_DEFER if the
endpoint is managed by a vIOMMU that will be loaded later, or 0 in any
other case to avoid disturbing the normal DMA configuration methods.
When CONFIG_VIRTIO_IOMMU_TOPOLOGY_HELPERS isn't selected, the call to
virt_dma_configure() is compiled out.

As long as the information is consistent, platforms can provide both a
device-tree and a built-in topology, and the IOMMU infrastructure is
able to deal with multiple DMA configuration methods.

Acked-by: Bjorn Helgaas 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/pci-driver.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 449466f71040..dbe9d33606b0 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include <linux/virt_iommu.h>
 #include "pci.h"
 #include "pcie/portdrv.h"
 
@@ -1605,6 +1606,10 @@ static int pci_dma_configure(struct device *dev)
struct device *bridge;
int ret = 0;
 
+   ret = virt_dma_configure(dev);
+   if (ret)
+   return ret;
+
bridge = pci_get_host_bridge_device(to_pci_dev(dev));
 
if (IS_ENABLED(CONFIG_OF) && bridge->parent &&
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3 4/6] iommu/virtio: Add topology definitions

2020-08-21 Thread Jean-Philippe Brucker
Add struct definitions for describing endpoints managed by the
virtio-iommu. When VIRTIO_IOMMU_F_TOPOLOGY is offered, an array of
virtio_iommu_topo_* structures in config space describes the endpoints,
identified either by their PCI BDF or their physical MMIO address.

Signed-off-by: Jean-Philippe Brucker 
---
 include/uapi/linux/virtio_iommu.h | 44 +++
 1 file changed, 44 insertions(+)

diff --git a/include/uapi/linux/virtio_iommu.h 
b/include/uapi/linux/virtio_iommu.h
index 237e36a280cb..70cba30644d5 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -16,6 +16,7 @@
 #define VIRTIO_IOMMU_F_BYPASS  3
 #define VIRTIO_IOMMU_F_PROBE   4
 #define VIRTIO_IOMMU_F_MMIO5
+#define VIRTIO_IOMMU_F_TOPOLOGY6
 
 struct virtio_iommu_range_64 {
__le64  start;
@@ -27,6 +28,17 @@ struct virtio_iommu_range_32 {
__le32  end;
 };
 
+struct virtio_iommu_topo_config {
+   /* Number of topology description structures */
+   __le16  count;
+   /*
+* Offset to the first topology description structure
+* (virtio_iommu_topo_*) from the start of the virtio_iommu config
+* space. Aligned on 8 bytes.
+*/
+   __le16  offset;
+};
+
 struct virtio_iommu_config {
/* Supported page sizes */
__le64  page_size_mask;
@@ -36,6 +48,38 @@ struct virtio_iommu_config {
struct virtio_iommu_range_32domain_range;
/* Probe buffer size */
__le32  probe_size;
+   struct virtio_iommu_topo_config topo_config;
+};
+
+#define VIRTIO_IOMMU_TOPO_PCI_RANGE0x1
+#define VIRTIO_IOMMU_TOPO_MMIO 0x2
+
+struct virtio_iommu_topo_pci_range {
+   /* VIRTIO_IOMMU_TOPO_PCI_RANGE */
+   __u8type;
+   __u8reserved;
+   /* Length of this structure */
+   __le16  length;
+   /* First endpoint ID in the range */
+   __le32  endpoint_start;
+   /* PCI domain number */
+   __le16  segment;
+   /* PCI Bus:Device.Function range */
+   __le16  bdf_start;
+   __le16  bdf_end;
+   __le16  padding;
+};
+
+struct virtio_iommu_topo_mmio {
+   /* VIRTIO_IOMMU_TOPO_MMIO */
+   __u8type;
+   __u8reserved;
+   /* Length of this structure */
+   __le16  length;
+   /* Endpoint ID */
+   __le32  endpoint;
+   /* Address of the first MMIO region */
+   __le64  address;
 };
 
 /* Request types */
-- 
2.28.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Will Deacon
On Fri, Aug 21, 2020 at 09:13:39AM +, Song Bao Hua (Barry Song) wrote:
> 
> 
> > -Original Message-
> > From: Will Deacon [mailto:w...@kernel.org]
> > Sent: Friday, August 21, 2020 8:47 PM
> > To: Song Bao Hua (Barry Song) 
> > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > iommu@lists.linux-foundation.org; Linuxarm ;
> > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > huangdaode ; Jonathan Cameron
> > ; Nicolas Saenz Julienne
> > ; Steve Capper ; Andrew
> > Morton ; Mike Rapoport 
> > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> > per-numa CMA
> > 
> > On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> > b/Documentation/admin-guide/kernel-parameters.txt
> > > index bdc1f33fd3d1..3f33b89aeab5 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -599,6 +599,15 @@
> > >   altogether. For more information, see
> > >   include/linux/dma-contiguous.h
> > >
> > > + pernuma_cma=nn[MG]
> > > + [ARM64,KNL]
> > > + Sets the size of kernel per-numa memory area for
> > > + contiguous memory allocations. A value of 0 disables
> > > + per-numa CMA altogether. DMA users on node nid will
> > > + first try to allocate buffer from the pernuma area
> > > + which is located in node nid, if the allocation fails,
> > > + they will fallback to the global default memory area.
> > 
> > What is the default behaviour if this option is not specified? Seems like
> > that should be mentioned here.

Just wanted to make sure you didn't miss this ^^

> > 
> > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > > index 847a9d1fa634..db7a37ed35eb 100644
> > > --- a/kernel/dma/Kconfig
> > > +++ b/kernel/dma/Kconfig
> > > @@ -118,6 +118,16 @@ config DMA_CMA
> > > If unsure, say "n".
> > >
> > >  if  DMA_CMA
> > > +
> > > +config DMA_PERNUMA_CMA
> > > + bool "Enable separate DMA Contiguous Memory Area for each NUMA
> > Node"
> > 
> > I don't understand the need for this config option. If you have DMA_DMA and
> > you have NUMA, why wouldn't you want this enabled?
> 
> Christoph preferred this in previous patchset in order to be able to remove 
> all of the code
> in the text if users don't use pernuma CMA.

Ok, I defer to Christoph here, but maybe a "default NUMA" might work?

> > > + help
> > > +   Enable this option to get pernuma CMA areas so that devices like
> > > +   ARM64 SMMU can get local memory by DMA coherent APIs.
> > > +
> > > +   You can set the size of pernuma CMA by specifying
> > "pernuma_cma=size"
> > > +   on the kernel's command line.
> > > +
> > >  comment "Default contiguous memory area size:"
> > >
> > >  config CMA_SIZE_MBYTES
> > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > > index cff7e60968b9..89b95f10e56d 100644
> > > --- a/kernel/dma/contiguous.c
> > > +++ b/kernel/dma/contiguous.c
> > > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> > >  }
> > >  early_param("cma", early_cma);
> > >
> > > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > > +
> > > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > > +static phys_addr_t pernuma_size_bytes __initdata;
> > > +
> > > +static int __init early_pernuma_cma(char *p)
> > > +{
> > > + pernuma_size_bytes = memparse(p, &p);
> > > + return 0;
> > > +}
> > > +early_param("pernuma_cma", early_pernuma_cma);
> > > +#endif
> > > +
> > >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> > >
> > >  static phys_addr_t __init __maybe_unused
> > cma_early_percent_memory(void)
> > > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t
> > cma_early_percent_memory(void)
> > >
> > >  #endif
> > >
> > > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > > +void __init dma_pernuma_cma_reserve(void)
> > > +{
> > > + int nid;
> > > +
> > > + if (!pernuma_size_bytes)
> > > + return;
> > 
> > If this is useful (I assume it is), then I think we should have a non-zero
> > default value, a bit like normal CMA does via CMA_SIZE_MBYTES.
> 
> The patchet used to have a CONFIG_PERNUMA_CMA_SIZE in kernel/dma/Kconfig,
> but Christoph was not comfortable with it:
> https://lore.kernel.org/linux-iommu/20200728115231.ga...@lst.de/
> 
> Would you mind to hardcode the value in CONFIG_CMDLINE in arch/arm64/Kconfig 
> as Christoph mentioned:
> config CMDLINE
>   default "pernuma_cma=16M"
> 
> If you also don't like the change in arch/arm64/Kconfig CMDLINE, I guess I
> have to depend on users' setting in cmdline just like hugetlb_cma.

Again, I defer to Christoph for this code, so leave it like it is.
However, the same argument applies to CMA_SIZE_MBYTES afaict, and I'm mainly
looking for consistency.

> > > + 

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 21, 2020 9:27 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 09:13:39AM +, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > -Original Message-
> > > From: Will Deacon [mailto:w...@kernel.org]
> > > Sent: Friday, August 21, 2020 8:47 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> > > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > > iommu@lists.linux-foundation.org; Linuxarm ;
> > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > > huangdaode ; Jonathan Cameron
> > > ; Nicolas Saenz Julienne
> > > ; Steve Capper ;
> > > Andrew Morton ; Mike Rapoport
> > > 
> > > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to
> > > reserve per-numa CMA
> > >
> > > On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> > > b/Documentation/admin-guide/kernel-parameters.txt
> > > > index bdc1f33fd3d1..3f33b89aeab5 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -599,6 +599,15 @@
> > > > altogether. For more information, see
> > > > include/linux/dma-contiguous.h
> > > >
> > > > +   pernuma_cma=nn[MG]
> > > > +   [ARM64,KNL]
> > > > +   Sets the size of kernel per-numa memory area for
> > > > +   contiguous memory allocations. A value of 0 
> > > > disables
> > > > +   per-numa CMA altogether. DMA users on node nid 
> > > > will
> > > > +   first try to allocate buffer from the pernuma 
> > > > area
> > > > +   which is located in node nid, if the allocation 
> > > > fails,
> > > > +   they will fallback to the global default memory 
> > > > area.
> > >
> > > What is the default behaviour if this option is not specified? Seems
> > > like that should be mentioned here.
> 
> Just wanted to make sure you didn't miss this ^^

If it is not specified, the default size is 0, which means pernuma_cma is
disabled.

Will put some words for this.

> 
> > >
> > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index
> > > > 847a9d1fa634..db7a37ed35eb 100644
> > > > --- a/kernel/dma/Kconfig
> > > > +++ b/kernel/dma/Kconfig
> > > > @@ -118,6 +118,16 @@ config DMA_CMA
> > > >   If unsure, say "n".
> > > >
> > > >  if  DMA_CMA
> > > > +
> > > > +config DMA_PERNUMA_CMA
> > > > +   bool "Enable separate DMA Contiguous Memory Area for each
> NUMA
> > > Node"
> > >
> > > I don't understand the need for this config option. If you have
> > > DMA_DMA and you have NUMA, why wouldn't you want this enabled?
> >
> > Christoph preferred this in previous patchset in order to be able to
> > remove all of the code in the text if users don't use pernuma CMA.
> 
> Ok, I defer to Christoph here, but maybe a "default NUMA" might work?

maybe "default NUMA && ARM64"?
Though I believe it will benefit x86, but I don't have a x86 server hardware
and real scenario to test. So I haven't put the dma_pernuma_cma_reserve()
code in arch/x86.
Hopefully some x86 guys will bring it up and remove the "&& ARM64".

> 
> > > > +   help
> > > > + Enable this option to get pernuma CMA areas so that devices 
> > > > like
> > > > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > > > +
> > > > + You can set the size of pernuma CMA by specifying
> > > "pernuma_cma=size"
> > > > + on the kernel's command line.
> > > > +
> > > >  comment "Default contiguous memory area size:"
> > > >
> > > >  config CMA_SIZE_MBYTES
> > > > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > > > index cff7e60968b9..89b95f10e56d 100644
> > > > --- a/kernel/dma/contiguous.c
> > > > +++ b/kernel/dma/contiguous.c
> > > > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)  }
> > > > early_param("cma", early_cma);
> > > >
> > > > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > > > +
> > > > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > > > +static phys_addr_t pernuma_size_bytes __initdata;
> > > > +
> > > > +static int __init early_pernuma_cma(char *p) {
> > > > +   pernuma_size_bytes = memparse(p, &p);
> > > > +   return 0;
> > > > +}
> > > > +early_param("pernuma_cma", 

[PATCH v7 2/3] arm64: mm: reserve per-numa CMA to localize coherent dma buffers

2020-08-21 Thread Barry Song
Right now, SMMU is using dma_alloc_coherent() to get memory to save queues
and tables. Typically, on an ARM64 server, there is a default CMA located at
node0, which could be far away from node2, node3, etc.
With this patch, SMMU will get memory from the local NUMA node to save command
queues and page tables. That means dma_unmap latency will shrink considerably.
Meanwhile, when iommu.passthrough is on, device drivers which call
dma_alloc_coherent() will also get local memory and avoid the travel between
NUMA nodes.

Acked-by: Will Deacon 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Ganapatrao Kulkarni 
Cc: Catalin Marinas 
Cc: Nicolas Saenz Julienne 
Cc: Steve Capper 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Signed-off-by: Barry Song 
---
 -v7: add Will's acked-by

 arch/arm64/mm/init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..f1c75957ff3c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -429,6 +429,8 @@ void __init bootmem_init(void)
arm64_hugetlb_cma_reserve();
 #endif
 
+   dma_pernuma_cma_reserve();
+
/*
 * sparse_init() tries to allocate memory from memblock, so must be
 * done after the fixed reservations
-- 
2.27.0


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Mike Kravetz
Hi Barry,
Sorry for jumping in so late.

On 8/21/20 4:33 AM, Barry Song wrote:
> 
> with per-numa CMA, smmu will get memory from local numa node to save command
> queues and page tables. that means dma_unmap latency will be shrunk much.

Since per-node CMA areas for hugetlb were introduced, I have been thinking
about the limited number of CMA areas.  In most configurations, I believe
it is limited to 7.  And, IIRC it is not something that can be changed at
runtime, you need to reconfig and rebuild to increase the number.  In contrast
some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
the limited hugetlb use case.  However, this series is adding another user
of per-node CMA areas.

With more users, should we try to sync up the number of CMA areas and the
number of nodes?  Or, perhaps I am worrying about nothing?
-- 
Mike Kravetz
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Mike Kravetz
On 8/21/20 1:47 PM, Song Bao Hua (Barry Song) wrote:
> 
> 
>> -Original Message-
>> From: Song Bao Hua (Barry Song)
>> Sent: Saturday, August 22, 2020 7:27 AM
>> To: 'Mike Kravetz' ; h...@lst.de;
>> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
>> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
>> a...@linux-foundation.org
>> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
>> linux-ker...@vger.kernel.org; Zengtao (B) ;
>> huangdaode ; Linuxarm 
>> Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
>> per-NUMA CMA
>>
>>
>>
>>> -Original Message-
>>> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
>>> Sent: Saturday, August 22, 2020 5:53 AM
>>> To: Song Bao Hua (Barry Song) ; h...@lst.de;
>>> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
>>> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
>>> a...@linux-foundation.org
>>> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
>>> linux-ker...@vger.kernel.org; Zengtao (B) ;
>>> huangdaode ; Linuxarm
>> 
>>> Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
>>> per-NUMA CMA
>>>
>>> Hi Barry,
>>> Sorry for jumping in so late.
>>>
>>> On 8/21/20 4:33 AM, Barry Song wrote:

 with per-numa CMA, smmu will get memory from local numa node to save
>>> command
 queues and page tables. that means dma_unmap latency will be shrunk
>>> much.
>>>
>>> Since per-node CMA areas for hugetlb was introduced, I have been thinking
>>> about the limited number of CMA areas.  In most configurations, I believe
>>> it is limited to 7.  And, IIRC it is not something that can be changed at
>>> runtime, you need to reconfig and rebuild to increase the number.  In
>> contrast
>>> some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
>>> the limited hugetlb use case.  However, this series is adding another user
>>> of per-node CMA areas.
>>>
>>> With more users, should try to sync up number of CMA areas and number of
>>> nodes?  Or, perhaps I am worrying about nothing?
>>
>> Hi Mike,
>> The current limitation is 8. If the server has 4 nodes and we enable both
>> pernuma
>> CMA and hugetlb, the last node will fail to get one cma area as the default
>> global cma area will take 1 of 8. So users need to change menuconfig.
>> If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one
>> node
>> will fail to get cma.
>>
>> We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb
>> enabled) +
>> MAX_NODES(if pernuma cma enabled) if we don't expect users to change
>> config, but
>> right now hugetlb has not an option in Kconfig to enable or disable like
>> pernuma cma
>> has DMA_PERNUMA_CMA.
> 
> I would prefer we make some changes like:
> 
> config CMA_AREAS
>   int "Maximum count of the CMA areas"
>   depends on CMA
> + default 19 if NUMA
>   default 7
>   help
> CMA allows to create CMA areas for particular purpose, mainly,
> used as device private area. This parameter sets the maximum
> number of CMA area in the system.
> 
> -   If unsure, leave the default value "7".
> +   If unsure, leave the default value "7" or "19" if NUMA is used.
> 
> 1+ CONFIG_CMA_AREAS should be quite enough for almost all servers in the 
> markets.
> 
> If 2 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*2 
> + 1 = 5
> If 4 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*4 
> + 1 = 9-> default ARM64 config.
> If 8 numa nodes, and both hugetlb cma and pernuma cma is enabled, we need 2*8 
> + 1 = 17
> 
> The default value is supporting the most common case and is not going to 
> support those servers
> with NODES_SHIFT=10, they can make their own config just like users need to 
> increase CMA_AREAS
> if they add many cma areas in device tree in a system even without NUMA.
> 
> How do you think, mike?

I'm OK with that.  I really did not want to sidetrack this series.  It is
just something I thought about when looking at the hugetlb code.  My 'to do'
list includes looking at a way to make the number of CMA areas dynamic.
-- 
Mike Kravetz
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Jason Gunthorpe
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +  struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +  struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +  u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +  iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
> 
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, eg the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update is a
command/response operation not a MMIO operation.

The HW is already swapping the queues causing the interrupts to RAM,
so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense, a 'working set' for the NIC device in some cases
can be hundreds of megabytes of data. System RAM is used to store
this, and precious on-die memory holds some dynamic active set, much
like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.
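
To make that model concrete, here is a rough sketch of what such a mask
callback might look like for a RAM-backed vector store. Everything except the
core kernel APIs (irq_data_get_irq_chip_data(), schedule_work(), the jiffies
helpers) is a made-up placeholder for the driver's own command-queue
interface, and a real implementation would also have to respect the locking
constraints discussed later in this thread:

/*
 * Hypothetical sketch only: ims_ctx, ims_cmd_post() and ims_cmd_done()
 * stand in for whatever command-queue API the real driver provides.
 */
struct ims_ctx {
	/* ... command queue handle, vector table in RAM, ... */
	struct work_struct reset_work;
};

static void ims_queue_mask_irq(struct irq_data *data)
{
	struct ims_ctx *ctx = irq_data_get_irq_chip_data(data);
	unsigned long timeout = jiffies + msecs_to_jiffies(100);

	/* Ask the device to fetch the updated (masked) vector state */
	ims_cmd_post(ctx, IMS_CMD_MASK_VECTOR, data->hwirq);

	/* Synchronous case: wait until the device acknowledges the command */
	while (!ims_cmd_done(ctx)) {
		if (time_after(jiffies, timeout)) {
			/* device looks dead: schedule a full function reset */
			schedule_work(&ctx->reset_work);
			return;
		}
		cpu_relax();
	}
}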

> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
> 
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need I
assumed the idxd HW had some hidden swapping to RAM. 

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO triggered swapping work that are not
available to a true PCI-E device.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Thomas Gleixner
On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
>> So if I understand correctly then the queue memory where the MSI
>> descriptor sits is in RAM.
>
> Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> could have enough on-die memory then they could just use really big
> MSI-X tables. Currently due to on-die memory constraints mlx5 is
> limited to a few hundred MSI-X vectors.

Right, that's the limit of a particular device, but nothing prevents you
from having a larger table on a new device.

The MSI-X limitation to 2048 is defined by the PCI spec and you'd need
either some non-spec-compliant abuse of the reserved size bits or some
extra config entry. So IMS is a way to work around that. I understand
why you want to move the vectors to main memory, but you have to deal
with the problems this creates upfront.

> Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
> RAM.
>
> Moving away from MSI-X's MMIO access model allows them to be swapped
> to RAM. The cost is that accessing them for update is a
> command/response operation not a MMIO operation.
>
> The HW is already swapping the queues causing the interrupts to RAM,
> so adding a bit of additional data to store the MSI addr/data is
> reasonable.

Makes sense.

>> How is that supposed to work if interrupt remapping is disabled?
>
> The best we can do is issue a command to the device and spin/sleep
> until completion. The device will serialize everything internally.
>
> If the device has died the driver has code to detect and trigger a
> PCI function reset which will definitely stop the interrupt.

If that interrupt has gone into storm mode for some reason then this will
render your machine unusable before you can do that.

> So, the implementation of these functions would be to push any change
> onto a command queue, trigger the device to DMA the command, spin/sleep
> until the device returns a response and then continue on. If the
> device doesn't return a response in a time window then trigger a WQ to
> do a full device reset.

I really don't want to do that with the irq descriptor lock held or in
case of affinity from the interrupt handler as we have to do with PCI
MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck.
Also you cannot call into command queue code from interrupt disabled and
interrupt descriptor lock held sections. You can try, but lockdep will
yell at you immediately. 

There is also CPU hotplug where we have to force migrate an interrupt
away from an outgoing CPU. This needs some serious thought.

One question is whether the device can see partial updates to that
memory due to the async 'swap' of context from the device CPU.

So we have address_lo, address_hi, data and ctrl. Each of them 32 bit.

address_hi is only relevant when the number of CPUs is > 255 which
requires X2APIC which in turn requires interrupt remapping. For all
others the address_hi value never changes. Let's ignore that case for
now, but see further down.

So what's interesting is address_lo and data. If the device sees a
partial update, i.e. address_lo is written and the device grabs the
update before data is written then the next interrupt will end up in
lala land. We have code for that in place in msi_set_affinity() in
arch/x86/kernel/apic/msi.c. Get eyecancer protection glasses before
opening that and keep beer ready to wipe out the horrors immediately
afterwards.

If the device updates the data only when a command is issued then this
is not a problem, but it causes other problems because you still cannot
access the command queue from that context. This makes it even worse for
the CPU hotplug case. But see all of the reasoning on that.

If it takes whatever it sees while grabbing the state when switching to
a different queue or at the point of actual interrupt delivery, then you
have a problem. Not only you, I'm going to have one as well because I'm
going to be the poor sod to come up with the workaround.

So we better address that _before_ you start deploying this piece of
art. I'm not really interested in another slightly different and probably
more horrible version of the same story. Don't blame me, it's the way
how Intel decided to make this "work".

There are a couple of options to ensure that the device will never see
inconsistent state:

  1) Use a locked 16 byte wide operation (cmpxchg16) which is not
 available on 32bit

  2) Order the MSG entry differently in the queue storage:

 u32 address_lo
 u32 data
 u32 address_hi
 u32 ctrl

 And then enforce an 8 byte store on 64 bit which is guaranteed
 to be atomic vs. other CPUs and bus agents, i.e. DMA.

 I said enforce because compilers are known to do stupid things.

Both are fine for me and the only caveat is that the access does not go
across a cache line boundary. The restriction to 64bit shouldn't be a
problem either. 
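
A minimal sketch of variant 2, assuming the driver owns the layout of the
queue context entry and runs on little-endian x86, so the low half of the
64-bit store lands on address_lo; struct ims_queue_msg and the function name
are illustrative only:

struct ims_queue_msg {
	u32	address_lo;
	u32	data;
	u32	address_hi;
	u32	ctrl;
} __aligned(8);

static void ims_queue_write_msg(struct ims_queue_msg *dst, struct msi_msg *src)
{
	u64 lo_and_data = ((u64)src->data << 32) | src->address_lo;

	/*
	 * One aligned 8-byte store: the device can never observe a new
	 * address_lo paired with stale data, or vice versa.
	 */
	WRITE_ONCE(*(u64 *)dst, lo_and_data);
	/* address_hi only changes with X2APIC/interrupt remapping anyway */
	WRITE_ONCE(dst->address_hi, src->address_hi);
}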

Re: [PATCH v4 07/15] iommu/vt-d: Delegate the dma domain to upper layer

2020-08-21 Thread Chris Wilson
Quoting Lu Baolu (2019-05-25 06:41:28)
> This allows the iommu generic layer to allocate a dma domain and
> attach it to a device through the iommu api's. With all types of
> domains being delegated to upper layer, we can remove an internal
> flag which was used to distinguish domains mananged internally or
> externally.

I'm seeing some really strange behaviour with this patch on a 32b
Skylake system (and still present on mainline). Before this patch
everything is peaceful and appears to work correctly. After applying this patch,
we fail to initialise the GPU with a few DMAR errors reported, e.g.

[   20.279445] DMAR: DRHD: handling fault status reg 3
[   20.279508] DMAR: [DMA Read] Request device [00:02.0] fault addr 8900a000 
[fault reason 05] PTE Write access is not set

Setting an identity map for the igfx made the DMAR errors disappear, but
the GPU still failed to initialise.

There's no difference in the DMAR configuration dmesg between working and
the upset patch. And the really strange part is that switching to a 64b
kernel with this patch, it's working.

Any suggestions on what I should look for?
-Chris
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> Sent: Saturday, August 22, 2020 5:53 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm 
> Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> Hi Barry,
> Sorry for jumping in so late.
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> >
> > with per-numa CMA, smmu will get memory from local numa node to save
> command
> > queues and page tables. that means dma_unmap latency will be shrunk
> much.
> 
> Since per-node CMA areas for hugetlb was introduced, I have been thinking
> about the limited number of CMA areas.  In most configurations, I believe
> it is limited to 7.  And, IIRC it is not something that can be changed at
> runtime, you need to reconfig and rebuild to increase the number.  In contrast
> some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> the limited hugetlb use case.  However, this series is adding another user
> of per-node CMA areas.
> 
> With more users, should try to sync up number of CMA areas and number of
> nodes?  Or, perhaps I am worrying about nothing?

Hi Mike,
The current limitation is 8. If the server has 4 nodes and we enable both
per-NUMA CMA and hugetlb, the last node will fail to get a CMA area as the
default global CMA area will take 1 of the 8. So users need to change menuconfig.
If the server has 8 nodes and we enable either per-NUMA CMA or hugetlb, one node
will fail to get a CMA area.

We may set the default number of CMA areas as 8 + MAX_NODES (if hugetlb is
enabled) + MAX_NODES (if per-NUMA CMA is enabled) if we don't expect users to
change the config, but right now hugetlb does not have an option in Kconfig to
enable or disable it the way per-NUMA CMA has DMA_PERNUMA_CMA.

> --
> Mike Kravetz

Thanks
Barry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Saturday, August 22, 2020 7:27 AM
> To: 'Mike Kravetz' ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm 
> Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> 
> 
> > -Original Message-
> > From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> > Sent: Saturday, August 22, 2020 5:53 AM
> > To: Song Bao Hua (Barry Song) ; h...@lst.de;
> > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > a...@linux-foundation.org
> > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> > linux-ker...@vger.kernel.org; Zengtao (B) ;
> > huangdaode ; Linuxarm
> 
> > Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> > per-NUMA CMA
> >
> > Hi Barry,
> > Sorry for jumping in so late.
> >
> > On 8/21/20 4:33 AM, Barry Song wrote:
> > >
> > > with per-numa CMA, smmu will get memory from local numa node to save
> > command
> > > queues and page tables. that means dma_unmap latency will be shrunk
> > much.
> >
> > Since per-node CMA areas for hugetlb was introduced, I have been thinking
> > about the limited number of CMA areas.  In most configurations, I believe
> > it is limited to 7.  And, IIRC it is not something that can be changed at
> > runtime, you need to reconfig and rebuild to increase the number.  In
> contrast
> > some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> > the limited hugetlb use case.  However, this series is adding another user
> > of per-node CMA areas.
> >
> > With more users, should try to sync up number of CMA areas and number of
> > nodes?  Or, perhaps I am worrying about nothing?
> 
> Hi Mike,
> The current limitation is 8. If the server has 4 nodes and we enable both
> pernuma
> CMA and hugetlb, the last node will fail to get one cma area as the default
> global cma area will take 1 of 8. So users need to change menuconfig.
> If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one
> node
> will fail to get cma.
> 
> We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb
> enabled) +
> MAX_NODES(if pernuma cma enabled) if we don't expect users to change
> config, but
> right now hugetlb has not an option in Kconfig to enable or disable like
> pernuma cma
> has DMA_PERNUMA_CMA.

I would prefer we make some changes like:

config CMA_AREAS
int "Maximum count of the CMA areas"
depends on CMA
+   default 19 if NUMA
default 7
help
  CMA allows to create CMA areas for particular purpose, mainly,
  used as device private area. This parameter sets the maximum
  number of CMA area in the system.

- If unsure, leave the default value "7".
+ If unsure, leave the default value "7" or "19" if NUMA is used.

1 + CONFIG_CMA_AREAS should be quite enough for almost all servers on the
market.

If 2 NUMA nodes, and both hugetlb CMA and per-NUMA CMA are enabled, we need 2*2 + 1 = 5.
If 4 NUMA nodes, and both hugetlb CMA and per-NUMA CMA are enabled, we need 2*4 + 1 = 9 -> the default ARM64 config.
If 8 NUMA nodes, and both hugetlb CMA and per-NUMA CMA are enabled, we need 2*8 + 1 = 17.

The default value supports the most common case and is not going to cover
those servers with NODES_SHIFT=10; they can make their own config, just as
users need to increase CMA_AREAS if they add many CMA areas in the device tree
of a system even without NUMA.

What do you think, Mike?

Thanks
Barry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Rapoport [mailto:r...@linux.ibm.com]
> Sent: Saturday, August 22, 2020 2:28 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; a...@linux-foundation.org;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm ;
> Jonathan Cameron ; Nicolas Saenz Julienne
> ; Steve Capper 
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
> > coherent DMA buffers to save their command queues and page tables. As
> > there is only one default CMA in the whole system, SMMUs on nodes other
> > than node0 will get remote memory. This leads to significant latency.
> >
> > This patch provides per-numa CMA so that drivers like SMMU can get local
> > memory. Tests show localizing CMA can decrease dma_unmap latency much.
> > For instance, before this patch, SMMU on node2  has to wait for more than
> > 560ns for the completion of CMD_SYNC in an empty command queue; with
> this
> > patch, it needs 240ns only.
> >
> > A positive side effect of this patch would be improving performance even
> > further for those users who are worried about performance more than DMA
> > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
> > drivers can get local coherent DMA buffers.
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >  -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  11 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> 
> Maybe cma_pernuma or cma_pernode?

Sounds good.

> 
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. And If this option is not
> > +   specificed, the default value is 0.
> > +   With per-numa CMA enabled, DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> > diff --git a/include/linux/dma-contiguous.h
> b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..fe55e004f1f4 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct
> device *dev, struct page *page,
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void dma_pernuma_cma_reserve(void);
> > +#else
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +#endif
> > +
> >  #endif
> >
> >  #endif
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..c38979d45b13 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,17 @@ config DMA_CMA
> >   If unsure, say "n".
> >
> >  if  DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +   bool "Enable separate DMA Contiguous Memory Area for each NUMA
> Node"
> > +   default NUMA && ARM64
> > +   help
> > + Enable this option to get pernuma CMA areas so that devices like
> > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > + You can set the size of pernuma CMA by specifying
> "pernuma_cma=size"
> > + on the kernel's command 

Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Thomas Gleixner
On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
>> +static void ims_mask_irq(struct irq_data *data)
>> +{
>> +struct msi_desc *desc = irq_data_get_msi_desc(data);
>> +struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
>> +u32 __iomem *ctrl = &slot->ctrl;
>> +
>> +iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
>
> Just to be clear, this is exactly the sort of operation we can't do
> with non-MSI interrupts. For a real PCI device to execute this it
> would have to keep the data on die.

"We" means NVIDIA and your new device, right?

So if I understand correctly then the queue memory where the MSI
descriptor sits is in RAM.

How is that supposed to work if interrupt remapping is disabled?

That means irq migration and proper disabling of an interrupt become an
interesting exercise. I'm so not looking forward to that.

If interrupt remapping is enabled then both are trivial because then the
irq chip can delegate everything to the parent chip, i.e. the remapping
unit.

Can you please explain that a bit more precisely?

> I saw the idxd driver was doing something like this, I assume it
> avoids trouble because it is a fake PCI device integrated with the
> CPU, not on a real PCI bus?

That's how it is implemented as far as I understood the patches. It's
device memory therefore iowrite32().

> It is really nice to see irq_domain used properly in x86!

If you ignore the abuse in XEN :)

And to be fair, proper and usable (hierarchical) irq domains originate
from x86 and happened to solve quite a few horrorshows on the ARM side.

Just back then, when we converted the original maze, nobody had a good
idea or the stomach to touch XEN.

Thanks,

tglx


RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Randy Dunlap [mailto:rdun...@infradead.org]
> Sent: Saturday, August 22, 2020 4:08 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm ;
> Jonathan Cameron ; Nicolas Saenz Julienne
> ; Steve Capper ; Mike
> Rapoport 
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> > ---
> >  -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  11 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. And If this option is not
> > +   specificed, the default value is 0.
> > +   With per-numa CMA enabled, DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> 
> Entries in kernel-parameters.txt are supposed to be in alphabetical order
> but this one is not.  If you want to keep it near the cma= entry, you can
> rename it like Mike suggested.  Otherwise it needs to be moved.

As I've replied in Mike's comment, I'd like to rename it to cma_per...

> 
> 
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> 
> 
> 
> --
> ~Randy

Thanks
Barry



Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Thomas Gleixner
Jason,

On Fri, Aug 21 2020 at 21:51, Jason Gunthorpe wrote:
> On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
>> > If the device has died the driver has code to detect and trigger a
>> > PCI function reset which will definitely stop the interrupt.
>> 
>> If that interrupt is gone into storm mode for some reason then this will
>> render your machine unusable before you can do that.
>
> Yes, but in general the HW design is to have one-shot interrupts, it
> would have to be well off the rails to storm. The kind of off the
> rails where it could also be doing crazy stuff on PCI-E that would be
> very harmful.

Yeah. One shot should prevent most of the wreckage. I just wanted to
spell it out.

>> One question is whether the device can see partial updates to that
>> memory due to the async 'swap' of context from the device CPU.
>
> It is worse than just partial updates.. The device operation is much
> more like you'd imagine a CPU cache. There could be copies of the RAM
> in the device for long periods of time, dirty data in the device that
> will flush back to CPU RAM overwriting CPU changes, etc.

TBH, that's insane. You clearly want to think about this some more. If
you swap out device state and device control state then you definitely
want to have regions which are read only from the device POV and never
written back. The MSI msg store clearly belongs in that category.
But that's not restricted to the MSI msg store, there is certainly other
stuff which never wants to be written back by the device.

If you don't do that then you simply can't write to that space from the
CPU and always have to transport this kind of information via command
queues.

But that does not make sense. It's trivial enough to have

| RO state |
| RW state |

and on swap in the whole thing is DMA'd into the device and on swap out
only the RW state part. It's not rocket science and makes a huge amount
of sense.
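
As a rough sketch (made-up field names, not any existing driver's layout)
that split could look like:

/* Context image in host RAM, swapped in/out by the device */
struct ctx_image {
	struct {		/* device read-only, never written back */
		u32	msi_addr_lo;
		u32	msi_addr_hi;
		u32	msi_data;
	} ro;
	struct {		/* device read-write, DMA'd back on swap out */
		u64	head;
		u64	tail;
	} rw;
};

On swap in the device fetches the whole image, on swap out it writes back
only the rw part, so the CPU stays free to update the MSI message in the
ro part at any time.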

> Without involving the device there is just no way to create data
> consistency, and no way to change the data from the CPU. 
>
> This is the down side of having device data in the RAM. It cannot be
> so simple as 'just fetch it every time before you use it' as
> performance would be horrible.

That's clear, but with a proper separation like the above and some extra
mechanism which allows you to tickle a reload of the 'RO state' you can
avoid quite a few of the problems which you create otherwise.

>> If we really can get away with atomically updating the message as
>> outlined above and just let it happen at some point in the future then
>> most problems are solved, except for the nastyness of CPU hotplug.
>
> Since we can't avoid a device command, I'm thinking more along the lines
> of having the affinity update trigger an async WQ to issue the command
> from a thread context. Since it doesn't need to be synchronous it can
> make it out 'eventually'.
>
> I suppose the core code could provide this as a service? Sort of a
> variant of the other lazy things above?

Kinda. That needs a lot of thought for the affinity setting stuff
because it can be called from contexts which do not allow that. It's
solvable though, but I clearly need to stare at the corner cases for a
while.

> But it would have to work with ARM - is remapping a x86 only thing?

No. ARM64 has that as well.

> Does ARM put the affinity in the GIC tables not in the MSI data?

IIRC, yes.

>> Let me summarize what I think would be the sane solution for this:
>> 
>>   1) Utilize atomic writes for either all 16 bytes or reorder the bytes
>>  and update 8 bytes atomically which is sufficient as the wide
>>  address is only used with irq remapping and the MSI message in the
>>  device is never changed after startup.
>
> Sadly not something the device can manage due to data coherence

I disagree :)
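
To make the layout idea in 1) concrete, a sketch only (little-endian,
made-up names, not the idxd layout): if the wide address and the control
word never change after startup, the two fields which do change can share
one naturally aligned 64-bit word:

struct msg_slot {
	u32	addr_hi;			/* fixed after startup */
	u32	ctrl;				/* fixed after startup */
	union {
		struct {
			u32	addr_lo;
			u32	data;
		};
		u64	addr_lo_and_data;
	};
};

static void msg_slot_update(struct msg_slot *slot, u32 addr_lo, u32 data)
{
	/* One aligned 64-bit store flips target CPU and vector together */
	WRITE_ONCE(slot->addr_lo_and_data, (u64)data << 32 | addr_lo);
}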

>>   2) No requirement for issuing a command for regular migration
>>  operations as they have no requirements to be synchronous.
>> 
>>  Eventually store some state to force a reload on the next regular
>>  queue operation.
>
> Would the async version above be OK?

Async is fine in any variant (except for hotplug). Though having an
async WQ or whatever there needs some thought.

>>   3) No requirement for issuing a command for mask and unmask operations.
>>  The core code uses and handles lazy masking already. So if the
>>  hardware causes the lazyness, so be it.
>
> This lazy masking thing sounds good, I'm totally unfamiliar with it
> though.

It's used to avoid irq chip (often MMIO) access in scenarios where
disable/enable of an interrupt line happens with high frequency. Serial
has that issue. So we mark it disabled, but do not mask it, and the core
handles that by masking it once an interrupt comes in while it is in
disabled state. That obviously does not work out of the box to protect
against the not-disabled-but-masked state, but conceptually it's a
similar problem and can be made to work without massive changes.
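
A minimal sketch of the idea (made-up names, not the real genirq code):

struct demo_irq_desc {
	bool disabled;		/* logically disabled by the driver */
	bool masked;		/* actually masked in the irq chip */
	bool pending;		/* interrupt arrived while disabled */
	void (*hw_mask)(struct demo_irq_desc *desc);
	void (*handler)(void *data);
	void *data;
};

static void demo_disable_irq(struct demo_irq_desc *desc)
{
	desc->disabled = true;	/* no chip access here - that's the point */
}

static void demo_handle_irq(struct demo_irq_desc *desc)
{
	if (desc->disabled) {
		/* Only now touch the (possibly slow) irq chip */
		desc->hw_mask(desc);
		desc->masked = true;
		desc->pending = true;
		return;
	}
	desc->handler(desc->data);
}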

OTOH, in normal 

Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Jason Gunthorpe
On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> >> So if I understand correctly then the queue memory where the MSI
> >> descriptor sits is in RAM.
> >
> > Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> > could have enough on-die memory then they could just use really big
> > MSI-X tables. Currently due to on-die memory constraints mlx5 is
> > limited to a few hundred MSI-X vectors.
> 
> Right, that's the limit of a particular device, but nothing prevents you
> from having a larger table on a new device.

Well, physics is a problem. The SRAM to store the MSI vectors costs
die space and making the chip die larger is not an option. So the
question is what do you throw out of the chip to get a 10-20x increase
in MSI SRAM?

This is why using host memory is so appealing. It is
economically/functionally better.

I'm going to guess other HW is in the same situation, virtualization
is really pushing up the number of required IRQs.

> >> How is that supposed to work if interrupt remapping is disabled?
> >
> > The best we can do is issue a command to the device and spin/sleep
> > until completion. The device will serialize everything internally.
> >
> > If the device has died the driver has code to detect and trigger a
> > PCI function reset which will definitely stop the interrupt.
> 
> If that interrupt is gone into storm mode for some reason then this will
> render your machine unusable before you can do that.

Yes, but in general the HW design is to have one-shot interrupts, it
would have to be well off the rails to storm. The kind of off the
rails where it could also be doing crazy stuff on PCI-E that would be
very harmful.

> > So, the implementation of these functions would be to push any change
> > onto a command queue, trigger the device to DMA the command, spin/sleep
> > until the device returns a response and then continue on. If the
> > device doesn't return a response in a time window then trigger a WQ to
> > do a full device reset.
> 
> I really don't want to do that with the irq descriptor lock held or in
> case of affinity from the interrupt handler as we have to do with PCI
> MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck.
> Also you cannot call into command queue code from interrupt disabled and
> interrupt descriptor lock held sections. You can try, but lockdep will
> yell at you immediately.

Yes, I wouldn't want to do this from an IRQ.

> One question is whether the device can see partial updates to that
> memory due to the async 'swap' of context from the device CPU.

It is worse than just partial updates.. The device operation is much
more like you'd imagine a CPU cache. There could be copies of the RAM
in the device for long periods of time, dirty data in the device that
will flush back to CPU RAM overwriting CPU changes, etc.

Without involving the device there is just no way to create data
consistency, and no way to change the data from the CPU. 

This is the down side of having device data in the RAM. It cannot be
so simple as 'just fetch it every time before you use it' as
performance would be horrible.

> irq chips have already a mechanism in place to deal with stuff which
> cannot be handled from within the irq descriptor spinlock held and
> interrupt disabled section.
> 
> The mechanism was invented to deal with interrupt chips which are
> connected to i2c, spi, etc.. The access to an interrupt chip control
> register has to queue stuff on the bus and wait for completion.
> Obviously not what you can do from interrupt disabled, raw spinlock held
> context either.

Ah interesting, sounds like the right parts! I didn't know about this.

> Now coming back to affinity setting. I'd love to avoid adding the bus
> lock magic to those interfaces because until now they can be called and
> are called from atomic contexts. And obviously none of the devices which
> use the buslock magic support affinity setting because they all deliver
> a single interrupt to a demultiplex interrupt and that one is usually
> sitting at the CPU level where interrupt steering works.
> 
> If we really can get away with atomically updating the message as
> outlined above and just let it happen at some point in the future then
> most problems are solved, except for the nastyness of CPU hotplug.

Since we can't avoid a device command, I'm thinking more along the lines
of having the affinity update trigger an async WQ to issue the command
from a thread context. Since it doesn't need to be synchronous it can
make it out 'eventually'.

I suppose the core code could provide this as a service? Sort of a
variant of the other lazy things above?
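
Roughly along these lines (a sketch with made-up names;
my_dev_sync_msi_msg() stands in for posting the command and waiting for
completion):

#include <linux/workqueue.h>
#include <linux/slab.h>

struct my_dev;
void my_dev_sync_msi_msg(struct my_dev *mdev, unsigned int vector);

struct msg_sync_work {
	struct work_struct work;
	struct my_dev *mdev;
	unsigned int vector;		/* which slot to resync */
};

static void msg_sync_work_fn(struct work_struct *work)
{
	struct msg_sync_work *sw = container_of(work, struct msg_sync_work, work);

	/* Thread context: safe to post a device command and sleep */
	my_dev_sync_msi_msg(sw->mdev, sw->vector);
	kfree(sw);
}

/* Called from the affinity update path: the message in RAM has already
 * been rewritten, this defers the command which makes it visible. */
static void queue_msg_sync(struct my_dev *mdev, unsigned int vector)
{
	struct msg_sync_work *sw = kzalloc(sizeof(*sw), GFP_ATOMIC);

	if (!sw)
		return;
	INIT_WORK(&sw->work, msg_sync_work_fn);
	sw->mdev = mdev;
	sw->vector = vector;
	schedule_work(&sw->work);
}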

> But that's actually a non issue. Nothing prevents us from having an
> early 'migrate interrupts away from the outgoing CPU hotplug state'
> which runs in thread context and can therefore utilize the 

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Randy Dunlap
> Sent: Friday, August 21, 2020 2:50 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On 8/20/20 7:26 PM, Barry Song wrote:
> >
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >  v6: rebase on top of 5.9-rc1;
> >  doc cleanup
> >
> >  .../admin-guide/kernel-parameters.txt |   9 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  10 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 115 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..3f33b89aeab5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,15 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> 
> memparse() allows any one of these suffixes: K, M, G, T, P, E
> and nothing in the option parsing function cares what suffix is used...

Hello Randy,
Thanks for your comments.

Actually I am following the suffix convention of the default cma parameter:
cma=nn[MG]@[start[MG][-end[MG]]]
[ARM,X86,KNL]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
include/linux/dma-contiguous.h

I suggest users set the size in either MB or GB, as they do for cma.
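
For reference, memparse() is what does the conversion, e.g.:

char *end;
unsigned long long bytes;

bytes = memparse("512M", &end);	/* 512 << 20 */
bytes = memparse("2G", &end);	/* 2ULL << 30; K/T/P/E work the same way */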

> 
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> 
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..89b95f10e56d 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> 
> why phys_addr_t? couldn't it just be unsigned long long?
> 

Mainly to follow the existing convention in kernel/dma/contiguous.c:
I think the original code probably meant the size should not be larger
than the maximum value of phys_addr_t:

/*
 * Default global CMA area size can be defined in kernel's .config.
 * This is useful mainly for distro maintainers to create a kernel
 * that works correctly for most supported systems.
 * The size can be set in bytes or as a percentage of the total memory
 * in the system.
 *
 * Users, who want to set the size of global CMA area for their system
 * should use cma= kernel parameter.
 */
static const phys_addr_t size_bytes __initconst =
(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
static phys_addr_t  size_cmdline __initdata = -1;
static phys_addr_t base_cmdline __initdata;
static phys_addr_t limit_cmdline __initdata;

void __init dma_contiguous_reserve(phys_addr_t limit)
{
phys_addr_t selected_size = 0;
phys_addr_t selected_base = 0;
phys_addr_t selected_limit = limit;
bool fixed = false;

pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);


Re: [PATCH RESEND v10 07/11] device-mapping: Introduce DMA range map, supplanting dma_pfn_offset

2020-08-21 Thread Andy Shevchenko
On Thu, Aug 20, 2020 at 09:37:12AM -0400, Jim Quinlan wrote:
> On Tue, Aug 18, 2020 at 4:14 AM Andy Shevchenko
>  wrote:
> > On Mon, Aug 17, 2020 at 05:53:09PM -0400, Jim Quinlan wrote:

...

> > > +static inline u64 dma_offset_from_dma_addr(struct device *dev, 
> > > dma_addr_t dma_addr)
> > > +{
> > > + const struct bus_dma_region *m = dev->dma_range_map;
> > > +
> > > + if (!m)
> > > + return 0;
> > > + for (; m->size; m++)
> > > + if (dma_addr >= m->dma_start && dma_addr - m->dma_start < 
> > > m->size)
> > > + return m->offset;
> > > + return 0;
> > > +}
> > > +
> > > +static inline u64 dma_offset_from_phys_addr(struct device *dev, 
> > > phys_addr_t paddr)
> > > +{
> > > + const struct bus_dma_region *m = dev->dma_range_map;
> > > +
> > > + if (!m)
> > > + return 0;
> > > + for (; m->size; m++)
> > > + if (paddr >= m->cpu_start && paddr - m->cpu_start < m->size)
> > > + return m->offset;
> > > + return 0;
> > > +}
> >
> > Perhaps for these the form with one return 0 is easier to read
> >
> > if (m) {
> > for (; m->size; m++)
> > if (paddr >= m->cpu_start && paddr - m->cpu_start < 
> > m->size)
> > return m->offset;
> > }
> > return 0;
> >
> > ?
> I see what you are saying but I don't think there is enough difference
> between the two to justify changing it.

The difference is that you end up with one 'return 0' case and one non-zero
case. I think it's slightly easier to read and understand, but it's up to you.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH v2 1/3] iommu/virtio: Add topology description to virtio-iommu config space

2020-08-21 Thread Jean-Philippe Brucker
Hi,

While preparing the next version I noticed I forgot to send this reply.
Better late than never I suppose...

On Tue, Apr 21, 2020 at 07:31:12AM +, Tian, Kevin wrote:
> > From: Jean-Philippe Brucker
> > Sent: Saturday, February 29, 2020 1:26 AM
> > 
> > Platforms without device-tree do not currently have a method for
> > describing the vIOMMU topology. Provide a topology description embedded
> > into the virtio device.
[...]
> > diff --git a/drivers/iommu/virtio-iommu-topology.c b/drivers/iommu/virtio-
> > iommu-topology.c
> > new file mode 100644
> > index ..2188624ef216
> > --- /dev/null
> > +++ b/drivers/iommu/virtio-iommu-topology.c
> > @@ -0,0 +1,343 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct viommu_cap_config {
> > +   u8 bar;
> > +   u32 length; /* structure size */
> > +   u32 offset; /* structure offset within the bar */
> > +};
> > +
> > +union viommu_topo_cfg {
> > +   __le16  type;
> > +   struct virtio_iommu_topo_pci_range  pci;
> > +   struct virtio_iommu_topo_endpoint   ep;
> > +};
> > +
> > +struct viommu_spec {
> > +   struct device   *dev; /* transport device */
> > +   struct fwnode_handle*fwnode;
> > +   struct iommu_ops*ops;
> > +   struct list_headlist;
> > +   size_t  num_items;
> 
> Intel DMAR allows an IOMMU to claim INCLUDE_ALL thus avoid listing
> every endpoint one-by-one. It is especially useful when there is only
> one IOMMU device in the system. Do you think whether making sense
> to allow such optimization in this spec?

The DMAR INCLUDE_PCI_ALL is for a single PCI domain, so I think is
equivalent to having a single virtio_iommu_topo_pci_range structure with
start=0 and end=0xffff. That only takes 16 bytes of config space and is
pretty easy to parse, so a special case doesn't seem necessary to me.

If more than one PCI domain is managed by the IOMMU, then INCLUDE_ALL
isn't sufficient since we need to describe how endpoint IDs are associated
to domain:RID (one of the domains would have its endpoint IDs = RID +
0x1 for example). Furthermore non-PCI devices don't have an implicit
endpoint ID like the RID.

Thanks,
Jean

> It doesn't work for ARM since
> you need ID mapping to find the MSI doorbell. But for architectures
> where only topology info is required, it makes the enumeration process
> much simpler.
> 
> Thanks
> Kevin


Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Will Deacon
On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index bdc1f33fd3d1..3f33b89aeab5 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -599,6 +599,15 @@
>   altogether. For more information, see
>   include/linux/dma-contiguous.h
>  
> + pernuma_cma=nn[MG]
> + [ARM64,KNL]
> + Sets the size of kernel per-numa memory area for
> + contiguous memory allocations. A value of 0 disables
> + per-numa CMA altogether. DMA users on node nid will
> + first try to allocate buffer from the pernuma area
> + which is located in node nid, if the allocation fails,
> + they will fallback to the global default memory area.

What is the default behaviour if this option is not specified? Seems like
that should be mentioned here.

> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 847a9d1fa634..db7a37ed35eb 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -118,6 +118,16 @@ config DMA_CMA
> If unsure, say "n".
>  
>  if  DMA_CMA
> +
> +config DMA_PERNUMA_CMA
> + bool "Enable separate DMA Contiguous Memory Area for each NUMA Node"

I don't understand the need for this config option. If you have DMA_CMA and
you have NUMA, why wouldn't you want this enabled?

> + help
> +   Enable this option to get pernuma CMA areas so that devices like
> +   ARM64 SMMU can get local memory by DMA coherent APIs.
> +
> +   You can set the size of pernuma CMA by specifying "pernuma_cma=size"
> +   on the kernel's command line.
> +
>  comment "Default contiguous memory area size:"
>  
>  config CMA_SIZE_MBYTES
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> index cff7e60968b9..89b95f10e56d 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
>  }
>  early_param("cma", early_cma);
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +
> +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> +static phys_addr_t pernuma_size_bytes __initdata;
> +
> +static int __init early_pernuma_cma(char *p)
> +{
> + pernuma_size_bytes = memparse(p, &p);
> + return 0;
> +}
> +early_param("pernuma_cma", early_pernuma_cma);
> +#endif
> +
>  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
>  
>  static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
> @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t 
> cma_early_percent_memory(void)
>  
>  #endif
>  
> +#ifdef CONFIG_DMA_PERNUMA_CMA
> +void __init dma_pernuma_cma_reserve(void)
> +{
> + int nid;
> +
> + if (!pernuma_size_bytes)
> + return;

If this is useful (I assume it is), then I think we should have a non-zero
default value, a bit like normal CMA does via CMA_SIZE_MBYTES.

> + for_each_node_state(nid, N_ONLINE) {

for_each_online_node() {

> + int ret;
> + char name[20];

20?

Ah, wait, this is copy-pasta from hugetlb_cma_reserve(). Can you factor out
the common parts at all?
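
For reference, the shape such a per-node loop usually takes, reusing
cma_declare_contiguous_nid() the same way hugetlb_cma_reserve() does
(a sketch, not the exact hunk from the patch):

void __init dma_pernuma_cma_reserve(void)
{
	int nid;

	if (!pernuma_size_bytes)
		return;

	for_each_online_node(nid) {
		int ret;
		char name[20];
		struct cma **cma = &dma_contiguous_pernuma_area[nid];

		snprintf(name, sizeof(name), "pernuma%d", nid);
		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
						 0, false, name, cma, nid);
		if (ret)
			pr_warn("%s: reservation failed: err %d, node %d\n",
				__func__, ret, nid);
	}
}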

Will


Re: [PATCH v6 2/2] arm64: mm: reserve per-numa CMA to localize coherent dma buffers

2020-08-21 Thread Will Deacon
On Fri, Aug 21, 2020 at 02:26:15PM +1200, Barry Song wrote:
> Right now, smmu is using dma_alloc_coherent() to get memory to save queues
> and tables. Typically, on ARM64 server, there is a default CMA located at
> node0, which could be far away from node2, node3 etc.
> with this patch, smmu will get memory from local numa node to save command
> queues and page tables. that means dma_unmap latency will be shrunk much.
> Meanwhile, when iommu.passthrough is on, device drivers which call dma_
> alloc_coherent() will also get local memory and avoid the travel between
> numa nodes.
> 
> Cc: Christoph Hellwig 
> Cc: Marek Szyprowski 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Cc: Ganapatrao Kulkarni 
> Cc: Catalin Marinas 
> Cc: Nicolas Saenz Julienne 
> Cc: Steve Capper 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Signed-off-by: Barry Song 
> ---
>  -v6: rebase on top of 5.9-rc1
> 
>  arch/arm64/mm/init.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 481d22c32a2e..f1c75957ff3c 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -429,6 +429,8 @@ void __init bootmem_init(void)
>   arm64_hugetlb_cma_reserve();
>  #endif
>  
> + dma_pernuma_cma_reserve();

I think this will have to do for now, but I still wish that more of this was
driven from the core code so that we don't have to worry about
initialisation order and whether things are early/late enough on a per-arch
basis.

Acked-by: Will Deacon 

Will


Re: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Christoph Hellwig
FYI, as of the last one I'm fine now, but I really need an ACK from
the arm64 maintainers.


Re: [PATCH 07/18] iommu/arm-smmu: Remove arch/arm workaround

2020-08-21 Thread Will Deacon
On Thu, Aug 20, 2020 at 04:08:26PM +0100, Robin Murphy wrote:
> Now that arch/arm is wired up for default domains and iommu-dma, remove
> the add_device workaround.
> 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu.c | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
> b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index 09c42af9f31e..4e52d8cb67dd 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -1164,17 +1164,7 @@ static int arm_smmu_attach_dev(struct iommu_domain 
> *domain, struct device *dev)
>   return -ENXIO;
>   }
>  
> - /*
> -  * FIXME: The arch/arm DMA API code tries to attach devices to its own
> -  * domains between of_xlate() and probe_device() - we have no way to 
> cope
> -  * with that, so until ARM gets converted to rely on groups and default
> -  * domains, just say no (but more politely than by dereferencing NULL).
> -  * This should be at least a WARN_ON once that's sorted.
> -  */
>   cfg = dev_iommu_priv_get(dev);
> - if (!cfg)
> - return -ENODEV;
> -
>   smmu = cfg->smmu;
>  
>   ret = arm_smmu_rpm_get(smmu);

Acked-by: Will Deacon 

Will


Re: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Will Deacon
On Fri, Aug 21, 2020 at 08:19:18AM +0200, Christoph Hellwig wrote:
> FYI, as of the last one I'm fine now, but I really need an ACK from
> the arm64 maintainers.

Going through the dreaded backlog this morning...

Will