RE: [PATCH] dma-mapping: benchmark: Extract a common header file for map_benchmark definition

2022-02-10 Thread Song Bao Hua (Barry Song) via iommu



> -Original Message-
> From: tiantao (H)
> Sent: Friday, February 11, 2022 4:15 PM
> To: Song Bao Hua (Barry Song) ; sh...@kernel.org;
> chenxiang (M) 
> Cc: iommu@lists.linux-foundation.org; linux-kselft...@vger.kernel.org;
> linux...@openeuler.org
> Subject: [PATCH] dma-mapping: benchmark: Extract a common header file for
> map_benchmark definition
> 
> kernel/dma/map_benchmark.c and selftests/dma/dma_map_benchmark.c
> have duplicate map_benchmark definitions, which tends to lead to
> inconsistent changes to map_benchmark on both sides. Extract a
> common header file to avoid this problem.
> 
> Signed-off-by: Tian Tao 

+To: Christoph

This looks like the right cleanup. It will help decrease the
maintenance overhead in the future. Other similar selftest tools
are doing this as well.

Acked-by: Barry Song 

> ---
>  kernel/dma/map_benchmark.c| 24 +-
>  kernel/dma/map_benchmark.h| 31 +++
>  .../testing/selftests/dma/dma_map_benchmark.c | 25 +--
>  3 files changed, 33 insertions(+), 47 deletions(-)
>  create mode 100644 kernel/dma/map_benchmark.h
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index 9b9af1bd6be3..c05f4e242991 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -18,29 +18,7 @@
>  #include 
>  #include 
> 
> -#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> -#define DMA_MAP_MAX_THREADS  1024
> -#define DMA_MAP_MAX_SECONDS  300
> -#define DMA_MAP_MAX_TRANS_DELAY  (10 * NSEC_PER_MSEC)
> -
> -#define DMA_MAP_BIDIRECTIONAL0
> -#define DMA_MAP_TO_DEVICE1
> -#define DMA_MAP_FROM_DEVICE  2
> -
> -struct map_benchmark {
> - __u64 avg_map_100ns; /* average map latency in 100ns */
> - __u64 map_stddev; /* standard deviation of map latency */
> - __u64 avg_unmap_100ns; /* as above */
> - __u64 unmap_stddev;
> - __u32 threads; /* how many threads will do map/unmap in parallel */
> - __u32 seconds; /* how long the test will last */
> - __s32 node; /* which numa node this benchmark will run on */
> - __u32 dma_bits; /* DMA addressing capability */
> - __u32 dma_dir; /* DMA data direction */
> - __u32 dma_trans_ns; /* time for DMA transmission in ns */
> - __u32 granule;  /* how many PAGE_SIZE will do map/unmap once a time */
> - __u8 expansion[76]; /* For future use */
> -};
> +#include "map_benchmark.h"
> 
>  struct map_benchmark_data {
>   struct map_benchmark bparam;
> diff --git a/kernel/dma/map_benchmark.h b/kernel/dma/map_benchmark.h
> new file mode 100644
> index ..62674c83bde4
> --- /dev/null
> +++ b/kernel/dma/map_benchmark.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2022 HiSilicon Limited.
> + */
> +
> +#ifndef _KERNEL_DMA_BENCHMARK_H
> +#define _KERNEL_DMA_BENCHMARK_H
> +
> +#define DMA_MAP_BENCHMARK   _IOWR('d', 1, struct map_benchmark)
> +#define DMA_MAP_MAX_THREADS 1024
> +#define DMA_MAP_MAX_SECONDS 300
> +#define DMA_MAP_MAX_TRANS_DELAY (10 * NSEC_PER_MSEC)
> +
> +#define DMA_MAP_BIDIRECTIONAL   0
> +#define DMA_MAP_TO_DEVICE   1
> +#define DMA_MAP_FROM_DEVICE 2
> +
> +struct map_benchmark {
> + __u64 avg_map_100ns; /* average map latency in 100ns */
> + __u64 map_stddev; /* standard deviation of map latency */
> + __u64 avg_unmap_100ns; /* as above */
> + __u64 unmap_stddev;
> + __u32 threads; /* how many threads will do map/unmap in parallel */
> + __u32 seconds; /* how long the test will last */
> + __s32 node; /* which numa node this benchmark will run on */
> + __u32 dma_bits; /* DMA addressing capability */
> + __u32 dma_dir; /* DMA data direction */
> + __u32 dma_trans_ns; /* time for DMA transmission in ns */
> + __u32 granule;  /* how many PAGE_SIZE will do map/unmap once a time */
> +};
> +#endif /* _KERNEL_DMA_BENCHMARK_H */
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c
> b/tools/testing/selftests/dma/dma_map_benchmark.c
> index 485dff51bad2..33bf073071aa 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -11,39 +11,16 @@
>  #include 
>  #include 
>  #include 
> +#include "../../../../kernel/dma/map_benchmark.h"
> 
>  #define NSEC_PER_MSEC 1000000L
> 
> -#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> -#define DMA_MAP_MAX_THREADS  1024
> -#define DMA_MAP_MAX_SECONDS 300
> -#define DMA_MAP_MAX_TRANS_DELAY  (10 * NSEC_PER_MSEC)
> -
> -#define DMA_
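[ For illustration, a minimal userspace sketch of how the shared header
and the DMA_MAP_BENCHMARK ioctl fit together. The debugfs path comes
from the Kconfig help text; the field values are arbitrary examples
and not part of the patch. ]

    /* Hedged sketch: drive the benchmark through the shared header.
     * Error handling is trimmed for brevity. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>
    #include "../../../../kernel/dma/map_benchmark.h"

    int main(void)
    {
            struct map_benchmark bp;
            int fd = open("/sys/kernel/debug/dma_map_benchmark", O_RDONLY);

            if (fd < 0)
                    return 1;
            memset(&bp, 0, sizeof(bp));
            bp.threads = 4;                     /* 4 parallel map/unmap threads */
            bp.seconds = 10;                    /* run for 10 seconds */
            bp.node = -1;                       /* any NUMA node */
            bp.dma_bits = 32;                   /* DMA addressing capability */
            bp.dma_dir = DMA_MAP_BIDIRECTIONAL;
            bp.granule = 1;                     /* one page per map/unmap */
            if (ioctl(fd, DMA_MAP_BENCHMARK, &bp))
                    perror("ioctl");
            else
                    printf("avg map latency: %llu x 100ns\n",
                           (unsigned long long)bp.avg_map_100ns);
            return 0;
    }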

RE: [PATCH] MAINTAINERS: Update maintainer list of DMA MAPPING BENCHMARK

2022-02-08 Thread Song Bao Hua (Barry Song) via iommu



> -Original Message-
> From: chenxiang (M)
> Sent: Tuesday, February 8, 2022 8:05 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com
> Cc: linux...@openeuler.org; Linuxarm ;
> iommu@lists.linux-foundation.org; linux-kselft...@vger.kernel.org;
> chenxiang (M) 
> Subject: [PATCH] MAINTAINERS: Update maintainer list of DMA MAPPING BENCHMARK
> 
> From: Xiang Chen 
> 
> Barry Song will not focus on this area, and Xiang Chen will continue his
> work to maintain this module.
> 
> Signed-off-by: Xiang Chen 

Acked-by: Barry Song 

Xiang has been a user of this module and has made substantial
contributions not only to this module but also to related
modules such as iommu/arm-smmu-v3.
This email account of mine will become unreachable this month, and
I will probably only rarely work on this module afterwards. So I
am happy that Xiang will take care of it. Thanks!

> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea3e6c914384..48335022b0e4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5765,7 +5765,7 @@ F:  include/linux/dma-map-ops.h
>  F:   kernel/dma/
> 
>  DMA MAPPING BENCHMARK
> -M:   Barry Song 
> +M:   Xiang Chen 
>  L:   iommu@lists.linux-foundation.org
>  F:   kernel/dma/map_benchmark.c
>  F:   tools/testing/selftests/dma/
> --
> 2.33.0

Best Regards,
Barry


RE: [RFC PATCH] provide per numa cma with an initial default size

2021-12-06 Thread Song Bao Hua (Barry Song) via iommu



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Tuesday, December 7, 2021 4:01 AM
> To: Jay Chen ; h...@lst.de; 
> m.szyprow...@samsung.com;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; Song Bao Hua
> (Barry Song) 
> Cc: zhangligu...@linux.alibaba.com
> Subject: Re: [RFC PATCH] provide per numa cma with an initial default size
> 
> [ +Barry ]
> 
> On 2021-11-30 07:45, Jay Chen wrote:
> >    In the actual production environment, when we enable CMA and
> > per-NUMA CMA, if we do not set the per-NUMA size configuration on
> > the command line, we find that our performance drops by 20%.
> >    Through analysis, we found that the default per-NUMA size is 0,
> > which causes the driver to allocate memory from the global CMA
> > area, which hurts performance. Therefore, we think we need to
> > provide a default size.
> 
> Looking back at some of the review discussions, I think it may have been
> intentional that per-node areas are not allocated by default, since it's
> the kind of thing that really wants to be tuned to the particular system
> and workload, and as such it seemed reasonable to expect users to
> provide a value on the command line if they wanted the feature. That's
> certainly what the Kconfig text implies.
> 
> Thanks,
> Robin.
> 
> > Signed-off-by: Jay Chen 
> > ---
> >   kernel/dma/contiguous.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index 3d63d91cba5c..3bef8bf371d9 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -99,7 +99,7 @@ early_param("cma", early_cma);
> >   #ifdef CONFIG_DMA_PERNUMA_CMA
> >
> >   static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > -static phys_addr_t pernuma_size_bytes __initdata;
> > +static phys_addr_t pernuma_size_bytes __initdata = size_bytes;

I don't think the size of the default CMA area can apply to
per-NUMA CMA.

We did have some discussion regarding the size when per-NUMA CMA was
added, and it was handled by a Kconfig option. I think we decided not
to have any default size other than 0. A default size of 0 is perfect:
it forces users to set a proper "cma_pernuma=" boot argument.

> >
> >   static int __init early_cma_pernuma(char *p)
> >   {
> >

Thanks
Barry
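
[ For reference: per-NUMA CMA is sized explicitly on the kernel command
line. A hedged example below; 16M is an arbitrary value that should be
tuned per system and workload. ]

    cma_pernuma=16M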


RE: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright

2021-03-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: fanghao (A)
> Sent: Tuesday, March 30, 2021 7:34 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua
> (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org;
> linux-kselft...@vger.kernel.org; fanghao (A) 
> Subject: [PATCH] dma-mapping: benchmark: use the correct HiSilicon copyright
> 
> s/Hisilicon/HiSilicon/g.
> It should use a capital 'S', according to
> https://www.hisilicon.com/en/terms-of-use.
> 

My bad. Thanks.

Acked-by: Barry Song 

> Signed-off-by: Hao Fang 
> ---
>  kernel/dma/map_benchmark.c  | 2 +-
>  tools/testing/selftests/dma/dma_map_benchmark.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..00d6549 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
>   */
> 
>  #define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c
> b/tools/testing/selftests/dma/dma_map_benchmark.c
> index fb23ce9..b492bed 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Copyright (C) 2020 Hisilicon Limited.
> + * Copyright (C) 2020 HiSilicon Limited.
>   */
> 
>  #include 
> --
> 2.8.1



RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-24 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, March 24, 2021 8:13 PM
> To: tiantao (H) 
> Cc: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> ; iommu@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> On Wed, Mar 24, 2021 at 10:17:38AM +0800, Tian Tao wrote:
> > Under some scenarios, it is necessary to compile map_benchmark
> > as a module to test the IOMMU, so this patch changes Kconfig and
> > exports a symbol so that map_benchmark can be built as a module.
> >
> > On the other hand, map_benchmark is a driver, which is supposed
> > to be able to run as a module.
> >
> > Signed-off-by: Tian Tao 
> 
> Nope, we're not going to export more kthread internals for a test
> module.

The requirement comes from a colleague who is frequently changing
the map_benchmark code for customized test purposes, and he doesn't
want to rebuild the kernel image and reboot every time. So I
forwarded the requirement to Tian Tao.

Right now kthread_bind() is exported, while kthread_bind_mask() seems
to be a little bit "internal" as you said; maybe a wrapper like
kthread_bind_node() wouldn't be that "internal", compared to exposing
the cpumask?
Anyway, we haven't found other driver users for this, so I can hardly
convince you it is worthwhile.

Thanks
Barry


RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-23 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: tiantao (H)
> Sent: Wednesday, March 24, 2021 3:18 PM
> To: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> 
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; tiantao
> (H) 
> Subject: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> Under some scenarios, it is necessary to compile map_benchmark
> as a module to test the IOMMU, so this patch changes Kconfig and
> exports a symbol so that map_benchmark can be built as a module.
> 
> On the other hand, map_benchmark is a driver, which is supposed
> to be able to run as a module.
> 
> Signed-off-by: Tian Tao 
> ---

Acked-by: Barry Song 

Looks sensible to me. I like the idea that map_benchmark is
a driver. It seems unreasonable to always require it to be built in.


>  kernel/dma/Kconfig | 2 +-
>  kernel/kthread.c   | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 77b4055..0468293 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG
> If unsure, say N.
> 
>  config DMA_MAP_BENCHMARK
> - bool "Enable benchmarking of streaming DMA mapping"
> + tristate "Enable benchmarking of streaming DMA mapping"
>   depends on DEBUG_FS
>   help
> Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 1578973..fa4736f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
>  {
>   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
>  }
> +EXPORT_SYMBOL(kthread_bind_mask);
> 
>  /**
>   * kthread_bind - bind a just-created kthread to a cpu.
> --
> 2.7.4

Thanks
Barry
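
[ For context, a hedged sketch of the pattern that needs this export;
the real map_benchmark code differs in detail. A worker kthread is
created on the requested NUMA node and bound to that node's CPUs
before being woken; "map", "node" and "i" are assumed context here. ]

    struct task_struct *tsk;

    tsk = kthread_create_on_node(map_benchmark_thread, map, node,
                                 "dma-map-benchmark/%d", i);
    if (!IS_ERR(tsk) && node != NUMA_NO_NODE && node_online(node))
            kthread_bind_mask(tsk, cpumask_of_node(node));
    wake_up_process(tsk);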



RE: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap

2021-03-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: chenxiang (M)
> Sent: Thursday, March 18, 2021 10:30 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua
> (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org;
> linux-kselft...@vger.kernel.org; chenxiang (M) 
> Subject: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
> 
> From: Xiang Chen 
> 
> Currently dma-map benchmark only supports map/unmap of one page at a
> time, but there are other scenarios which need support for multi-page
> map/unmap: for multi-page interfaces such as dma_alloc_coherent() and
> dma_map_sg(), the time spent on multi-page map/unmap is not the time
> of a single page * npages (it is not linear), as block descriptors
> may be used instead of page descriptors when the size allows (such
> as 2M/1G), and a single TLB invalidation command can be sent to
> invalidate multiple pages instead of multiple commands when RIL is
> enabled (which shortens the unmap time). So it is necessary to add
> support for multi-page map/unmap.
> 
> Add a parameter "-g" to support multi-pages map/unmap.
> 
> Signed-off-by: Xiang Chen 
> ---

Acked-by: Barry Song 

>  kernel/dma/map_benchmark.c  | 21 ++---
>  tools/testing/selftests/dma/dma_map_benchmark.c | 20 
>  2 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..a5c1b01 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -38,7 +38,8 @@ struct map_benchmark {
>   __u32 dma_bits; /* DMA addressing capability */
>   __u32 dma_dir; /* DMA data direction */
>   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> - __u8 expansion[80]; /* For future use */
> + __u32 granule;  /* how many PAGE_SIZE will do map/unmap once a time */
> + __u8 expansion[76]; /* For future use */
>  };
> 
>  struct map_benchmark_data {
> @@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data)
>   void *buf;
>   dma_addr_t dma_addr;
>   struct map_benchmark_data *map = data;
> + int npages = map->bparam.granule;
> + u64 size = npages * PAGE_SIZE;
>   int ret = 0;
> 
> - buf = (void *)__get_free_page(GFP_KERNEL);
> + buf = alloc_pages_exact(size, GFP_KERNEL);
>   if (!buf)
>   return -ENOMEM;
> 
> @@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data)
>* 66 means evertything goes well! 66 is lucky.
>*/
>   if (map->dir != DMA_FROM_DEVICE)
> - memset(buf, 0x66, PAGE_SIZE);
> + memset(buf, 0x66, size);
> 
>   map_stime = ktime_get();
> - dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
> + dma_addr = dma_map_single(map->dev, buf, size, map->dir);
>   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
>   pr_err("dma_map_single failed on %s\n",
>   dev_name(map->dev));
> @@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data)
>   ndelay(map->bparam.dma_trans_ns);
> 
>   unmap_stime = ktime_get();
> - dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
> + dma_unmap_single(map->dev, dma_addr, size, map->dir);
>   unmap_etime = ktime_get();
>   unmap_delta = ktime_sub(unmap_etime, unmap_stime);
> 
> @@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data)
>   }
> 
>  out:
> - free_page((unsigned long)buf);
> + free_pages_exact(buf, size);
>   return ret;
>  }
> 
>  @@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>   struct map_benchmark_data *map = file->private_data;
>   void __user *argp = (void __user *)arg;
>   u64 old_dma_mask;
> -
>   int ret;
> 
>   if (copy_from_user(&map->bparam, argp, sizeof(map->bparam)))
>  @@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>   return -EINVAL;
>   }
> 
> + if (map->bparam.granule < 1 || map->bparam.granule > 1024) {
> + pr_err("invalid granule size\n");
> + return -EINVAL;
> + }
> +
>   switch (map->bparam.dma_dir) {
>   case DMA_MAP_BIDIRECTIONAL:
>   map
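[ A hedged usage example of the new parameter: the -t and -s flags are
assumed from the existing selftest options; -g is the new granule knob.
This would map/unmap 8 pages at a time with 16 threads for 30 seconds. ]

    ./dma_map_benchmark -t 16 -s 30 -g 8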

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Thursday, February 11, 2021 7:04 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 10:22:47PM +, Song Bao Hua (Barry Song) wrote:
> 
> > The problem is that SVA declares we can use any memory of a process
> > to do I/O. And in real scenarios, we are unable to customize most
> > applications to make them use the pool. So we are looking for some
> > generic extension for applications such as Nginx and Ceph.
> 
> But those applications will suffer jitter even if they are using the
> CPU to do the same work. I fail to see why adding an accelerator
> suddenly means the application owner will care about jitter introduced
> by migration/etc.

The only point here is that when migration occurs on memory used by
the accelerator, the impact/jitter is much bigger than it is on the
CPU. The accelerator might then be unhelpful.

> 
> Again, in proper SVA it should be quite unlikely to take a fault
> caused by something like migration, with the same likelihood as the
> CPU. If things are faulting so much that this is a problem then I
> think it is a system level problem with doing too much page motion.

My point is that one single SVA application shouldn't require the
system to make global changes, such as disabling NUMA balancing or
disabling THP, to decrease the page fault frequency, as that affects
other applications.

Anyway, people are away for the Lunar New Year. Hopefully we will get
more real benchmark data afterwards to make the discussion more
targeted.

> 
> Jason

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, February 10, 2021 2:54 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 03:01:42AM +, Song Bao Hua (Barry Song) wrote:
> 
> > On the other hand, wouldn't it be the benefit of hardware accelerators
> > to have a lower and more stable latency zip/encryption than CPU?
> 
> No, I don't think so.

Fortunately or unfortunately, I think my team has the target of
achieving lower-latency and more stable compression/encryption by
using accelerators; otherwise, they are going to use the CPU directly
if accelerators bring no advantage.

> 
> If this is an important problem then it should apply equally to CPU
> and IO jitter.
> 
> Honestly I find the idea that occasional migration jitters CPU and DMA
> to not be very compelling. Such specialized applications should
> allocate special pages to avoid this, not adding an API to be able to
> lock down any page

That is exactly what we have done: we provide a hugeTLB pool so that
applications can allocate memory from this pool.

 +--------------------------------+
 |                                |
 | applications using accelerators|
 +--------------------------------+

     alloc from pool    free to pool
           +                 +
           |                 |
           |                 |
           |                 |
 +---------+-----------------+----+
 |                                |
 |       HugeTLB memory pool      |
 |                                |
 +--------------------------------+

The problem is that SVA declares we can use any memory of a process
to do I/O, and in real scenarios we are unable to customize most
applications to make them use the pool. So we are looking for some
generic extension for applications such as Nginx and Ceph.

I am also thinking about leveraging vm.compact_unevictable_allowed,
which David suggested, and building an extension on it, for example
permitting users to disable compaction and NUMA balancing on the
unevictable pages of an SVA process, which might be a smaller change.
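
[ For reference, the existing global knob being discussed; a hedged
example. Setting it to 0 stops compaction from migrating unevictable,
e.g. mlock()ed, pages system-wide. ]

    sysctl vm.compact_unevictable_allowed=0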

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 9, 2021 10:30 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 08:35:31PM +, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> > > Sent: Tuesday, February 9, 2021 7:34 AM
> > > To: David Hildenbrand 
> > > Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; linux...@kvack.org;
> > > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> > > Morton ; Alexander Viro
> ;
> > > gre...@linuxfoundation.org; Song Bao Hua (Barry Song)
> > > ; kevin.t...@intel.com;
> > > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> > > ; zhangfei@linaro.org; chensihang (A)
> > > 
> > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide 
> > > memory
> > > pin
> > >
> > > On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> > >
> > > > People are constantly struggling with the effects of long term pinnings
> > > > under user space control, like we already have with vfio and RDMA.
> > > >
> > > > And here we are, adding yet another, easier way to mess with core MM in
> the
> > > > same way. This feels like a step backwards to me.
> > >
> > > Yes, this seems like a very poor candidate to be a system call in this
> > > format. Much too narrow, poorly specified, and possibly security
> > > implications to allow any process whatsoever to pin memory.
> > >
> > > I keep encouraging people to explore a standard shared SVA interface
> > > that can cover all these topics (and no, uaccel is not that
> > > interface), that seems much more natural.
> > >
> > > I still haven't seen an explanation why DMA is so special here,
> > > migration and so forth jitter the CPU too, environments that care
> > > about jitter have to turn this stuff off.
> >
> > This paper has a good explanation:
> > https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091
> >
> > mainly because a CPU page fault can be handled directly on the CPU
> > and we have many CPUs. But I/O page faults take a different path and
> > thus mean much higher latency, 3-80x slower than a CPU page fault:
> > events in hardware queue -> interrupt -> CPU processes page fault
> > -> return events to IOMMU/device -> continue I/O.
> 
> The justifications for this was migration scenarios and migration is
> short. If you take a fault on what you are migrating only then does it
> slow down the CPU.

I agree this can slow down the CPU, but not as much as an I/O page
fault does.

On the other hand, wouldn't it be a benefit of hardware accelerators
to have lower and more stable compression/encryption latency than the
CPU?

> 
> Are you also working with HW where the IOMMU becomes invalidated after
> a migration and doesn't reload?
> 
> ie not true SVA but the sort of emulated SVA we see in a lot of
> places?

Yes. It is true SVA, not emulated SVA.

> 
> It would be much better to work improve that to have closer sync with the
> CPU page table than to use pinning.

I absolutely agree that improving IOPF and making IOPF catch up with
the performance of CPU page faults is the best way, but it will take
a long time to optimize both hardware and software. While waiting for
them to mature, some way to minimize IOPF should probably be used to
take on that responsibility.

> 
> Jason

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: David Hildenbrand [mailto:da...@redhat.com]
> Sent: Monday, February 8, 2021 11:37 PM
> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf
> Of
> >> David Hildenbrand
> >> Sent: Monday, February 8, 2021 9:22 PM
> >> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> >> 
> >> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> >> iommu@lists.linux-foundation.org; linux...@kvack.org;
> >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >> Morton ; Alexander Viro
> ;
> >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >> ; zhangfei@linaro.org; chensihang (A)
> >> 
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >>>
> >>>
> >>>> -Original Message-
> >>>> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> Behalf
> >> Of
> >>>> Matthew Wilcox
> >>>> Sent: Monday, February 8, 2021 2:31 PM
> >>>> To: Song Bao Hua (Barry Song) 
> >>>> Cc: Wangzhou (B) ;
> linux-ker...@vger.kernel.org;
> >>>> iommu@lists.linux-foundation.org; linux...@kvack.org;
> >>>> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >>>> Morton ; Alexander Viro
> >> ;
> >>>> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >>>> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >>>> ; zhangfei@linaro.org; chensihang (A)
> >>>> 
> >>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide 
> >>>> memory
> >>>> pin
> >>>>
> >>>> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) 
> >>>> wrote:
> >>>>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>>>> I/O on a memory without IO page faults which can result in 
> >>>>>>> dramatically
> >>>>>>> increased latency. Current memory related APIs could not achieve this
> >>>>>>> requirement, e.g. mlock can only avoid memory to swap to backup 
> >>>>>>> device,
> >>>>>>> page migration can still trigger IO page fault.
> >>>>>>
> >>>>>> Well ... we have two requirements.  The application wants to not take
> >>>>>> page faults.  The system wants to move the application to a different
> >>>>>> NUMA node in order to optimise overall performance.  Why should the
> >>>>>> application's desires take precedence over the kernel's desires?  And
> why
> >>>>>> should it be done this way rather than by the sysadmin using numactl
> to
> >>>>>> lock the application to a particular node?
> >>>>>
> >>>>> NUMA balancing is just one of many reasons for page migration. Even a
> >>>>> simple alloc_pages() can cause memory migration within just a single
> >>>>> NUMA node or a UMA system.
> >>>>>
> >>>>> The other reasons for page migration include but are not limited to:
> >>>>> * memory move due to CMA
> >>>>> * memory move due to huge page creation
> >>>>>
> >>>>> We can hardly ask users to disable compaction, CMA, and huge pages
> >>>>> for the whole system.
> >>>>
> >>>> You're dodging the question.  Should the CMA allocation fail because
> >>>> another application is usin

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 9, 2021 7:34 AM
> To: David Hildenbrand 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; Song Bao Hua (Barry Song)
> ; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> 
> > People are constantly struggling with the effects of long term pinnings
> > under user space control, like we already have with vfio and RDMA.
> >
> > And here we are, adding yet another, easier way to mess with core MM in the
> > same way. This feels like a step backwards to me.
> 
> Yes, this seems like a very poor candidate to be a system call in this
> format. Much too narrow, poorly specified, and possibly security
> implications to allow any process whatsoever to pin memory.
> 
> I keep encouraging people to explore a standard shared SVA interface
> that can cover all these topics (and no, uaccel is not that
> interface), that seems much more natural.
> 
> I still haven't seen an explanation why DMA is so special here,
> migration and so forth jitter the CPU too, environments that care
> about jitter have to turn this stuff off.

This paper has a good explanation:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091

mainly because a CPU page fault can be handled directly on the CPU and
we have many CPUs. But I/O page faults take a different path and thus
mean much higher latency, 3-80x slower than a CPU page fault:
events in hardware queue -> interrupt -> CPU processes page fault
-> return events to IOMMU/device -> continue I/O.

Copied from the paper:

If the IOMMU's page table walker fails to find the desired
translation in the page table, it sends an ATS response to
the GPU notifying it of this failure. This in turn corresponds
to a page fault. In response, the GPU sends another request to
the IOMMU called a Peripheral Page Request (PPR). The IOMMU
places this request in a memory-mapped queue and raises an
interrupt on the CPU. Multiple PPR requests can be queued
before the CPU is interrupted. The OS must have a suitable
IOMMU driver to process this interrupt and the queued PPR
requests. In Linux, while in an interrupt context, the driver
pulls PPR requests from the queue and places them in a work-queue
for later processing. Presumably this design decision was made
to minimize the time spent executing in an interrupt context,
where lower priority interrupts would be disabled. At a later
time, an OS worker-thread calls back into the driver to process
page fault requests in the work-queue. Once the requests are
serviced, the driver notifies the IOMMU. In turn, the IOMMU
notifies the GPU. The GPU then sends another ATS request to
retry the translation for the original faulting address.

Comparison with CPU: On the CPU, a hardware exception is
raised on a page fault, which immediately switches to the
OS. In most cases in Linux, this routine services the page
fault directly, instead of queuing it for later processing.
Contrast this with a page fault from an accelerator, where
the IOMMU has to interrupt the CPU to request service on
its behalf, and also note the several back-and-forth messages
between the accelerator, the IOMMU, and the CPU. Furthermore,
page faults on the CPU are generally handled one at a time
on the CPU, while for the GPU they are batched by the IOMMU
and OS work-queue mechanism.

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> David Hildenbrand
> Sent: Monday, February 8, 2021 9:22 PM
> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf
> Of
> >> Matthew Wilcox
> >> Sent: Monday, February 8, 2021 2:31 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> >> iommu@lists.linux-foundation.org; linux...@kvack.org;
> >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >> Morton ; Alexander Viro
> ;
> >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >> ; zhangfei@linaro.org; chensihang (A)
> >> 
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:
> >>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>> I/O on a memory without IO page faults which can result in dramatically
> >>>>> increased latency. Current memory related APIs could not achieve this
> >>>>> requirement, e.g. mlock can only avoid memory to swap to backup device,
> >>>>> page migration can still trigger IO page fault.
> >>>>
> >>>> Well ... we have two requirements.  The application wants to not take
> >>>> page faults.  The system wants to move the application to a different
> >>>> NUMA node in order to optimise overall performance.  Why should the
> >>>> application's desires take precedence over the kernel's desires?  And why
> >>>> should it be done this way rather than by the sysadmin using numactl to
> >>>> lock the application to a particular node?
> >>>
> >>> NUMA balancing is just one of many reasons for page migration. Even a
> >>> simple alloc_pages() can cause memory migration within just a single
> >>> NUMA node or a UMA system.
> >>>
> >>> The other reasons for page migration include but are not limited to:
> >>> * memory move due to CMA
> >>> * memory move due to huge page creation
> >>>
> >>> We can hardly ask users to disable compaction, CMA, and huge pages
> >>> for the whole system.
> >>
> >> You're dodging the question.  Should the CMA allocation fail because
> >> another application is using SVA?
> >>
> >> I would say no.
> >
> > I would say no as well.
> >
> > While the IOMMU is enabled, CMA has almost one user only: the IOMMU
> > driver, as other drivers will depend on the IOMMU to use
> > non-contiguous memory though they are still calling
> > dma_alloc_coherent().
> >
> > In the IOMMU driver, dma_alloc_coherent is called during
> > initialization and there are no new allocations afterwards. So it
> > wouldn't cause a runtime impact on SVA performance. Even if there
> > are new allocations, CMA will fall back to the general alloc_pages()
> > and IOMMU drivers mostly allocate small amounts of memory for
> > command queues.
> >
> > So I would say general compound pages and huge pages, especially
> > transparent huge pages, would be bigger concerns than CMA for
> > internal page migration within one NUMA node.
> >
> > Unlike CMA, the general alloc_pages() can get memory by moving
> > pages other than those pinned.
> >
> > And there is no guarantee we can always bind the memory of
> > SVA applications to one single NUMA node, so NUMA balancing is
> > still a concern.
> >
> > But I agree we need a way to make CMA succeed while the userspace
> > pages are pinned. Since pinning has been viral in many drivers, I
> > assume there is a way to handle this. Otherwise, APIs like
> > V4L2_MEMORY_USERPTR[1] will

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: David Rientjes [mailto:rient...@google.com]
> Sent: Monday, February 8, 2021 3:18 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Matthew Wilcox ; Wangzhou (B)
> ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, 7 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > NUMA balancing is just one of many reasons for page migration. Even a
> > simple alloc_pages() can cause memory migration within just a single
> > NUMA node or a UMA system.
> >
> > The other reasons for page migration include but are not limited to:
> > * memory move due to CMA
> > * memory move due to huge page creation
> >
> > We can hardly ask users to disable compaction, CMA, and huge pages
> > for the whole system.
> >
> 
> What about only for mlocked memory, i.e. disable
> vm.compact_unevictable_allowed?
> 
> Adding syscalls is a big deal, we can make a reasonable inference that
> we'll have to support this forever if it's merged.  I haven't seen mention
> of what other unevictable memory *should* be migratable that would be
> adversely affected if we disable that sysctl.  Maybe that gets you part of
> the way there and there are some other deficiencies, but it seems like a
> good start would be to describe how CONFIG_NUMA_BALANCING=n +
> vm.compact_unevictable_allowed + mlock() doesn't get you mostly there and
> then look into what's missing.
> 

I believe it can resolve the performance problem for SVA
applications if we disable vm.compact_unevictable_allowed, disable
NUMA balancing, and use mlock().
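
[ A hedged sketch of that system-wide combination: build with
CONFIG_NUMA_BALANCING=n (or disable it at runtime), keep compaction
away from unevictable pages, and mlock the application's memory. ]

    sysctl kernel.numa_balancing=0           # stop automatic NUMA balancing
    sysctl vm.compact_unevictable_allowed=0  # compaction skips mlocked pages
    # plus mlock()/mlockall(MCL_CURRENT | MCL_FUTURE) in the application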

The problem is that it is unreasonable to ask users to disable
vm.compact_unevictable_allowed or NUMA balancing for the whole system
only because there is one SVA application in the system.

SVA, by itself, is a mechanism to let the CPU and devices share the
same address space. In a typical server system there are many
processes, so the better way would be to change only the behavior of
the specific process rather than the whole system. It is hard to ask
users to do that only because there is one SVA monster.
Plus, this might negatively affect those applications not using SVA.

> If it's a very compelling case where there simply are no alternatives, it
> would make sense.  Alternative is to find a more generic way, perhaps in
> combination with vm.compact_unevictable_allowed, to achieve what you're
> looking to do that can be useful even beyond your originally intended use
> case.

Sensible. Actually, pinning is exactly the way to disable migration
for specific pages, i.e. disabling "vm.compact_unevictable_allowed"
on those pages only.

It is hard to differentiate which pages should not be migrated. Only
the application knows that, as even SVA applications can allocate many
non-I/O pages which should remain movable.

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> Matthew Wilcox
> Sent: Monday, February 8, 2021 2:31 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:
> > > > In high-performance I/O cases, accelerators might want to perform
> > > > I/O on a memory without IO page faults which can result in dramatically
> > > > increased latency. Current memory related APIs could not achieve this
> > > > requirement, e.g. mlock can only avoid memory to swap to backup device,
> > > > page migration can still trigger IO page fault.
> > >
> > > Well ... we have two requirements.  The application wants to not take
> > > page faults.  The system wants to move the application to a different
> > > NUMA node in order to optimise overall performance.  Why should the
> > > application's desires take precedence over the kernel's desires?  And why
> > > should it be done this way rather than by the sysadmin using numactl to
> > > lock the application to a particular node?
> >
> > NUMA balancing is just one of many reasons for page migration. Even a
> > simple alloc_pages() can cause memory migration within just a single
> > NUMA node or a UMA system.
> >
> > The other reasons for page migration include but are not limited to:
> > * memory move due to CMA
> > * memory move due to huge page creation
> >
> > We can hardly ask users to disable compaction, CMA, and huge pages
> > for the whole system.
> 
> You're dodging the question.  Should the CMA allocation fail because
> another application is using SVA?
> 
> I would say no.  

I would say no as well.

While the IOMMU is enabled, CMA has almost one user only: the IOMMU
driver, as other drivers will depend on the IOMMU to use
non-contiguous memory though they are still calling
dma_alloc_coherent().

In the IOMMU driver, dma_alloc_coherent is called during
initialization and there are no new allocations afterwards. So it
wouldn't cause a runtime impact on SVA performance. Even if there are
new allocations, CMA will fall back to the general alloc_pages() and
IOMMU drivers mostly allocate small amounts of memory for command
queues.

So I would say general compound pages and huge pages, especially
transparent huge pages, would be bigger concerns than CMA for
internal page migration within one NUMA node.

Unlike CMA, the general alloc_pages() can get memory by moving
pages other than those pinned.

And there is no guarantee we can always bind the memory of
SVA applications to one single NUMA node, so NUMA balancing is
still a concern.

But I agree we need a way to make CMA succeed while the userspace
pages are pinned. Since pinning has been viral in many drivers, I
assume there is a way to handle this. Otherwise, APIs like
V4L2_MEMORY_USERPTR[1] will possibly make CMA fail, as there is no
guarantee that userspace will allocate unmovable memory and no
guarantee that the fallback path, alloc_pages(), can succeed when
allocating big memory.

Will investigate more.

> The application using SVA should take the one-time
> performance hit from having its memory moved around.

Sometimes I also feel SVA is doomed to suffer a performance
impact due to page migration. But we are still trying to
extend its use cases to high-performance I/O.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/media/v4l2-core/videobuf-dma-sg.c

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Matthew Wilcox [mailto:wi...@infradead.org]
> Sent: Monday, February 8, 2021 10:34 AM
> To: Wangzhou (B) 
> Cc: linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; linux-arm-ker...@lists.infradead.org;
> linux-...@vger.kernel.org; Andrew Morton ;
> Alexander Viro ; gre...@linuxfoundation.org; Song
> Bao Hua (Barry Song) ; j...@ziepe.ca;
> kevin.t...@intel.com; jean-phili...@linaro.org; eric.au...@redhat.com;
> Liguozhu (Kenneth) ; zhangfei@linaro.org;
> chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, Feb 07, 2021 at 04:18:03PM +0800, Zhou Wang wrote:
> > SVA (shared virtual address) offers a way for a device to share a
> > process's virtual address space safely, which makes user space
> > device driver coding more convenient. However, I/O page faults may
> > happen when doing DMA operations. As the latency of an I/O page
> > fault is relatively big, DMA performance will be affected severely
> > when there are I/O page faults. From a long-term view, DMA
> > performance will not be stable.
> >
> > In high-performance I/O cases, accelerators might want to perform
> > I/O on memory without I/O page faults, which can otherwise result in
> > dramatically increased latency. Current memory-related APIs cannot
> > achieve this requirement; e.g. mlock can only prevent memory from
> > being swapped to a backup device, and page migration can still
> > trigger an I/O page fault.
> 
> Well ... we have two requirements.  The application wants to not take
> page faults.  The system wants to move the application to a different
> NUMA node in order to optimise overall performance.  Why should the
> application's desires take precedence over the kernel's desires?  And why
> should it be done this way rather than by the sysadmin using numactl to
> lock the application to a particular node?

NUMA balancing is just one of many reasons for page migration. Even a
simple alloc_pages() can cause memory migration within just a single
NUMA node or a UMA system.

The other reasons for page migration include but are not limited to:
* memory move due to CMA
* memory move due to huge page creation

We can hardly ask users to disable compaction, CMA, and huge pages
for the whole system.

On the other hand, numactl doesn't always bind memory to one single
NUMA node; sometimes, when applications require many CPUs, it may bind
more than one memory node, as in the example below.
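
[ For example, a hedged numactl invocation binding an application to
the CPUs and memory of two nodes at once; "./app" is a placeholder. ]

    numactl --cpunodebind=0,1 --membind=0,1 ./app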

Thanks
Barry



RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 11:36 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; m.szyprow...@samsung.com;
> robin.mur...@arm.com; iommu@lists.linux-foundation.org;
> linux-ker...@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On Fri, Feb 05, 2021 at 10:32:26AM +, Song Bao Hua (Barry Song) wrote:
> > I can keep the struct size unchanged by changing the struct to
> >
> > struct map_benchmark {
> > __u64 avg_map_100ns; /* average map latency in 100ns */
> > __u64 map_stddev; /* standard deviation of map latency */
> > __u64 avg_unmap_100ns; /* as above */
> > __u64 unmap_stddev;
> > __u32 threads; /* how many threads will do map/unmap in parallel */
> > __u32 seconds; /* how long the test will last */
> > __s32 node; /* which numa node this benchmark will run on */
> > __u32 dma_bits; /* DMA addressing capability */
> > __u32 dma_dir; /* DMA data direction */
> > __u32 dma_trans_ns; /* time for DMA transmission in ns */
> >
> > __u32 exp; /* For future use */
> > __u64 expansion[9]; /* For future use */
> > };
> >
> > But the code is really ugly now.
> 
> Thats why we usually use __u8 fields for reserved field.  You might
> consider just switching to that instead while you're at it. I guess
> we'll just have to get the addition into 5.11 then to make sure we
> don't release a kernel with the alignment fix.

I assume there is no need to keep the same size as 5.11-rc, so we
could change the struct to:

struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u8 expansion[84]; /* For future use */
};

This won't increase the size on 64-bit systems, but it increases the
size by 4 bytes on 32-bit systems compared to 5.11-rc. What do you
think about it?
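
[ A hedged build-time guard could make the size invariant explicit;
this is not part of the patch and assumes C11 _Static_assert (or the
kernel's static_assert()). 4 x __u64 + 5 x 4-byte fields + 84 reserved
bytes = 136 bytes, with no padding on either 32- or 64-bit. ]

    _Static_assert(sizeof(struct map_benchmark) == 136,
                   "map_benchmark ABI size changed");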

Thanks
Barry



RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 10:21 PM
> To: Song Bao Hua (Barry Song) 
> Cc: m.szyprow...@samsung.com; h...@lst.de; robin.mur...@arm.com;
> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On Fri, Feb 05, 2021 at 03:00:35PM +1300, Barry Song wrote:
> > +   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> > __u64 expansion[10];/* For future use */
> 
> We need to keep the struct size, so the expansion field needs to
> shrink by the equivalent amount of data that is added in dma_trans_ns.

Unfortunately I didn't put a reserved u32 field after dma_dir
in the original patch.
There were five 32-bit fields before expansion[]:

struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u64 expansion[10];/* For future use */
};

My bad. That was really silly. I should have done the below from
the very beginning:
struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u32 rsv;
__u64 expansion[10];/* For future use */
};

So on 64-bit systems, this patch doesn't change the length of the
struct, as the newly added u32 just fills the gap between dma_dir and
expansion.

On 32-bit systems, this patch increases the length by 4 bytes.

I can keep the struct size unchanged by changing the struct to

struct map_benchmark {
__u64 avg_map_100ns; /* average map latency in 100ns */
__u64 map_stddev; /* standard deviation of map latency */
__u64 avg_unmap_100ns; /* as above */
__u64 unmap_stddev;
__u32 threads; /* how many threads will do map/unmap in parallel */
__u32 seconds; /* how long the test will last */
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u32 dma_trans_ns; /* time for DMA transmission in ns */

__u32 exp; /* For future use */
__u64 expansion[9]; /* For future use */
};

But the code is really ugly now.

Thanks
Barry


RE: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-04 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, February 5, 2021 12:51 PM
> To: Song Bao Hua (Barry Song) ;
> m.szyprow...@samsung.com; h...@lst.de; iommu@lists.linux-foundation.org
> Cc: linux-ker...@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On 2021-02-04 22:58, Barry Song wrote:
> > In a real DMA mapping use case, after dma_map is done, data will be
> > transmitted. Thus, in a multi-threaded user scenario, IOMMU
> > contention should not be that severe. For example, if users enable
> > multiple threads to send network packets through a 1G/10G/100Gbps
> > NIC, usually the steps will be: map -> transmission -> unmap. The
> > transmission delay reduces IOMMU contention. Here a delay is added
> > to simulate the transmission for the TX case so that the test result
> > is more accurate.
> >
> > The RX case would be much more tricky. It is not supported yet.
> 
> I guess it might be a reasonable approximation to map several pages,
> then unmap them again after a slightly more random delay. Or maybe
> divide the threads into pairs of mappers and unmappers respectively
> filling up and draining proper little buffer pools.

Yes, good suggestions. I am actually thinking about how to support
cases like networking. There is a pre-mapped list of pages, and each
page is bound to some hardware DMA block descriptor (BD). So if Linux
can consume the packets in time, those buffers are always re-used.
Only when the page bound to a BD is full and the OS can't consume it
in time will another temp page be allocated and mapped; the BD
switches to this temp page, which is finally unmapped once it is no
longer needed. The pre-mapped pages, on the other hand, are never
unmapped.

For things like filesystems and disk drivers, RX is always requested by
users. The model would be simpler: map -> rx -> unmap. For networking,
RX can come spontaneously.

Anyway, I'll put this into TBD. For the moment, mainly handle the TX path.
Or maybe the current code is already able to handle a simple RX model :-)
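
Just to sketch your randomized-delay idea in code; this is purely
illustrative and not part of the patch (the +50% jitter ratio is an
arbitrary choice):

#include <linux/delay.h>
#include <linux/prandom.h>

/* Pretend DMA is receiving: hold the mapping for the nominal
 * transfer time plus some random jitter before unmapping. */
static void pretend_dma_rx(struct map_benchmark_data *map)
{
	u32 base = map->bparam.dma_trans_ns;

	ndelay(base + prandom_u32_max(base / 2 + 1));
}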

> 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/map_benchmark.c  | 11 +++
> >   tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++--
> >   2 files changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > index 1b1b8ff875cb..1976db7e34e4 100644
> > --- a/kernel/dma/map_benchmark.c
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -21,6 +21,7 @@
> >   #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> >   #define DMA_MAP_MAX_THREADS   1024
> >   #define DMA_MAP_MAX_SECONDS   300
> > +#define DMA_MAP_MAX_TRANS_DELAY(10 * 1000 * 1000) /* 10ms */
> 
> Using MSEC_PER_SEC might be sufficiently self-documenting?

Yes, I guess you mean NSEC_PER_MSEC. Will move to it.

> 
> >   #define DMA_MAP_BIDIRECTIONAL 0
> >   #define DMA_MAP_TO_DEVICE 1
> > @@ -36,6 +37,7 @@ struct map_benchmark {
> > __s32 node; /* which numa node this benchmark will run on */
> > __u32 dma_bits; /* DMA addressing capability */
> > __u32 dma_dir; /* DMA data direction */
> > +   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> > __u64 expansion[10];/* For future use */
> >   };
> >
> > @@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data)
> > map_etime = ktime_get();
> > map_delta = ktime_sub(map_etime, map_stime);
> >
> > +   /* Pretend DMA is transmitting */
> > +   if (map->dir != DMA_FROM_DEVICE)
> > +   ndelay(map->bparam.dma_trans_ns);
> 
> TBH I think the option of a fixed delay between map and unmap might be a
> handy thing in general, so having the direction check at all seems
> needlessly restrictive. As long as the driver implements all the basic
> building blocks, combining them to simulate specific traffic patterns
> can be left up to the benchmark tool.

Sensible, will remove the condition check.
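
That is, the hunk would then simply become the unconditional:

		/* Pretend DMA is transmitting */
		ndelay(map->bparam.dma_trans_ns);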

> 
> Robin.
> 
> > +
> > unmap_stime = ktime_get();
> > dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
> > unmap_etime = ktime_get();
> > @@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file,
> unsigned int cmd,
> > return -EINVAL;
> > }
> >
> > +   if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
> > +   pr_err("invalid transmission delay\n");
> > +   return -EINVAL;
> > +   }

RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Tuesday, February 2, 2021 3:52 PM
> To: Jason Gunthorpe 
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, February 2, 2021 7:44 AM
> >
> > On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote:
> > > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > > we would get both sharing address and stable I/O latency.
> > >
> > > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > > cpu_va of the memory pool as the iova?
> >
> > I think their issue is the HW can't do the cpu_va trick without also
> > involving the system IOMMU in a SVA mode
> >
> 
> This is the part that I didn't understand. Using cpu_va in a MAP_DMA
> interface doesn't require device support. It's just an user-specified
> address to be mapped into the IOMMU page table. On the other hand,

The background is that uacce is based on SVA and we are building
applications on uacce:
https://www.kernel.org/doc/html/v5.10/misc-devices/uacce.html
so the IOMMU simply uses the page table of the MMU and doesn't do any
special mapping to a user-specified address. We don't break the basic
assumption that uacce is using SVA; otherwise, we would need to
re-build uacce and the whole base.

> sharing CPU page table through a SVA interface for an usage where I/O
> page faults must be completely avoided seems a misleading attempt.

That is not about completely avoiding IO page faults; it is just
an extension for the high-performance I/O case, providing a way to
avoid IO latency jitter. Using it or not is totally up to users.

> Even if people do want this model (e.g. mix pinning+fault), it should be
> a mm syscall as Greg pointed out, not specific to sva.
> 

We are glad to make it a syscall if people are happy with
that. The simplest way would be a syscall similar to
userfaultfd, if we don't want to mess up mm_struct.

> Thanks
> Kevin

Thanks
Barry


RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 2, 2021 12:44 PM
> To: Tian, Kevin 
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote:
> > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > we would get both sharing address and stable I/O latency.
> >
> > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > cpu_va of the memory pool as the iova?
> 
> I think their issue is the HW can't do the cpu_va trick without also
> involving the system IOMMU in a SVA mode
> 
> It really is something that belongs under some general /dev/sva as we
> talked on the vfio thread

AFAIK, there is no such /dev/sva yet, so /dev/uacce is a uAPI
which belongs to SVA.

Another option is that we add a system call like
fs/userfaultfd.c, and move the file_operations and ioctl
to an anon inode by creating the fd via anon_inode_getfd().
Then nothing will be buried in uacce.
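
Roughly like the below sketch; the fops callbacks and the fd name are
hypothetical, only anon_inode_getfd() itself is the real API:

#include <linux/anon_inodes.h>
#include <linux/fcntl.h>
#include <linux/fs.h>

static const struct file_operations pin_fops = {
	.owner          = THIS_MODULE,
	.unlocked_ioctl = pin_ioctl,    /* hypothetical pin/unpin cmds */
	.release        = pin_release,  /* hypothetical cleanup */
};

static int create_pin_fd(void *priv)
{
	/* hand out an anonymous fd so nothing is buried in uacce */
	return anon_inode_getfd("[dma_pin]", &pin_fops, priv,
				O_RDWR | O_CLOEXEC);
}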

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Friday, January 29, 2021 11:09 PM
> To: Song Bao Hua (Barry Song) ; Jason Gunthorpe
> 
> Cc: chensihang (A) ; Arnd Bergmann
> ; Greg Kroah-Hartman ;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Zhangfei Gao ; Liguozhu
> (Kenneth) ; linux-accelerat...@lists.ozlabs.org
> Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, January 26, 2021 9:27 AM
> >
> > > -Original Message-
> > > From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> > > Sent: Tuesday, January 26, 2021 2:13 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: Wangzhou (B) ; Greg Kroah-Hartman
> > > ; Arnd Bergmann ;
> > Zhangfei Gao
> > > ; linux-accelerat...@lists.ozlabs.org;
> > > linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> > > linux...@kvack.org; Liguozhu (Kenneth) ;
> > chensihang
> > > (A) 
> > > Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> > >
> > > On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song)
> > wrote:
> > >
> > > > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song)
> > wrote:
> > > > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > > > be able to stop page moving due to:
> > > > > > * memory compaction in alloc_pages()
> > > > > > * making huge pages
> > > > > > * numa balance
> > > > > > * memory compaction in CMA
> > > > >
> > > > > Enabling those things is a major reason to have SVA device in the
> > > > > first place, providing a SW API to turn it all off seems like the
> > > > > wrong direction.
> > > >
> > > > I wouldn't say this is a major reason to have SVA. If we read the
> > > > history of SVA and papers, people would think easy programming due
> > > > to data struct sharing between cpu and device, and process space
> > > > isolation in device would be the major reasons for SVA. SVA also
> > > > declares it supports zero-copy while zero-copy doesn't necessarily
> > > > depend on SVA.
> > >
> > > Once you have to explicitly make system calls to declare memory under
> > > IO, you loose all of that.
> > >
> > > Since you've asked the app to be explicit about the DMAs it intends to
> > > do, there is not really much reason to use SVA for those DMAs anymore.
> >
> > Let's see a non-SVA case. We are not using SVA, we can have
> > a memory pool by hugetlb or pin, and app can allocate memory
> > from this pool, and get stable I/O performance on the memory
> > from the pool. But device has its separate page table which
> > is not bound with this process, thus lacking the protection
> > of process space isolation. Plus, CPU and device are using
> > different address.
> >
> > And then we move to SVA case, we can still have a memory pool
> > by hugetlb or pin, and app can allocate memory from this pool
> > since this pool is mapped to the address space of the process,
> > and we are able to get stable I/O performance since it is always
> > there. But in this case, device is using the page table of
> > process with the full permission control.
> > And they are using same address and can possibly enjoy the easy
> > programming if HW supports.
> >
> > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > we would get both sharing address and stable I/O latency.
> >
> 
> Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> cpu_va of the memory pool as the iova?

I think it enjoys the advantage of the stable I/O latency of
traditional MAP_DMA, and also uses the process page table
which SVA can provide. The major difference is that in the
SVA case, the iova totally belongs to the process and is as
normal as other heap/stack/data:
p = mmap(..., MAP_ANON, ...);
ioctl(/dev/acc, p, PIN);

SVA, by itself, provides the ability to guarantee the
address space isolation of multiple processes. If the
device can directly access data structs such as lists and
trees, users can further enjoy the convenience of
programming that SVA gives.

So we are looking for a combination of the stable I/O latency
of a traditional DMA map and the ability of SVA.
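
In user-space terms, the flow above would look roughly like the below
(re-using the uacce_pin_address layout from the RFC; the ioctl command
name is a placeholder, not a real uAPI):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

struct uacce_pin_address {
	uint64_t addr;
	uint64_t size;
};

static int pin_buffer(int acc_fd, void **buf, size_t size)
{
	/* ordinary anonymous memory, already shared with the device
	 * through SVA; pinning only stabilizes the I/O latency */
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	struct uacce_pin_address pin = {
		.addr = (uintptr_t)p,
		.size = size,
	};

	if (p == MAP_FAILED)
		return -1;
	*buf = p;
	return ioctl(acc_fd, UACCE_CMD_PIN, &pin); /* placeholder cmd */
}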

> 
> Thanks
> Kevin

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-27 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, January 27, 2021 7:20 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Tue, Jan 26, 2021 at 01:26:45AM +, Song Bao Hua (Barry Song) wrote:
> > > On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song)
> wrote:
> > > > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > > > be able to stop page moving due to:
> > > > > > * memory compaction in alloc_pages()
> > > > > > * making huge pages
> > > > > > * numa balance
> > > > > > * memory compaction in CMA
> > > > >
> > > > > Enabling those things is a major reason to have SVA device in the
> > > > > first place, providing a SW API to turn it all off seems like the
> > > > > wrong direction.
> > > >
> > > > I wouldn't say this is a major reason to have SVA. If we read the
> > > > history of SVA and papers, people would think easy programming due
> > > > to data struct sharing between cpu and device, and process space
> > > > isolation in device would be the major reasons for SVA. SVA also
> > > > declares it supports zero-copy while zero-copy doesn't necessarily
> > > > depend on SVA.
> > >
> > > Once you have to explicitly make system calls to declare memory under
> > > IO, you loose all of that.
> > >
> > > Since you've asked the app to be explicit about the DMAs it intends to
> > > do, there is not really much reason to use SVA for those DMAs anymore.
> >
> > Let's see a non-SVA case. We are not using SVA, we can have
> > a memory pool by hugetlb or pin, and app can allocate memory
> > from this pool, and get stable I/O performance on the memory
> > from the pool. But device has its separate page table which
> > is not bound with this process, thus lacking the protection
> > of process space isolation. Plus, CPU and device are using
> > different address.
> 
> So you are relying on the platform to do the SVA for the device?
> 

Sorry for the late response.

uacce and its userspace framework UADK depend on SVA, leveraging
the enhanced security of isolated process address spaces.

This patch is mainly an extension for performance optimization, to
get stable high-performance I/O on pinned memory even though the
hardware supports IO page fault to get pages back after swap-out
or page migration.
But IO page fault will cause serious latency jitter for high-speed
I/O.
Slow-speed devices don't need to use this extension.

> This feels like it goes back to another topic where I felt the SVA
> setup uAPI should be shared and not buried into every driver's unique
> ioctls.
> 
> Having something like this in a shared SVA system is somewhat less
> strange.

Sounds reasonable. On the other hand, uacce seems to be a common
uAPI for SVA, and probably the only one at this moment.

uacce is a framework, not a specific driver, as any accelerator
can hook into this framework as long as the device provides
uacce_ops and registers itself via uacce_register(). uacce, by
itself, doesn't bind to any specific hardware. So the uacce
interfaces are a kind of common uAPI :-)

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, January 26, 2021 2:13 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) wrote:
> 
> > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) wrote:
> > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > be able to stop page moving due to:
> > > > * memory compaction in alloc_pages()
> > > > * making huge pages
> > > > * numa balance
> > > > * memory compaction in CMA
> > >
> > > Enabling those things is a major reason to have SVA device in the
> > > first place, providing a SW API to turn it all off seems like the
> > > wrong direction.
> >
> > I wouldn't say this is a major reason to have SVA. If we read the
> > history of SVA and papers, people would think easy programming due
> > to data struct sharing between cpu and device, and process space
> > isolation in device would be the major reasons for SVA. SVA also
> > declares it supports zero-copy while zero-copy doesn't necessarily
> > depend on SVA.
> 
> Once you have to explicitly make system calls to declare memory under
> IO, you loose all of that.
> 
> Since you've asked the app to be explicit about the DMAs it intends to
> do, there is not really much reason to use SVA for those DMAs anymore.

Let's see a non-SVA case. We are not using SVA, we can have
a memory pool by hugetlb or pin, and app can allocate memory
from this pool, and get stable I/O performance on the memory
from the pool. But device has its separate page table which
is not bound with this process, thus lacking the protection
of process space isolation. Plus, CPU and device are using
different address.

And then we move to SVA case, we can still have a memory pool
by hugetlb or pin, and app can allocate memory from this pool
since this pool is mapped to the address space of the process,
and we are able to get stable I/O performance since it is always
there. But in this case, device is using the page table of
process with the full permission control.
And they are using same address and can possibly enjoy the easy
programming if HW supports.

SVA is not doom to work with IO page fault only. If we have SVA+pin,
we would get both sharing address and stable I/O latency.

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> Jason Gunthorpe
> Sent: Tuesday, January 26, 2021 12:16 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) wrote:
> > mlock, while certainly be able to prevent swapping out, it won't
> > be able to stop page moving due to:
> > * memory compaction in alloc_pages()
> > * making huge pages
> > * numa balance
> > * memory compaction in CMA
> 
> Enabling those things is a major reason to have SVA device in the
> first place, providing a SW API to turn it all off seems like the
> wrong direction.

I wouldn't say this is a major reason to have SVA. If we read the
history of SVA and papers, people would think easy programming due
to data struct sharing between cpu and device, and process space
isolation in device would be the major reasons for SVA. SVA also
declares it supports zero-copy while zero-copy doesn't necessarily
depend on SVA.

Page migration and I/O page fault overhead, on the other hand, are
probably the major problems blocking SVA from becoming a
high-performance and more popular solution.

> 
> If the device doesn't want to use SVA then don't use it, use normal
> DMA pinning like everything else.
> 

If we disable SVA, we won't get SVA's benefits of address sharing
and process space isolation.

> Jason

Thanks
Barry


RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, January 26, 2021 4:47 AM
> To: Wangzhou (B) 
> Cc: Greg Kroah-Hartman ; Arnd Bergmann
> ; Zhangfei Gao ;
> linux-accelerat...@lists.ozlabs.org; linux-ker...@vger.kernel.org;
> iommu@lists.linux-foundation.org; linux...@kvack.org; Song Bao Hua (Barry 
> Song)
> ; Liguozhu (Kenneth) ;
> chensihang (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 04:34:56PM +0800, Zhou Wang wrote:
> 
> > +static int uacce_pin_page(struct uacce_pin_container *priv,
> > + struct uacce_pin_address *addr)
> > +{
> > +   unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> > +   unsigned long first, last, nr_pages;
> > +   struct page **pages;
> > +   struct pin_pages *p;
> > +   int ret;
> > +
> > +   first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +   last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +   nr_pages = last - first + 1;
> > +
> > +   pages = vmalloc(nr_pages * sizeof(struct page *));
> > +   if (!pages)
> > +   return -ENOMEM;
> > +
> > +   p = kzalloc(sizeof(*p), GFP_KERNEL);
> > +   if (!p) {
> > +   ret = -ENOMEM;
> > +   goto free;
> > +   }
> > +
> > +   ret = pin_user_pages_fast(addr->addr & PAGE_MASK, nr_pages,
> > + flags | FOLL_LONGTERM, pages);
> 
> This needs to copy the RLIMIT_MEMLOCK and can_do_mlock() stuff from
> other places, like ib_umem_get
> 
> > +   ret = xa_err(xa_store(>array, p->first, p, GFP_KERNEL));
> 
> And this is really weird, I don't think it makes sense to make handles
> for DMA based on the starting VA.
> 
> > +static int uacce_unpin_page(struct uacce_pin_container *priv,
> > +   struct uacce_pin_address *addr)
> > +{
> > +   unsigned long first, last, nr_pages;
> > +   struct pin_pages *p;
> > +
> > +   first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +   last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +   nr_pages = last - first + 1;
> > +
> > +   /* find pin_pages */
> > +   p = xa_load(>array, first);
> > +   if (!p)
> > +   return -ENODEV;
> > +
> > +   if (p->nr_pages != nr_pages)
> > +   return -EINVAL;
> > +
> > +   /* unpin */
> > +   unpin_user_pages(p->pages, p->nr_pages);
> 
> And unpinning without guaranteeing there is no ongoing DMA is really
> weird

In the SVA case, the kernel has no idea whether accelerators are
accessing the memory, so I would assume SVA has a method to protect
the pages from migration or release. Otherwise, SVA would crash
easily in a system with high memory pressure.

Anyway, this is a problem worth further investigation.

> 
> Are you abusing this in conjunction with a SVA scheme just to prevent
> page motion? Why wasn't mlock good enough?

Page migration won't cause any malfunction in the SVA case, as an IO
page fault will get a valid page again. It is only a performance issue,
as an IO page fault has larger latency than the usual page fault; it
can be 3-80x slower than a normal page fault [1].

mlock, while certainly able to prevent swapping out, won't be able
to stop page movement due to:
* memory compaction in alloc_pages()
* making huge pages
* numa balance
* memory compaction in CMA
etc.

[1] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091
> 
> Jason

Thanks
Barry


RE: [PATCH] dma-mapping: benchmark: check the validity of dma mask bits

2020-12-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, December 19, 2020 7:10 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ; Dan
> Carpenter 
> Subject: Re: [PATCH] dma-mapping: benchmark: check the validity of dma mask
> bits
> 
> On 2020-12-12 10:18, Barry Song wrote:
> > While dma_mask_bits is larger than 64, the bahvaiour is undefined. On the
> > other hand, dma_mask_bits which is smaller than 20 (1MB) makes no sense
> > in real hardware.
> >
> > Reported-by: Dan Carpenter 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/map_benchmark.c | 6 ++
> >   1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > index b1496e744c68..19f661692073 100644
> > --- a/kernel/dma/map_benchmark.c
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -214,6 +214,12 @@ static long map_benchmark_ioctl(struct file *file,
> unsigned int cmd,
> > return -EINVAL;
> > }
> >
> > +   if (map->bparam.dma_bits < 20 ||
> 
> FWIW I don't think we need to bother with a lower limit here - it's
> unsigned, and a pointlessly small value will fail gracefully when we
> come to actually set the mask anyway. We only need to protect kernel
> code from going wrong, not userspace from being stupid to its own detriment.

I am not sure the kernel driver can reject a small dma mask bit if
drivers don't handle it properly.
A month ago, when I was debugging dma map benchmark, I set a value
less than 32 on devices behind arm-smmu-v3, and dma_set_mask() could
always succeed, but dma_map_single() was always failing.
At that time, I didn't debug this issue further. Not sure about the
latest status of the iommu driver.

drivers/iommu/intel/iommu.c used to have a dma_supported() to reject
small dma_mask:
static const struct dma_map_ops bounce_dma_ops = {
...
.dma_supported  = dma_direct_supported,
};
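
For context, dma_set_mask() is where a dma_supported() hook gets the
chance to reject an unusable mask; roughly (from kernel/dma/mapping.c,
simplified):

int dma_set_mask(struct device *dev, u64 mask)
{
	/*
	 * Truncate the mask to the actually supported dma_addr_t width
	 * to avoid generating unsupportable addresses.
	 */
	mask = (dma_addr_t)mask;

	if (!dev->dma_mask || !dma_supported(dev, mask))
		return -EIO;

	arch_dma_set_mask(dev, mask);
	*dev->dma_mask = mask;
	return 0;
}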


> 
> Robin.
> 
> > +   map->bparam.dma_bits > 64) {
> > +   pr_err("invalid dma_bits\n");
> > +   return -EINVAL;
> > +   }
> > +
> > if (map->bparam.node != NUMA_NO_NODE &&
> > !node_possible(map->bparam.node)) {
> > pr_err("invalid numa node\n");
> >

Thanks
Barry



RE: [PATCH v2] dma-mapping: add unlikely hint for error path in dma_mapping_error

2020-12-13 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Heiner Kallweit [mailto:hkallwe...@gmail.com]
> Sent: Monday, December 14, 2020 5:33 AM
> To: Christoph Hellwig ; Marek Szyprowski
> ; Robin Murphy ; Song Bao Hua
> (Barry Song) 
> Cc: open list:AMD IOMMU (AMD-VI) ; Linux
> Kernel Mailing List 
> Subject: [PATCH v2] dma-mapping: add unlikely hint for error path in
> dma_mapping_error
> 
> Zillions of drivers use the unlikely() hint when checking the result of
> dma_mapping_error(). This is an inline function anyway, so we can move
> the hint into this function and remove it from drivers.
> 
> Signed-off-by: Heiner Kallweit 

I am not sure this is really necessary. It seems the original code
is more readable: readers can more easily understand that we are
predicting the branch based on the return value of
dma_mapping_error().

Anyway, I don't object to this one. If other people like it, I am
also OK with it.

> ---
> v2:
> Split the big patch into the change for dma-mapping.h and follow-up
> patches per subsystem that will go through the trees of the respective
> maintainers.
> ---
>  include/linux/dma-mapping.h | 2 +-
>  kernel/dma/map_benchmark.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 2e49996a8..6177e20b5 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -95,7 +95,7 @@ static inline int dma_mapping_error(struct device *dev,
> dma_addr_t dma_addr)
>  {
>   debug_dma_mapping_error(dev, dma_addr);
> 
> - if (dma_addr == DMA_MAPPING_ERROR)
> + if (unlikely(dma_addr == DMA_MAPPING_ERROR))
>   return -ENOMEM;
>   return 0;
>  }
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index b1496e744..901420a5d 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -78,7 +78,7 @@ static int map_benchmark_thread(void *data)
> 
>   map_stime = ktime_get();
>   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
> - if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> + if (dma_mapping_error(map->dev, dma_addr)) {
>   pr_err("dma_map_single failed on %s\n",
>   dev_name(map->dev));
>   ret = -ENOMEM;
> --
> 2.29.2

Thanks
Barry



RE: [bug report] dma-mapping: add benchmark support for streaming DMA APIs

2020-12-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Wednesday, December 9, 2020 8:00 PM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org
> Subject: [bug report] dma-mapping: add benchmark support for streaming DMA 
> APIs
> 
> Hello Barry Song,
> 
> The patch 65789daa8087: "dma-mapping: add benchmark support for
> streaming DMA APIs" from Nov 16, 2020, leads to the following static
> checker warning:
> 
>   kernel/dma/map_benchmark.c:241 map_benchmark_ioctl()
>   error: undefined (user controlled) shift '1 << (map->bparam.dma_bits)'
> 
> kernel/dma/map_benchmark.c
>191  static long map_benchmark_ioctl(struct file *file, unsigned int cmd,
>192  unsigned long arg)
>193  {
>194  struct map_benchmark_data *map = file->private_data;
>195  void __user *argp = (void __user *)arg;
>196  u64 old_dma_mask;
>197
>198  int ret;
>199
    >200  if (copy_from_user(&map->bparam, argp, sizeof(map->bparam)))
>^
> Comes from the user
> 
>201  return -EFAULT;
>202
>203  switch (cmd) {
>204  case DMA_MAP_BENCHMARK:
>205  if (map->bparam.threads == 0 ||
>206  map->bparam.threads > DMA_MAP_MAX_THREADS) {
>207  pr_err("invalid thread number\n");
>208  return -EINVAL;
>209  }
>210
>211  if (map->bparam.seconds == 0 ||
>212  map->bparam.seconds > DMA_MAP_MAX_SECONDS) {
>213  pr_err("invalid duration seconds\n");
>214  return -EINVAL;
>215  }
>216
>217  if (map->bparam.node != NUMA_NO_NODE &&
>218  !node_possible(map->bparam.node)) {
>219  pr_err("invalid numa node\n");
>220  return -EINVAL;
>221  }
>222
>223  switch (map->bparam.dma_dir) {
>224  case DMA_MAP_BIDIRECTIONAL:
>225  map->dir = DMA_BIDIRECTIONAL;
>226  break;
>227  case DMA_MAP_FROM_DEVICE:
>228  map->dir = DMA_FROM_DEVICE;
>229  break;
>230  case DMA_MAP_TO_DEVICE:
>231  map->dir = DMA_TO_DEVICE;
>232  break;
>233  default:
>234  pr_err("invalid DMA direction\n");
>235  return -EINVAL;
>236  }
>237
>238  old_dma_mask = dma_get_mask(map->dev);
>239
>240  ret = dma_set_mask(map->dev,
    >241 DMA_BIT_MASK(map->bparam.dma_bits));
>^^
> If this is more than 31 then the behavior is undefined (but in real life
> it will shift wrap).

Guess it should be no more than 64? For 64, it would be ~0ULL;
otherwise, it will be (1ULL << bits) - 1:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=7679325702

I have some code like:
+   /* suppose the minimum DMA zone is 1MB in the world */
+   if (bits < 20 || bits > 64) {
+   fprintf(stderr, "invalid dma mask bit, must be in 20-64\n");
+   exit(1);
+   }

Maybe I should do the same thing in kernel as well.
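
For reference, DMA_BIT_MASK() in include/linux/dma-mapping.h already
special-cases 64:

#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))

so only bits > 64 (the undefined shift) and unrealistically small
values need to be rejected.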

> 
>242  if (ret) {
>243  pr_err("failed to set dma_mask on device 
> %s\n",
>244  dev_name(map->dev));
>245  return -EINVAL;
>246  }
>247
>248  ret = do_map_benchmark(map);
>249
>250  /*
    >251   * restore the original dma_mask as many devices' dma_mask are
    >252   * set by architectures, acpi, busses. When we bind them back
    >253   * to their original drivers, those drivers shouldn't see
>254   * dma_mask changed by benchmark
>255   */
>256  dma_set_mask(map->dev, old_dma_mask);
>257  break;
> 

RE: [PATCH] dma-mapping: Fix sizeof() mismatch on tsk allocation

2020-11-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 26, 2020 3:05 AM
> To: Song Bao Hua (Barry Song) ; Christoph
> Hellwig ; Marek Szyprowski ;
> Robin Murphy ; iommu@lists.linux-foundation.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] dma-mapping: Fix sizeof() mismatch on tsk allocation
> 
> From: Colin Ian King 
> 
> An incorrect sizeof() is being used, sizeof(tsk) is not correct, it should
> be sizeof(*tsk). Fix it.
> 
> Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)")
> Fixes: bfd2defed94d ("dma-mapping: add benchmark support for streaming
> DMA APIs")
> Signed-off-by: Colin Ian King 
> ---
>  kernel/dma/map_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e1e37603d01b..b1496e744c68 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -121,7 +121,7 @@ static int do_map_benchmark(struct
> map_benchmark_data *map)
>   int ret = 0;
>   int i;
> 
> - tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
> + tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);

The size is the same, but the change is correct.
Acked-by: Barry Song 

>   if (!tsk)
>   return -ENOMEM;
> 
Thanks
Barry




RE: [PATCH] dma-mapping: fix an uninitialized pointer read due to typo in argp assignment

2020-11-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 26, 2020 2:56 AM
> To: Song Bao Hua (Barry Song) ; Christoph
> Hellwig ; Marek Szyprowski ;
> Robin Murphy ; iommu@lists.linux-foundation.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] dma-mapping: fix an uninitialized pointer read due to typo in
> argp assignment
> 
> From: Colin Ian King 
> 
> The assignment of argp is currently using argp as the source because of
> a typo. Fix this by assigning it the value passed in arg instead.
> 
> Addresses-Coverity: ("Uninitialized pointer read")
> Fixes: bfd2defed94d ("dma-mapping: add benchmark support for streaming
> DMA APIs")
> Signed-off-by: Colin Ian King 

Acked-by: Barry Song 

> ---
>  kernel/dma/map_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index ca616b664f72..e1e37603d01b 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -192,7 +192,7 @@ static long map_benchmark_ioctl(struct file *file,
> unsigned int cmd,
>   unsigned long arg)
>  {
>   struct map_benchmark_data *map = file->private_data;
> - void __user *argp = (void __user *)argp;
> + void __user *argp = (void __user *)arg;
>   u64 old_dma_mask;
> 
>   int ret;
> --
> 2.29.2

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-15 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Sunday, November 15, 2020 9:45 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; iommu@lists.linux-foundation.org;
> robin.mur...@arm.com; m.szyprow...@samsung.com; Linuxarm
> ; linux-kselft...@vger.kernel.org; xuwei (O)
> ; Joerg Roedel ; Will Deacon
> ; Shuah Khan 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On Sun, Nov 15, 2020 at 12:11:15AM +, Song Bao Hua (Barry Song)
> wrote:
> >
> > Checkpatch has changed 80 to 100. That's probably why my local checkpatch
> didn't report any warning:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=
> bdc48fa11e46f867ea4d
> >
> > I am happy to change them to be less than 80 if you like.
> 
> Don't rely on checkpath, is is broken.  Look at the codingstyle document.
> 
> > > I think this needs to set a dma mask as behavior for unlimited dma
> > > mask vs the default 32-bit one can be very different.
> >
> > I actually prefer users bind real devices with real dma_mask to test rather
> than force to change
> > the dma_mask in this benchmark.
> 
> The mask is set by the driver, not the device.  So you need to set when
> when you bind, real device or not.

Yep while it is a little bit tricky.

Sometimes, it is done by "device" in architectures, e.g. there are lots of
dma_mask configuration code in arch/arm/mach-xxx.
arch/arm/mach-davinci/da850.c
static u64 da850_vpif_dma_mask = DMA_BIT_MASK(32);
static struct platform_device da850_vpif_dev = {
.name   = "vpif",
.id = -1,
.dev= {
.dma_mask   = &da850_vpif_dma_mask,
.coherent_dma_mask  = DMA_BIT_MASK(32),
},
.resource   = da850_vpif_resource,
.num_resources  = ARRAY_SIZE(da850_vpif_resource),
};

Sometimes, it is done by "of" or "acpi", for example:
drivers/acpi/arm64/iort.c
void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
{
u64 end, mask, dmaaddr = 0, size = 0, offset = 0;
int ret;

...

ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
if (!ret) {
/*
 * Limit coherent and dma mask based on size retrieved from
 * firmware.
 */
end = dmaaddr + size - 1;
mask = DMA_BIT_MASK(ilog2(end) + 1);
dev->bus_dma_limit = end;
dev->coherent_dma_mask = mask;
*dev->dma_mask = mask;
}
...
}

Sometimes, it is done by "bus", for example, ISA:
isa_dev->dev.coherent_dma_mask = DMA_BIT_MASK(24);
isa_dev->dev.dma_mask = &isa_dev->dev.coherent_dma_mask;

error = device_register(&isa_dev->dev);
if (error) {
put_device(&isa_dev->dev);
break;
}

And in many cases, it is done by the driver. On the ARM64 server
platform I am testing, drivers actually rarely set dma_mask.

So to make the dma benchmark work on all platforms, it seems worth
adding a dma_mask_bit parameter. But, in order to avoid breaking the
dma_mask of those devices whose dma_mask is set by the architecture,
ACPI or the bus, it seems we need to do the below in dma_benchmark:

u64 old_mask;

old_mask = dma_get_mask(dev);

dma_set_mask(dev, new_mask);

do_map_benchmark();

/* restore the old dma_mask so that the dma_mask of the device is not
changed by the benchmark when it is bound back to its original driver */
dma_set_mask(dev, old_mask);

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-14 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Sunday, November 15, 2020 5:54 AM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com; Linuxarm ;
> linux-kselft...@vger.kernel.org; xuwei (O) ; Joerg
> Roedel ; Will Deacon ; Shuah Khan
> 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> Lots of > 80 char lines.  Please fix up the style.

Checkpatch has changed 80 to 100. That's probably why my local checkpatch 
didn't report any warning:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d

I am happy to change them to be less than 80 if you like.

> 
> I think this needs to set a dma mask as behavior for unlimited dma
> mask vs the default 32-bit one can be very different. 

I actually prefer that users bind real devices with their real dma_mask
to test, rather than forcing a change of the dma_mask in this benchmark.

Some devices might have a 32-bit dma_mask while others might have an
unlimited one. But both of them can bind to this driver or unbind from
it after the test is done. So users just need to bind those different
real devices, with their different real dma_mask, to dma_benchmark.

This can reflect the real performance of the real device better, I think.

> I also think you need to be able to pass the direction or have different tests
> for directions.  bidirectional is not exactly heavily used and pays
> more cache management penality.

For this, I'd like to add a direction option in the test app and pass
the option to the benchmark driver.

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-11 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Wednesday, November 11, 2020 10:37 PM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: linux-kselft...@vger.kernel.org; Will Deacon ; Joerg
> Roedel ; Linuxarm ; xuwei (O)
> ; Shuah Khan 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 11/11/2020 01:29, Song Bao Hua (Barry Song) wrote:
> > I'd like to think checking this here would be overdesign. We just give 
> > users the
> > freedom to bind any device they care about to the benchmark driver. Usually
> > that means a real hardware either behind an IOMMU or through a direct
> > mapping.
> >
> > if for any reason users put a wrong "device", that is the choice of users.
> 
> Right, but if the device simply has no DMA ops supported, it could be
> better to fail the probe rather than let them try the test at all.
> 
>   Anyhow,
> > the below code will still handle it properly and users will get a report in 
> > which
> > everything is zero.
> >
> > +static int map_benchmark_thread(void *data)
> > +{
> > ...
> > +   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE,
> DMA_BIDIRECTIONAL);
> > +   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> 
> Doing this is proper, but I am not sure if this tells the user the real
> problem.

Telling users the real problem isn't the design intention of this
benchmark; that has never been its purpose.

> 
> > +   pr_err("dma_map_single failed on %s\n",
> dev_name(map->dev));
> 
> Not sure why use pr_err() over dev_err().

We are reporting errors in dma-benchmark driver rather than reporting errors
in the driver of the specific device. I think we should have "dma-benchmark"
as the prefix while printing the name of the device by dev_name().

> 
> > +   ret = -ENOMEM;
> > +   goto out;
> > +   }
> 
> Thanks,
> John

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Tuesday, November 10, 2020 9:39 PM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: linux-kselft...@vger.kernel.org; Will Deacon ; Joerg
> Roedel ; Linuxarm ; xuwei (O)
> ; Shuah Khan 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 10/11/2020 08:10, Song Bao Hua (Barry Song) wrote:
> > Hello Robin, Christoph,
> > Any further comment? John suggested that "depends on DEBUG_FS" should
> be added in Kconfig.
> > I am collecting more comments to send v4 together with fixing this minor
> issue :-)
> >
> > Thanks
> > Barry
> >
> >> -Original Message-
> >> From: Song Bao Hua (Barry Song)
> >> Sent: Monday, November 2, 2020 9:07 PM
> >> To: iommu@lists.linux-foundation.org; h...@lst.de;
> robin.mur...@arm.com;
> >> m.szyprow...@samsung.com
> >> Cc: Linuxarm ; linux-kselft...@vger.kernel.org;
> xuwei
> >> (O) ; Song Bao Hua (Barry Song)
> >> ; Joerg Roedel ; Will
> Deacon
> >> ; Shuah Khan 
> >> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming
> >> DMA APIs
> >>
> >> Nowadays, there are increasing requirements to benchmark the
> performance
> >> of dma_map and dma_unmap particually while the device is attached to an
> >> IOMMU.
> >>
> >> This patch enables the support. Users can run specified number of threads
> to
> >> do dma_map_page and dma_unmap_page on a specific NUMA node with
> the
> >> specified duration. Then dma_map_benchmark will calculate the average
> >> latency for map and unmap.
> >>
> >> A difficulity for this benchmark is that dma_map/unmap APIs must run on a
> >> particular device. Each device might have different backend of IOMMU or
> >> non-IOMMU.
> >>
> >> So we use the driver_override to bind dma_map_benchmark to a particual
> >> device by:
> >> For platform devices:
> >> echo dma_map_benchmark >
> /sys/bus/platform/devices/xxx/driver_override
> >> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> >> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >>
> 
> Hi Barry,
> 
> >> For PCI devices:
> >> echo dma_map_benchmark >
> >> /sys/bus/pci/devices/:00:01.0/driver_override
> >> echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 >
> >> /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Do we need to check if the device to which we attach actually has DMA
> mapping capability?

Hello John,

I'd like to think checking this here would be overdesign. We just give
users the freedom to bind any device they care about to the benchmark
driver. Usually that means a real hardware device, either behind an
IOMMU or through a direct mapping.

If for any reason users bind a wrong "device", that is the users'
choice. Anyhow, the below code will still handle it properly and users
will get a report in which everything is zero.

+static int map_benchmark_thread(void *data)
+{
...
+   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+   pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
+   ret = -ENOMEM;
+   goto out;
+   }
...
+}

> 
> >>
> >> Cc: Joerg Roedel 
> >> Cc: Will Deacon 
> >> Cc: Shuah Khan 
> >> Cc: Christoph Hellwig 
> >> Cc: Marek Szyprowski 
> >> Cc: Robin Murphy 
> >> Signed-off-by: Barry Song 
> >> ---
> 
> Thanks,
> John

Thanks
Barry



RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-10 Thread Song Bao Hua (Barry Song)
Hello Robin, Christoph,
Any further comment? John suggested that "depends on DEBUG_FS" should
be added in Kconfig.
I am collecting more comments to send v4 together with fixing this
minor issue :-)

Thanks
Barry

> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Monday, November 2, 2020 9:07 PM
> To: iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: Linuxarm ; linux-kselft...@vger.kernel.org; xuwei
> (O) ; Song Bao Hua (Barry Song)
> ; Joerg Roedel ; Will Deacon
> ; Shuah Khan 
> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> Nowadays, there are increasing requirements to benchmark the performance
> of dma_map and dma_unmap particually while the device is attached to an
> IOMMU.
> 
> This patch enables the support. Users can run specified number of threads to
> do dma_map_page and dma_unmap_page on a specific NUMA node with the
> specified duration. Then dma_map_benchmark will calculate the average
> latency for map and unmap.
> 
> A difficulity for this benchmark is that dma_map/unmap APIs must run on a
> particular device. Each device might have different backend of IOMMU or
> non-IOMMU.
> 
> So we use the driver_override to bind dma_map_benchmark to a particual
> device by:
> For platform devices:
> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> 
> For PCI devices:
> echo dma_map_benchmark >
> /sys/bus/pci/devices/:00:01.0/driver_override
> echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo :00:01.0 >
> /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Cc: Joerg Roedel 
> Cc: Will Deacon 
> Cc: Shuah Khan 
> Cc: Christoph Hellwig 
> Cc: Marek Szyprowski 
> Cc: Robin Murphy 
> Signed-off-by: Barry Song 
> ---
> -v3:
>   * fix build issues reported by 0day kernel test robot
> -v2:
>   * add PCI support; v1 supported platform devices only
>   * replace ssleep by msleep_interruptible() to permit users to exit
> benchmark before it is completed
>   * many changes according to Robin's suggestions, thanks! Robin
> - add standard deviation output to reflect the worst case
> - check users' parameters strictly like the number of threads
> - make cache dirty before dma_map
> - fix unpaired dma_map_page and dma_unmap_single;
> - remove redundant "long long" before ktime_to_ns();
> - use devm_add_action()
> 
>  kernel/dma/Kconfig |   8 +
>  kernel/dma/Makefile|   1 +
>  kernel/dma/map_benchmark.c | 296
> +
>  3 files changed, 305 insertions(+)
>  create mode 100644 kernel/dma/map_benchmark.c
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index c99de4a21458..949c53da5991 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> is technically out-of-spec.
> 
> If unsure, say N.
> +
> +config DMA_MAP_BENCHMARK
> + bool "Enable benchmarking of streaming DMA mapping"
> + help
> +   Provides /sys/kernel/debug/dma_map_benchmark that helps with
> testing
> +   performance of dma_(un)map_page.
> +
> +   See tools/testing/selftests/dma/dma_map_benchmark.c
> diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> index dc755ab68aab..7aa6b26b1348 100644
> --- a/kernel/dma/Makefile
> +++ b/kernel/dma/Makefile
> @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o
>  obj-$(CONFIG_SWIOTLB)+= swiotlb.o
>  obj-$(CONFIG_DMA_COHERENT_POOL)  += pool.o
>  obj-$(CONFIG_DMA_REMAP)  += remap.o
> +obj-$(CONFIG_DMA_MAP_BENCHMARK)  += map_benchmark.o
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> new file mode 100644
> index 000000000000..dc4e5ff48a2d
> --- /dev/null
> +++ b/kernel/dma/map_benchmark.c
> @@ -0,0 +1,296 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Hisilicon Limited.
> + */
> +
> +#define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> +#define DMA_MAP_MAX_THREADS  1024
> +#define DMA_MAP_MAX_SECONDS  300
> +
> +struct map_benchmark {
> + __u64 avg_map_100ns; /* average map latency in 100ns */

RE: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-11-02 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Monday, November 2, 2020 10:19 PM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; robin.mur...@arm.com;
> m.szyprow...@samsung.com
> Cc: linux-kselft...@vger.kernel.org; Shuah Khan ; Joerg
> Roedel ; Linuxarm ; xuwei (O)
> ; Will Deacon 
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 02/11/2020 08:06, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap particually while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node with
> the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulity for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have different backend of IOMMU or
> > non-IOMMU.
> >
> > So we use the driver_override to bind dma_map_benchmark to a particual
> > device by:
> > For platform devices:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For PCI devices:
> > echo dma_map_benchmark >
> /sys/bus/pci/devices/:00:01.0/driver_override
> > echo :00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> > echo :00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> >
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Shuah Khan 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> > -v3:
> >* fix build issues reported by 0day kernel test robot
> > -v2:
> >* add PCI support; v1 supported platform devices only
> >* replace ssleep by msleep_interruptible() to permit users to exit
> >  benchmark before it is completed
> >* many changes according to Robin's suggestions, thanks! Robin
> >  - add standard deviation output to reflect the worst case
> >  - check users' parameters strictly like the number of threads
> >  - make cache dirty before dma_map
> >  - fix unpaired dma_map_page and dma_unmap_single;
> >  - remove redundant "long long" before ktime_to_ns();
> >  - use devm_add_action()
> >
> >   kernel/dma/Kconfig |   8 +
> >   kernel/dma/Makefile|   1 +
> >   kernel/dma/map_benchmark.c | 296
> +
> >   3 files changed, 305 insertions(+)
> >   create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> >   is technically out-of-spec.
> >
> >   If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > +   bool "Enable benchmarking of streaming DMA mapping"
> > +   help
> > + Provides /sys/kernel/debug/dma_map_benchmark that helps with
> testing
> > + performance of dma_(un)map_page.
> 
> Since this is a driver, any reason for which it cannot be loadable? If
> so, it seems any functionality would depend on DEBUG FS, I figure that's
> just how we work for debugfs.

We depend on kthread_bind_mask(), which isn't an exported symbol.
Maybe it's worth sending a patch to export it?
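
That would just be a one-line change next to its definition in
kernel/kthread.c, e.g.:

EXPORT_SYMBOL_GPL(kthread_bind_mask);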

> 
> Thanks,
> John
> 
> > +
> > + See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile

Thanks
Barry



RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, October 31, 2020 4:48 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> [...]
> >>> +struct map_benchmark {
> >>> + __u64 map_nsec;
> >>> + __u64 unmap_nsec;
> >>> + __u32 threads; /* how many threads will do map/unmap in parallel
> */
> >>> + __u32 seconds; /* how long the test will last */
> >>> + int node; /* which numa node this benchmark will run on */
> >>> + __u64 expansion[10];/* For future use */
> >>> +};
> >>
> >> I'm no expert on userspace ABIs (and what little experience I do have
> >> is mostly of Win32...), so hopefully someone else will comment if
> >> there's anything of concern here. One thing I wonder is that there's
> >> a fair likelihood of functionality evolving here over time, so might
> >> it be appropriate to have some sort of explicit versioning parameter
> >> for robustness?
> >
> > I copied that from gup_benchmark. There is no this kind of code to
> > compare version.
> > I believe there is a likelihood that kernel module is changed but
> > users are still using old userspace tool, this might lead to the
> > incompatible data structure.
> > But not sure if it is a big problem :-)
> 
> Yeah, like I say I don't really have a good feeling for what would be best here,
> I'm just thinking of what I do know and wary of the potential for a "640 bits
> ought to be enough for anyone" issue ;)
> 
> >>> +struct map_benchmark_data {
> >>> + struct map_benchmark bparam;
> >>> + struct device *dev;
> >>> + struct dentry  *debugfs;
> >>> + atomic64_t total_map_nsecs;
> >>> + atomic64_t total_map_loops;
> >>> + atomic64_t total_unmap_nsecs;
> >>> + atomic64_t total_unmap_loops;
> >>> +};
> >>> +
> >>> +static int map_benchmark_thread(void *data) {
> >>> + struct page *page;
> >>> + dma_addr_t dma_addr;
> >>> + struct map_benchmark_data *map = data;
> >>> + int ret = 0;
> >>> +
> >>> + page = alloc_page(GFP_KERNEL);
> >>> + if (!page)
> >>> + return -ENOMEM;
> >>> +
> >>> + while (!kthread_should_stop())  {
> >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> >>> +
> >>> + map_stime = ktime_get();
> >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> >> DMA_BIDIRECTIONAL);
> >>
> >> Note that for a non-coherent device, this will give an underestimate
> >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> >> since the page will never be dirty in the cache (except possibly the
> >> very first time through).
> >
> > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > after we have this basic framework.
> 
> That wasn't so much about the direction itself, just that if it's anything other
> than FROM_DEVICE, we should probably do something to dirty the buffer by a
> reasonable amount before each map. Otherwise the measured performance is
> going to be unrealistic on many systems.

Maybe putting a memset(buf, 0, PAGE_SIZE) before dma_map would help?
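
Something like the below in the benchmark thread, I mean -- just a sketch of the
idea, not a tested patch; the 0x55 fill pattern and dirtying the whole page are
arbitrary choices:

	while (!kthread_should_stop()) {
		ktime_t map_stime, map_etime;

		/*
		 * Dirty the buffer before each map so that, on non-coherent
		 * devices, the cache maintenance cost of TO_DEVICE or
		 * BIDIRECTIONAL mappings is actually exercised.
		 */
		memset(page_address(page), 0x55, PAGE_SIZE);

		map_stime = ktime_get();
		dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
					DMA_BIDIRECTIONAL);
		map_etime = ktime_get();
		...
	}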

> 
> [...]
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)),
> >>> +              &map->total_unmap_nsecs);
> >>> + atomic64_inc(&map->total_map_loops);
> >>> + atomic64_inc(&map->total_unmap_loops);
> >>
> >> I think it would be worth keeping track of the variances as well - it
> >> can be hard to tell if a reasonable-looking average is hiding
> >> terrible worst-case behaviour.
> >
> > This is a sensible requirement. I believe it is better to be handled
> > by the existing kernel tracing method.
> >
> > Maybe we need a histogram like:
> > Delay   sample count
> > 1-2us   1000  ***
> > 2-3us   2000  ***
> > 3-4us   100   *
> > ...
> > This will be more precise than the maximum latency in the worst case.
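
If we do want the variance inside the benchmark itself rather than via tracing,
a rough sketch would be to accumulate a sum of squares next to the existing
counters and let the reporting side derive the standard deviation; sum_map_ns
and sum_sq_map_ns below are hypothetical fields, not what the patch currently has:

	/* in map_benchmark_thread(), after timing one dma_map_page() */
	s64 map_ns = ktime_to_ns(ktime_sub(map_etime, map_stime));

	atomic64_add(map_ns, &map->sum_map_ns);              /* sum of samples */
	atomic64_add(map_ns * map_ns, &map->sum_sq_map_ns);  /* sum of squares */
	atomic64_inc(&map->total_map_loops);

	/*
	 * when reporting:
	 *   mean     = sum / loops;
	 *   variance = sum_sq / loops - mean * mean;
	 *   stddev   = int_sqrt64(variance);
	 */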

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, October 30, 2020 8:38 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-27 03:53, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap, particularly while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run a specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node with the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulty for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have a different backend, IOMMU or
> > non-IOMMU.
> >
> > So we use driver_override to bind dma_map_benchmark to a particular
> > device by:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For this moment, it supports platform devices only; PCI devices will also
> > be supported afterwards.
> 
> Neat! This is something I've thought about many times, but never got
> round to attempting :)

I am happy you have the same needs. When I came to the IOMMU area half a year ago,
the first thing I did was write a rough benchmark. At that time, I hacked the
kernel to get a device behind an IOMMU.

Recently, I got some time to think about how to get a "device" without ugly
hacking, and then cleaned the code up to send patches out, providing a common
benchmark that everybody can use.

> 
> I think the basic latency measurement for mapping and unmapping pages is
> enough to start with, but there are definitely some more things that
> would be interesting to look into for future enhancements:
> 
>   - a choice of mapping sizes, both smaller and larger than one page, to
> help characterise stuff like cache maintenance overhead and bounce
> buffer/IOVA fragmentation.
>   - alternative allocation patterns like doing lots of maps first, then
> all their corresponding unmaps (to provoke things like the worst-case
> IOVA rcache behaviour).
>   - ways to exercise a range of those parameters at once across
> different threads in a single test.
> 

Yes, sure. Once we have a basic framework, we can add more benchmark patterns
by using different parameters in the userspace tool:
tools/testing/selftests/dma/dma_map_benchmark.c

Similar function extensions have been carried out in GUP_BENCHMARK.

> But let's get a basic framework nailed down first...

Sure.

> 
> > Cc: Joerg Roedel 
> > Cc: Will Deacon 
> > Cc: Shuah Khan 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/Kconfig         |   8 ++
> >   kernel/dma/Makefile        |   1 +
> >   kernel/dma/map_benchmark.c | 202 +
> >   3 files changed, 211 insertions(+)
> >   create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> >   is technically out-of-spec.
> >
> >   If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > +   bool "Enable benchmarking of streaming DMA mapping"
> > +   help
> > + Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> > + performance of dma_(un)map_page.
> > +
> > + See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile
> > @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)   += debug.o
> >   obj-$(CONFIG_SWIOTLB) += swiotlb.o
> >   obj-$(CONFIG_DMA_COHERENT_POOL)   += pool.o
> >   obj-$(CONFIG_DMA_REMAP)   += remap.o
> > +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c

RE: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA

2020-10-27 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: h...@lst.de [mailto:h...@lst.de]
> Sent: Tuesday, October 27, 2020 8:55 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Robin Murphy ; h...@lst.de;
> iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA
> 
> On Mon, Oct 26, 2020 at 08:07:43PM +0000, Song Bao Hua (Barry Song)
> wrote:
> > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > > index c99de4a21458..964b74c9b7e3 100644
> > > --- a/kernel/dma/Kconfig
> > > +++ b/kernel/dma/Kconfig
> > > @@ -125,7 +125,8 @@ if  DMA_CMA
> > >
> > >  config DMA_PERNUMA_CMA
> > >   bool "Enable separate DMA Contiguous Memory Area for each NUMA
> > > Node"
> > > - default NUMA && ARM64
> > > + depends on NUMA
> > > + default ARM64
> >
> > On the other hand, at this moment, only ARM64 is calling the init code
> > to get per_numa cma. Do we need to
> > depends on NUMA && ARM64 ?
> > so that this is not enabled by non-arm64?
> 
> I actually hate having arch symbols in common code.  A new
> ARCH_HAS_DMA_PERNUMA_CMA, only selected by arm64 for now would be
> more
> clean I think.

Sounds good to me.

BTW, +Will.

Last time we talked about the default pernuma CMA size, you suggested a bootarg
in arch/arm64/Kconfig, but Will seems to have a different idea. Am I right, Will?

Would we let aarch64 call dma_pernuma_cma_reserve(16MB) rather than
dma_pernuma_cma_reserve()?

In this way, users will at least get a default pernuma CMA, which is required
at least by the IOMMU. If users set a "cma_pernuma" bootarg, it would overwrite
the default size from the aarch64 code.

I mean

- void __init dma_pernuma_cma_reserve(void)
+ void __init dma_pernuma_cma_reserve(size_t size)
  {
	if (!pernuma_size_bytes)
+		pernuma_size_bytes = size;
	...
  }

Right now, it is easy for users to forget to set cma_pernuma in bootargs, in
which case this feature will probably never be enabled.
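
In code, the aarch64 side would then look something like this (illustrative only;
whether bootmem_init() is the right call site is part of the question):

	/* arch/arm64/mm/init.c -- hypothetical caller with a 16MB default */
	void __init bootmem_init(void)
	{
		...
		/*
		 * 16MB is used unless the user passes cma_pernuma= on the
		 * command line, in which case the bootarg wins.
		 */
		dma_pernuma_cma_reserve(SZ_16M);
		...
	}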

Thanks
Barry


RE: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA

2020-10-26 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Tuesday, October 27, 2020 1:25 AM
> To: h...@lst.de
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; Song Bao
> Hua (Barry Song) 
> Subject: [PATCH] dma: Per-NUMA-node CMA should depend on NUMA
> 
> Offering DMA_PERNUMA_CMA to non-NUMA configs is pointless.
> 

This is right.

> Signed-off-by: Robin Murphy 
> ---
>  kernel/dma/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index c99de4a21458..964b74c9b7e3 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -125,7 +125,8 @@ if  DMA_CMA
> 
>  config DMA_PERNUMA_CMA
>   bool "Enable separate DMA Contiguous Memory Area for each NUMA
> Node"
> - default NUMA && ARM64
> + depends on NUMA
> + default ARM64

On the other hand, at this moment, only ARM64 is calling the init code
to get per_numa cma. Do we need to
depends on NUMA && ARM64 ?
so that this is not enabled by non-arm64?

>   help
> Enable this option to get pernuma CMA areas so that devices like
> ARM64 SMMU can get local memory by DMA coherent APIs.
> --
 
Thanks
Barry




RE: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency

2020-09-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of John Garry
> Sent: Saturday, August 22, 2020 1:54 AM
> To: w...@kernel.org; robin.mur...@arm.com
> Cc: j...@8bytes.org; linux-arm-ker...@lists.infradead.org;
> iommu@lists.linux-foundation.org; m...@kernel.org; Linuxarm
> ; linux-ker...@vger.kernel.org; John Garry
> 
> Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
> 
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many CPUs
> contending - from experiment, for 64 CPUs contending the cmdq, success rate
> is ~ 1 in 12, which is poor, but not totally awful.
> 
> This series removes that cmpxchg() and replaces with an atomic_add, same as
> how the actual cmdq deals with maintaining the prod pointer.
> 
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1250K IOPs
> After: 1550K IOPs
> 
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
> 
> CPU count  8     16    32    64
> Before:    282K  115K  36K   11K
> After:     302K  193K  80K   30K
> 
> (unit is map+unmaps per CPU per second)

I have seen performance improvement on the hns3 network device by sending UDP
with 1-32 threads:

Threads number          1        4         8         16        32
Before patch (TX Mbps)  7636.05  16444.36  21694.48  25746.40  25295.93
After  patch (TX Mbps)  7711.60  16478.98  26561.06  32628.75  33764.56

As you can see, for 8, 16 and 32 threads, network TX throughput improves a lot.
For 1 and 4 threads, TX throughput is almost the same before and after the patch.
This should be expected, as this patch is mainly for decreasing the lock contention.
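
For anyone following along, the core of the change is roughly the difference
below. This is heavily simplified from the real cmdq code (compute_new_prod()
is a made-up helper), so take it as an illustration of the idea only:

	/*
	 * before: every CPU retries until its cmpxchg wins; under heavy
	 * contention most iterations are wasted work (~1 in 12 success
	 * for 64 CPUs, per the numbers above)
	 */
	do {
		old = READ_ONCE(llq->val);
		new = compute_new_prod(old, n);
	} while (cmpxchg_relaxed(&llq->val, old, new) != old);

	/*
	 * after: each CPU claims its n slots with one atomic add, which
	 * always succeeds on the first attempt
	 */
	old = atomic64_fetch_add(n, &llq->atomic_val);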

> 
> [0]
> https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3
> 
> Differences to v1:
> - Simplify by dropping patch to always issue a CMD_SYNC
> - Use 64b atomic add, keeping prod in a separate 32b field
> 
> John Garry (2):
>   iommu/arm-smmu-v3: Calculate max commands per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in
> arm_smmu_cmdq_issue_cmdlist()
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166 ++--
>  1 file changed, 114 insertions(+), 52 deletions(-)
> 
> --
> 2.26.2

Thanks
Barry



RE: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Friday, August 21, 2020 6:19 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; huangdaode 
> Subject: Re: [PATCH v6 0/2] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> FYI, as of the last one I'm fine now, but I really need an ACK from
> the arm64 maintainers.

Hi Christoph,

For the changes in arch/arm64, Will gave his ack here:
https://lore.kernel.org/linux-iommu/20200821090116.GB20255@willie-the-truck/

and the patchset has been refined to v8
https://lore.kernel.org/linux-iommu/20200823230309.28980-1-song.bao@hisilicon.com/
with one additional patch to remove magic number:
[PATCH v8 3/3] mm: cma: use CMA_MAX_NAME to define the length of cma name array
https://lore.kernel.org/linux-iommu/20200823230309.28980-4-song.bao@hisilicon.com/

Hopefully you didn't miss it :-)
Does the new one need an Ack from a linux-mm maintainer?

Thanks
Barry



RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, August 28, 2020 11:18 PM
> To: Song Bao Hua (Barry Song) ; Will Deacon
> 
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> j...@8bytes.org; Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> cmdq_issue_cmdlist
> 
> On 2020-08-28 12:02, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Will Deacon [mailto:w...@kernel.org]
> >> Sent: Friday, August 28, 2020 10:29 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: iommu@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org;
> >> robin.mur...@arm.com; j...@8bytes.org; Linuxarm
> 
> >> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> >> cmdq_issue_cmdlist
> >>
> >> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote:
> >>> cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch
> >>> adds tracepoints for it to help debug.
> >>>
> >>> Signed-off-by: Barry Song 
> >>> ---
> >>>   * can furthermore develop an eBPF program to benchmark using this
> trace
> >>
> >> Hmm, don't these things have a history of becoming ABI? If so, I don't
> >> really want them in the driver at all, sorry. Do other drivers overcome
> >> this somehow?
> >
> > This kind of tracepoint mainly works as a low-overhead probe point for
> > debugging purposes. I don't think any application would depend on it. It is
> > for debugging, and there are lots of tracepoints in other drivers, even in
> > the iommu driver core and the intel_iommu driver :-)
> >
> > Developers use it in one of the below ways:
> >
> > 1. get trace print from the ring buffer by reading debugfs
> > root@ubuntu:/sys/kernel/debug/tracing/events/arm_smmu_v3# echo 1 > enable
> > # cat /sys/kernel/debug/tracing/trace_pipe
> > <idle>-0  [058] ..s1 125444.768083: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768084: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768085: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768165: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768168: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768169: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768171: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
> > <idle>-0  [058] ..s1 125444.768259: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
> > ...
> >
> > This can replace printk with much, much lower overhead.
> >
> > 2. add a hook function to the tracepoint to do some latency measurement and
> > time statistics, just like the eBPF example I gave after the commit log.
> >
> > Using it, I can get the histogram of the execution time of cmdq_issue_cmdlist():
> >      nsecs         : count    distribution
> >        0 -> 1      : 0        |                                        |
> >        2 -> 3      : 0        |                                        |
> >        4 -> 7      : 0        |                                        |
> >        8 -> 15     : 0        |                                        |
> >       16 -> 31     : 0        |                                        |
> >       32 -> 63     : 0        |                                        |
> >       64 -> 127    : 0        |                                        |
> >      128 -> 255    : 0        |                                        |
> >      256 -> 511    : 0        |                                        |
> >      512 -> 1023   : 58       |                                        |
> >     1024 -> 2047   : 22763    |****************************************|
> >     2048 -> 4095   : 13238    |***********************                 |
> >
> > I feel it is very common to do this kind of thing for analyzing performance
> > issues. For example, to ease the analysis of softirq latency, softirq.c has
> > the below code:
> >
> > asmlinkage __visible void __softirq_entry __do_softirq(void)
> > {
> > ...
> > trace_softirq_entry(vec_nr);
> > h->action(h);
> > trace_softirq_exit(vec_nr);
> > ...
> > }
> 
> If you only want to measure entry and exit of one specific function,
> though, can't the function graph tracer already do that?

Function graph is able to do this specific thing.

RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 28, 2020 10:29 PM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> robin.mur...@arm.com; j...@8bytes.org; Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> cmdq_issue_cmdlist
> 
> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote:
> > cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This patch
> > adds tracepoints for it to help debug.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  * can furthermore develop an eBPF program to benchmark using this trace
> 
> Hmm, don't these things have a history of becoming ABI? If so, I don't
> really want them in the driver at all, sorry. Do other drivers overcome
> this somehow?

This kind of tracepoint mainly works as a low-overhead probe point for debugging
purposes. I don't think any application would depend on it. It is for debugging,
and there are lots of tracepoints in other drivers, even in the iommu driver core
and the intel_iommu driver :-)

Developers use it in one of the below ways:

1. get trace print from the ring buffer by reading debugfs
root@ubuntu:/sys/kernel/debug/tracing/events/arm_smmu_v3# echo 1 > enable
# cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0  [058] ..s1 125444.768083: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768084: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768085: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768165: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768168: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768169: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768171: issue_cmdlist_exit: arm-smmu-v3.2.auto cmd number=1 sync=1
<idle>-0  [058] ..s1 125444.768259: issue_cmdlist_entry: arm-smmu-v3.2.auto cmd number=1 sync=1
...

This can replace printk with much, much lower overhead.

2. add a hook function to the tracepoint to do some latency measurement and time
statistics, just like the eBPF example I gave after the commit log.

Using it, I can get the histogram of the execution time of cmdq_issue_cmdlist():

     nsecs         : count    distribution
       0 -> 1      : 0        |                                        |
       2 -> 3      : 0        |                                        |
       4 -> 7      : 0        |                                        |
       8 -> 15     : 0        |                                        |
      16 -> 31     : 0        |                                        |
      32 -> 63     : 0        |                                        |
      64 -> 127    : 0        |                                        |
     128 -> 255    : 0        |                                        |
     256 -> 511    : 0        |                                        |
     512 -> 1023   : 58       |                                        |
    1024 -> 2047   : 22763    |****************************************|
    2048 -> 4095   : 13238    |***********************                 |

I feel it is very common to do this kind of thing for analyzing performance
issues. For example, to ease the analysis of softirq latency, softirq.c has the
below code:

asmlinkage __visible void __softirq_entry __do_softirq(void)
{
...
trace_softirq_entry(vec_nr);
h->action(h);
trace_softirq_exit(vec_nr);
...
}

> 
> Will

Thanks
Barry



RE: [PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist

2020-08-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jean-Philippe Brucker [mailto:jean-phili...@linaro.org]
> Sent: Friday, August 28, 2020 7:41 PM
> To: Song Bao Hua (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> robin.mur...@arm.com; w...@kernel.org; Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: add tracepoints for
> cmdq_issue_cmdlist
> 
> Hi,
> 
> On Thu, Aug 27, 2020 at 09:33:51PM +1200, Barry Song wrote:
> > cmdq_issue_cmdlist() is the hotspot that uses a lot of time. This
> > patch adds tracepoints for it to help debug.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  * can furthermore develop an eBPF program to benchmark using this
> > trace
> 
> Have you tried using kprobe and kretprobe instead of tracepoints?
> Any noticeable performance drop?

Yes. Please read this email:
kprobe overhead and OPTPROBES implementation on ARM64
https://www.spinics.net/lists/arm-kernel/msg828788.html

> 
> Thanks,
> Jean
> 
> >
> >   cmdlistlat.c:
> > #include <uapi/linux/ptrace.h>
> >
> > BPF_HASH(start, u32);
> > BPF_HISTOGRAM(dist);
> >
> > TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_entry) {
> > u32 pid;
> > u64 ts, *val;
> >
> > pid = bpf_get_current_pid_tgid();
> > ts = bpf_ktime_get_ns();
> > start.update(&pid, &ts);
> > return 0;
> > }
> >
> > TRACEPOINT_PROBE(arm_smmu_v3, issue_cmdlist_exit) {
> > u32 pid;
> > u64 *tsp, delta;
> >
> > pid = bpf_get_current_pid_tgid();
> > tsp = start.lookup(&pid);
> >
> > if (tsp != 0) {
> > delta = bpf_ktime_get_ns() - *tsp;
> > dist.increment(bpf_log2l(delta));
> > start.delete(&pid);
> > }
> >
> > return 0;
> > }
> >
> >  cmdlistlat.py:
> > #!/usr/bin/python3
> > #
> > from __future__ import print_function
> > from bcc import BPF
> > from ctypes import c_ushort, c_int, c_ulonglong
> > from time import sleep
> > from sys import argv
> >
> > def usage():
> > print("USAGE: %s [interval [count]]" % argv[0])
> > exit()
> >
> > # arguments
> > interval = 5
> > count = -1
> > if len(argv) > 1:
> > try:
> > interval = int(argv[1])
> > if interval == 0:
> > raise
> > if len(argv) > 2:
> > count = int(argv[2])
> > except: # also catches -h, --help
> > usage()
> >
> > # load BPF program
> > b = BPF(src_file = "cmdlistlat.c")
> >
> > # header
> > print("Tracing... Hit Ctrl-C to end.")
> >
> > # output
> > loop = 0
> > do_exit = 0
> > while (1):
> > if count > 0:
> > loop += 1
> > if loop > count:
> > exit()
> > try:
> > sleep(interval)
> > except KeyboardInterrupt:
> > pass; do_exit = 1
> >
> > print()
> > b["dist"].print_log2_hist("nsecs")
> > b["dist"].clear()
> > if do_exit:
> > exit()
> >
> >
> >  drivers/iommu/arm/arm-smmu-v3/Makefile|  1 +
> >  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h | 48
> +++
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  8 
> >  3 files changed, 57 insertions(+)
> >  create mode 100644
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile
> > b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > index 569e24e9f162..dba1087f91f3 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/Makefile
> > +++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
> > @@ -1,2 +1,3 @@
> >  # SPDX-License-Identifier: GPL-2.0
> > +ccflags-y += -I$(src)   # needed for trace events
> >  obj-$(CONFIG_ARM_SMMU_V3) += arm-smmu-v3.o diff --git
> > a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> > new file mode 100644
> > index ..29ab96706124
> > --- /dev/null
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-trace.h
> > @@ -0,0 +1,48 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2020 Hisilicon Limited.
>
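
The quoted hunk is cut off here, but the rest of such a header is mostly standard
tracepoint boilerplate. A sketch of what the entry event presumably looks like,
with the field names guessed from the trace output earlier in the thread:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM arm_smmu_v3

#if !defined(_ARM_SMMU_V3_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _ARM_SMMU_V3_TRACE_H

#include <linux/tracepoint.h>

TRACE_EVENT(issue_cmdlist_entry,
	TP_PROTO(struct device *dev, int n, bool sync),
	TP_ARGS(dev, n, sync),
	TP_STRUCT__entry(
		__string(dev, dev_name(dev))
		__field(int, n)
		__field(bool, sync)
	),
	TP_fast_assign(
		__assign_str(dev, dev_name(dev));
		__entry->n = n;
		__entry->sync = sync;
	),
	TP_printk("%s cmd number=%d sync=%d",
		  __get_str(dev), __entry->n, __entry->sync)
);

/* issue_cmdlist_exit would be identical apart from the name */

#endif /* _ARM_SMMU_V3_TRACE_H */

#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_FILE arm-smmu-v3-trace
#include <trace/define_trace.h>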

RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Saturday, August 22, 2020 7:27 AM
> To: 'Mike Kravetz' ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm 
> Subject: RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> 
> 
> > -Original Message-
> > From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> > Sent: Saturday, August 22, 2020 5:53 AM
> > To: Song Bao Hua (Barry Song) ; h...@lst.de;
> > m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > a...@linux-foundation.org
> > Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> > linux-ker...@vger.kernel.org; Zengtao (B) ;
> > huangdaode ; Linuxarm
> 
> > Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> > per-NUMA CMA
> >
> > Hi Barry,
> > Sorry for jumping in so late.
> >
> > On 8/21/20 4:33 AM, Barry Song wrote:
> > >
> > > with per-numa CMA, smmu will get memory from local numa node to save
> > command
> > > queues and page tables. that means dma_unmap latency will be shrunk
> > much.
> >
> > Since per-node CMA areas for hugetlb was introduced, I have been thinking
> > about the limited number of CMA areas.  In most configurations, I believe
> > it is limited to 7.  And, IIRC it is not something that can be changed at
> > runtime, you need to reconfig and rebuild to increase the number.  In
> contrast
> > some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> > the limited hugetlb use case.  However, this series is adding another user
> > of per-node CMA areas.
> >
> > With more users, should try to sync up number of CMA areas and number of
> > nodes?  Or, perhaps I am worrying about nothing?
> 
> Hi Mike,
> The current limitation is 8. If the server has 4 nodes and we enable both
> pernuma
> CMA and hugetlb, the last node will fail to get one cma area as the default
> global cma area will take 1 of 8. So users need to change menuconfig.
> If the server has 8 nodes, we enable one of pernuma cma and hugetlb, one
> node
> will fail to get cma.
> 
> We may set the default number of CMA areas as 8+MAX_NODES(if hugetlb
> enabled) +
> MAX_NODES(if pernuma cma enabled) if we don't expect users to change
> config, but
> right now hugetlb has not an option in Kconfig to enable or disable like
> pernuma cma
> has DMA_PERNUMA_CMA.

I would prefer we make some changes like:

config CMA_AREAS
int "Maximum count of the CMA areas"
depends on CMA
+   default 19 if NUMA
default 7
help
  CMA allows to create CMA areas for particular purpose, mainly,
  used as device private area. This parameter sets the maximum
  number of CMA area in the system.

- If unsure, leave the default value "7".
+ If unsure, leave the default value "7" or "19" if NUMA is used.

1 + CONFIG_CMA_AREAS should be quite enough for almost all servers on the market.

If 2 numa nodes, and both hugetlb cma and pernuma cma are enabled, we need 2*2 + 1 = 5.
If 4 numa nodes, and both hugetlb cma and pernuma cma are enabled, we need 2*4 + 1 = 9 -> the default ARM64 config.
If 8 numa nodes, and both hugetlb cma and pernuma cma are enabled, we need 2*8 + 1 = 17.

The default value supports the most common cases; it is not going to cover servers
with NODES_SHIFT=10, but they can make their own config, just like users need to
increase CMA_AREAS if they add many cma areas in the device tree, even in a system
without NUMA.

What do you think, Mike?

Thanks
Barry


RE: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> Sent: Saturday, August 22, 2020 5:53 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm 
> Subject: Re: [PATCH v7 0/3] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> Hi Barry,
> Sorry for jumping in so late.
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> >
> > with per-numa CMA, smmu will get memory from local numa node to save
> command
> > queues and page tables. that means dma_unmap latency will be shrunk
> much.
> 
> Since per-node CMA areas for hugetlb was introduced, I have been thinking
> about the limited number of CMA areas.  In most configurations, I believe
> it is limited to 7.  And, IIRC it is not something that can be changed at
> runtime, you need to reconfig and rebuild to increase the number.  In contrast
> some configs have NODES_SHIFT set to 10.  I wasn't too worried because of
> the limited hugetlb use case.  However, this series is adding another user
> of per-node CMA areas.
> 
> With more users, should try to sync up number of CMA areas and number of
> nodes?  Or, perhaps I am worrying about nothing?

Hi Mike,
The current limitation is 8. If the server has 4 nodes and we enable both pernuma
CMA and hugetlb CMA, the last node will fail to get one CMA area, as the default
global CMA area takes 1 of the 8. So users need to change menuconfig.
If the server has 8 nodes and we enable just one of pernuma CMA and hugetlb CMA,
one node will still fail to get a CMA area.

We could set the default number of CMA areas to 8 + MAX_NODES (if hugetlb is
enabled) + MAX_NODES (if pernuma CMA is enabled) if we don't expect users to
change the config, but right now hugetlb has no option in Kconfig to enable or
disable it the way pernuma CMA has DMA_PERNUMA_CMA.

> --
> Mike Kravetz

Thanks
Barry


RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Randy Dunlap [mailto:rdun...@infradead.org]
> Sent: Saturday, August 22, 2020 4:08 AM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> a...@linux-foundation.org
> Cc: iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm ;
> Jonathan Cameron ; Nicolas Saenz Julienne
> ; Steve Capper ; Mike
> Rapoport 
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On 8/21/20 4:33 AM, Barry Song wrote:
> > ---
> >  -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  11 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. And If this option is not
> > +   specificed, the default value is 0.
> > +   With per-numa CMA enabled, DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> 
> Entries in kernel-parameters.txt are supposed to be in alphabetical order
> but this one is not.  If you want to keep it near the cma= entry, you can
> rename it like Mike suggested.  Otherwise it needs to be moved.

As I've replied to Mike's comment, I'd like to rename it to cma_pernuma.

> 
> 
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> 
> 
> 
> --
> ~Randy

Thanks
Barry



RE: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Rapoport [mailto:r...@linux.ibm.com]
> Sent: Saturday, August 22, 2020 2:28 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; a...@linux-foundation.org;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Linuxarm ;
> Jonathan Cameron ; Nicolas Saenz Julienne
> ; Steve Capper 
> Subject: Re: [PATCH v7 1/3] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 11:33:53PM +1200, Barry Song wrote:
> > Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
> > coherent DMA buffers to save their command queues and page tables. As
> > there is only one default CMA in the whole system, SMMUs on nodes other
> > than node0 will get remote memory. This leads to significant latency.
> >
> > This patch provides per-numa CMA so that drivers like SMMU can get local
> > memory. Tests show localizing CMA can decrease dma_unmap latency much.
> > For instance, before this patch, SMMU on node2  has to wait for more than
> > 560ns for the completion of CMD_SYNC in an empty command queue; with
> this
> > patch, it needs 240ns only.
> >
> > A positive side effect of this patch would be improving performance even
> > further for those users who are worried about performance more than DMA
> > security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
> > drivers can get local coherent DMA buffers.
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >  -v7: with respect to Will's comments
> >  * move to use for_each_online_node
> >  * add description if users don't specify pernuma_cma
> >  * provide default value for CONFIG_DMA_PERNUMA_CMA
> >
> >  .../admin-guide/kernel-parameters.txt |  11 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  11 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 118 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..c609527fc35a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,17 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> 
> Maybe cma_pernuma or cma_pernode?

Sounds good.

> 
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. And If this option is not
> > +   specificed, the default value is 0.
> > +   With per-numa CMA enabled, DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> > diff --git a/include/linux/dma-contiguous.h
> b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..fe55e004f1f4 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -171,6 +171,12 @@ static inline void dma_free_contiguous(struct
> device *dev, struct page *page,
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void dma_pernuma_cma_reserve(void);
> > +#else
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +#endif
> > +
> >  #endif
> >
> >  #endif
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..c38979d45b13 100644

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 21, 2020 9:27 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 09:13:39AM +, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > -Original Message-
> > > From: Will Deacon [mailto:w...@kernel.org]
> > > Sent: Friday, August 21, 2020 8:47 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> > > ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> > > iommu@lists.linux-foundation.org; Linuxarm ;
> > > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > > huangdaode ; Jonathan Cameron
> > > ; Nicolas Saenz Julienne
> > > ; Steve Capper ;
> > > Andrew Morton ; Mike Rapoport
> > > 
> > > Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to
> > > reserve per-numa CMA
> > >
> > > On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> > > b/Documentation/admin-guide/kernel-parameters.txt
> > > > index bdc1f33fd3d1..3f33b89aeab5 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -599,6 +599,15 @@
> > > > altogether. For more information, see
> > > > include/linux/dma-contiguous.h
> > > >
> > > > +   pernuma_cma=nn[MG]
> > > > +   [ARM64,KNL]
> > > > +   Sets the size of kernel per-numa memory area for
> > > > +   contiguous memory allocations. A value of 0 
> > > > disables
> > > > +   per-numa CMA altogether. DMA users on node nid 
> > > > will
> > > > +   first try to allocate buffer from the pernuma 
> > > > area
> > > > +   which is located in node nid, if the allocation 
> > > > fails,
> > > > +   they will fallback to the global default memory 
> > > > area.
> > >
> > > What is the default behaviour if this option is not specified? Seems
> > > like that should be mentioned here.
> 
> Just wanted to make sure you didn't miss this ^^

If it is not specified, the default size is 0, which means pernuma_cma is disabled.

I will add some words about this.

> 
> > >
> > > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index
> > > > 847a9d1fa634..db7a37ed35eb 100644
> > > > --- a/kernel/dma/Kconfig
> > > > +++ b/kernel/dma/Kconfig
> > > > @@ -118,6 +118,16 @@ config DMA_CMA
> > > >   If unsure, say "n".
> > > >
> > > >  if  DMA_CMA
> > > > +
> > > > +config DMA_PERNUMA_CMA
> > > > +   bool "Enable separate DMA Contiguous Memory Area for each
> NUMA
> > > Node"
> > >
> > > I don't understand the need for this config option. If you have
> > > DMA_DMA and you have NUMA, why wouldn't you want this enabled?
> >
> > Christoph preferred this in previous patchset in order to be able to
> > remove all of the code in the text if users don't use pernuma CMA.
> 
> Ok, I defer to Christoph here, but maybe a "default NUMA" might work?

maybe "default NUMA && ARM64"?
Though I believe it will benefit x86, I don't have x86 server hardware or a real
scenario to test, so I haven't put the dma_pernuma_cma_reserve() code in arch/x86.
Hopefully some x86 folks will bring it up and remove the "&& ARM64".

> 
> > > > +   help
> > > > + Enable this option to get pernuma CMA areas so that devices 
> > > > like
> > > > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > > > +
> > > > + You can set the size of pernuma CMA by specifying

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Friday, August 21, 2020 8:47 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com;
> iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Aug 21, 2020 at 02:26:14PM +1200, Barry Song wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..3f33b89aeab5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,15 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> 
> What is the default behaviour if this option is not specified? Seems like
> that should be mentioned here.
> 
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 847a9d1fa634..db7a37ed35eb 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -118,6 +118,16 @@ config DMA_CMA
> >   If unsure, say "n".
> >
> >  if  DMA_CMA
> > +
> > +config DMA_PERNUMA_CMA
> > +   bool "Enable separate DMA Contiguous Memory Area for each NUMA
> Node"
> 
> I don't understand the need for this config option. If you have DMA_DMA and
> you have NUMA, why wouldn't you want this enabled?

Christoph preferred this in a previous patchset in order to be able to remove all
of the code from the kernel text if users don't use pernuma CMA.

> 
> > +   help
> > + Enable this option to get pernuma CMA areas so that devices like
> > + ARM64 SMMU can get local memory by DMA coherent APIs.
> > +
> > + You can set the size of pernuma CMA by specifying
> "pernuma_cma=size"
> > + on the kernel's command line.
> > +
> >  comment "Default contiguous memory area size:"
> >
> >  config CMA_SIZE_MBYTES
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..89b95f10e56d 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> > +
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +   pernuma_size_bytes = memparse(p, );
> > +   return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +#endif
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused
> cma_early_percent_memory(void)
> > @@ -96,6 +109,34 @@ static inline __maybe_unused phys_addr_t
> cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +   int nid;
> > +
> > +   if (!pernuma_size_bytes)
> > +   return;
> 
> If this is useful (I assume it is), then I think we should have a non-zero
> default value, a bit like normal CMA does via CMA_SIZE_MBYTES.

The patchset used to have a CONFIG_PERNUMA_CMA_SIZE in kernel/dma/Kconfig, but
Christoph was not comfortable with it:
https://lore.kernel.org/linux-iommu/20200728115231.ga...@lst.de/

Would you mind hardcoding the value in CONFIG_CMDLINE in arch/arm64/Kconfig as
Christoph mentioned:
config CMDLINE
	default "pernuma_cma=16M"

If you also don't like the change to CMDLINE in arch/arm64/Kconfig, I guess I
have to depend on users setting it in cmdline.

RE: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve per-numa CMA

2020-08-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Randy Dunlap
> Sent: Friday, August 21, 2020 2:50 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v6 1/2] dma-contiguous: provide the ability to reserve
> per-numa CMA
> 
> On 8/20/20 7:26 PM, Barry Song wrote:
> >
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >  v6: rebase on top of 5.9-rc1;
> >  doc cleanup
> >
> >  .../admin-guide/kernel-parameters.txt |   9 ++
> >  include/linux/dma-contiguous.h|   6 ++
> >  kernel/dma/Kconfig|  10 ++
> >  kernel/dma/contiguous.c   | 100
> --
> >  4 files changed, 115 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> > index bdc1f33fd3d1..3f33b89aeab5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -599,6 +599,15 @@
> > altogether. For more information, see
> > include/linux/dma-contiguous.h
> >
> > +   pernuma_cma=nn[MG]
> 
> memparse() allows any one of these suffixes: K, M, G, T, P, E
> and nothing in the option parsing function cares what suffix is used...

Hello Randy,
Thanks for your comments.

Actually I am following the suffix of default cma:
cma=nn[MG]@[start[MG][-end[MG]]]
[ARM,X86,KNL]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
include/linux/dma-contiguous.h

I suggest users set the size in either MB or GB, as they do for cma.

> 
> > +   [ARM64,KNL]
> > +   Sets the size of kernel per-numa memory area for
> > +   contiguous memory allocations. A value of 0 disables
> > +   per-numa CMA altogether. DMA users on node nid will
> > +   first try to allocate buffer from the pernuma area
> > +   which is located in node nid, if the allocation fails,
> > +   they will fallback to the global default memory area.
> > +
> > cmo_free_hint=  [PPC] Format: { yes | no }
> > Specify whether pages are marked as being inactive
> > when they are freed.  This is used in CMO environments
> 
> > diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
> > index cff7e60968b9..89b95f10e56d 100644
> > --- a/kernel/dma/contiguous.c
> > +++ b/kernel/dma/contiguous.c
> > @@ -69,6 +69,19 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +#ifdef CONFIG_DMA_PERNUMA_CMA
> > +
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> > +static phys_addr_t pernuma_size_bytes __initdata;
> 
> why phys_addr_t? couldn't it just be unsigned long long?
> 

Mainly to follow the existing programming habit in kernel/dma/contiguous.c:
I think the original code probably meant the size should not be larger than the
maximum value of phys_addr_t:

/*
 * Default global CMA area size can be defined in kernel's .config.
 * This is useful mainly for distro maintainers to create a kernel
 * that works correctly for most supported systems.
 * The size can be set in bytes or as a percentage of the total memory
 * in the system.
 *
 * Users, who want to set the size of global CMA area for their system
 * should use cma= kernel parameter.
 */
static const phys_addr_t size_bytes __initconst =
	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;

RE: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi polling

2020-08-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Wednesday, August 19, 2020 2:31 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> j...@8bytes.org
> Cc: Zengtao (B) ;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> Linuxarm 
> Subject: Re: [PATCH v4] iommu/arm-smmu-v3: permit users to disable msi
> polling
> 
> On 2020-08-18 12:17, Barry Song wrote:
> > Polling by MSI isn't necessarily faster than polling by SEV. Tests on
> > hi1620 show hns3 100G NIC network throughput can improve from 25G to
> > 27G if we disable MSI polling while running 16 netperf threads sending
> > UDP packets in size 32KB. TX throughput can improve from 7G to 7.7G for
> > single thread.
> > The reason for the throughput improvement is that the latency to poll
> > the completion of CMD_SYNC becomes smaller. After sending a CMD_SYNC
> > in an empty cmd queue, typically we need to wait for 280ns using MSI
> > polling. But we only need around 190ns after disabling MSI polling.
> > This patch provides a command line option so that users can decide to
> > use MSI polling or not based on their tests.
> >
> > Signed-off-by: Barry Song 
> > ---
> >   -v4: rebase on top of 5.9-rc1
> >   refine changelog
> >
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18
> ++
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207be7ea..89d3cb391fef 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -418,6 +418,11 @@ module_param_named(disable_bypass,
> disable_bypass, bool, S_IRUGO);
> >   MODULE_PARM_DESC(disable_bypass,
> > "Disable bypass streams such that incoming transactions from devices
> that are not attached to an iommu domain will report an abort back to the
> device and will not be allowed to pass through the SMMU.");
> >
> > +static bool disable_msipolling;
> > +module_param_named(disable_msipolling, disable_msipolling, bool,
> S_IRUGO);
> 
> Just use module_param() - going out of the way to specify a "different"
> name that's identical to the variable name is silly.

Thanks for pointing that out; I also fixed the same issue in the existing
disable_bypass parameter in the new patchset.

Sorry, I made a typo: the new patchset should be v5, but I wrote v4.

> 
> Also I think the preference these days is to specify permissions as
> plain octal constants rather than those rather inscrutable macros. I
> certainly find that more readable myself.
> 
> (Yes, the existing parameter commits the same offences, but I'd rather
> clean that up separately than perpetuate it)

Thanks for pointing that out. This got fixed in the new patchset.

> 
> > +MODULE_PARM_DESC(disable_msipolling,
> > +   "Disable MSI-based polling for CMD_SYNC completion.");
> > +
> >   enum pri_resp {
> > PRI_RESP_DENY = 0,
> > PRI_RESP_FAIL = 1,
> > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd,
> struct arm_smmu_cmdq_ent *ent)
> > return 0;
> >   }
> >
> > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
> > +{
> > +   return !disable_msipolling &&
> > +  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
> > +  smmu->features & ARM_SMMU_FEAT_MSI;
> > +}
> 
> I'd wrap this up into a new ARM_SMMU_OPT_MSIPOLL flag set at probe time,
> rather than constantly reevaluating this whole expression (now that it's
> no longer simply testing two adjacent bits of the same word).

Done in the new patchset. It turns out we only need to check one bit now with
the new patch:

-   if (smmu->features & ARM_SMMU_FEAT_MSI &&
-   smmu->features & ARM_SMMU_FEAT_COHERENCY)
+   if (smmu->options & ARM_SMMU_OPT_MSIPOLL)
return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
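
For reference, a minimal sketch of the probe-time flag Robin suggests, using the
names from this thread (the helper name and placement are illustrative; the real
hunk is in the new patchset and is not quoted here):

static void arm_smmu_check_msipolling(struct arm_smmu_device *smmu)
{
	/* evaluate the whole expression once at probe time */
	if (!disable_msipolling &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY &&
	    smmu->features & ARM_SMMU_FEAT_MSI)
		smmu->options |= ARM_SMMU_OPT_MSIPOLL;
}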


> 
> Robin.
> 
> > +
> >   static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct
> arm_smmu_device *smmu,
> >  u32 prod)
> >   {
> > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64
> *cmd, struct arm_smmu_device *smmu,
> >  * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  * payload, so the write will zero the entire command on that platform.
> >  */
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY) {

RE: [PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI polling

2020-08-03 Thread Song Bao Hua (Barry Song)
> -Original Message-
> From: John Garry
> Sent: Tuesday, August 4, 2020 3:34 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> robin.mur...@arm.com; j...@8bytes.org; iommu@lists.linux-foundation.org
> Cc: Zengtao (B) ;
> linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH v3] iommu/arm-smmu-v3: permit users to disable MSI
> polling
> 
> On 01/08/2020 08:47, Barry Song wrote:
> > Polling by MSI isn't necessarily faster than polling by SEV. Tests on
> > hi1620 show hns3 100G NIC network throughput can improve from 25G to
> > 27G if we disable MSI polling while running 16 netperf threads sending
> > UDP packets in size 32KB.
> 
> BTW, Do we have any more results than this? This is just one scenario.
> 

John, it is more than one scenario. The micro-benchmark shows polling by SEV has
less latency than MSI. This motivated me to use a real scenario to verify. For
this network case, if we set the thread count to 1 rather than 16, network TX
throughput can improve from 7Gbps to 7.7Gbps.

> How about your micro-benchmark, which allows you to set the number of
> CPUs?

The micro-benchmark works like this:
1. Send a CMD_SYNC in an empty command queue.
2. Poll for the completion of this CMD_SYNC by MSI or SEV.

I have seen the polling latency decrease by about 80ns. Without this patch,
the latency was about ~270ns; after this patch, it is about ~190ns.
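
The sketch below shows roughly how such a latency sample can be taken
(illustrative only: arm_smmu_cmdq_issue_sync() is the driver's existing CMD_SYNC
path, while the sampling helper and the print are mine):

static void cmd_sync_latency_sample(struct arm_smmu_device *smmu)
{
	u64 start, delta;

	/* the command queue is assumed to be empty at this point */
	start = ktime_get_ns();
	arm_smmu_cmdq_issue_sync(smmu);	/* polls completion by MSI or SEV */
	delta = ktime_get_ns() - start;

	pr_info("CMD_SYNC completion latency: %llu ns\n", delta);
}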

> 
> Thanks,
> John
> 
> > This patch provides a command line option so that users can decide to
> > use MSI polling or not based on their tests.
> >
> > Signed-off-by: Barry Song 
> > ---
> >   -v3:
> >* rebase on top of linux-next as arm-smmu-v3.c has moved;
> >* provide a command line option
> >
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207be7ea..89d3cb391fef 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -418,6 +418,11 @@ module_param_named(disable_bypass,
> disable_bypass, bool, S_IRUGO);
> >   MODULE_PARM_DESC(disable_bypass,
> > "Disable bypass streams such that incoming transactions from devices
> that are not attached to an iommu domain will report an abort back to the
> device and will not be allowed to pass through the SMMU.");
> >
> > +static bool disable_msipolling;
> > +module_param_named(disable_msipolling, disable_msipolling, bool,
> S_IRUGO);
> > +MODULE_PARM_DESC(disable_msipolling,
> > +   "Disable MSI-based polling for CMD_SYNC completion.");
> > +
> >   enum pri_resp {
> > PRI_RESP_DENY = 0,
> > PRI_RESP_FAIL = 1,
> > @@ -980,6 +985,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd,
> struct arm_smmu_cmdq_ent *ent)
> > return 0;
> >   }
> >
> > +static bool arm_smmu_use_msipolling(struct arm_smmu_device *smmu)
> > +{
> > +   return !disable_msipolling &&
> > +  smmu->features & ARM_SMMU_FEAT_COHERENCY &&
> > +  smmu->features & ARM_SMMU_FEAT_MSI;
> > +}
> > +
> >   static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct
> arm_smmu_device *smmu,
> >  u32 prod)
> >   {
> > @@ -992,8 +1004,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64
> *cmd, struct arm_smmu_device *smmu,
> >  * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  * payload, so the write will zero the entire command on that platform.
> >  */
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
> > +   if (arm_smmu_use_msipolling(smmu)) {
> > > 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
> >q->ent_dwords * 8;
> > }
> > @@ -1332,8 +1343,7 @@ static int
> __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
> >   static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device
> *smmu,
> >  struct arm_smmu_ll_queue *llq)
> >   {
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > -   smmu->features & ARM_SMMU_FEAT_COHERENCY)
> > +   if (arm_smmu_use_msipolling(smmu))
> > return __arm_smmu_cmdq_poll_until_msi(smmu, llq);
> >
> > return __arm_smmu_cmdq_poll_until_consumed(smmu, llq);
> >

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2] iommu/arm-smmu-v3: disable MSI polling if SEV polling is faster

2020-07-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Saturday, August 1, 2020 12:22 AM
> To: Song Bao Hua (Barry Song) 
> Cc: John Garry ; robin.mur...@arm.com;
> j...@8bytes.org; iommu@lists.linux-foundation.org; Zengtao (B)
> ; Linuxarm ;
> linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH v2] iommu/arm-smmu-v3: disable MSI polling if SEV
> polling is faster
> 
> On Fri, Jul 31, 2020 at 10:48:33AM +, Song Bao Hua (Barry Song) wrote:
> > > -Original Message-
> > > From: John Garry
> > > Sent: Friday, July 31, 2020 10:21 PM
> > > To: Song Bao Hua (Barry Song) ;
> w...@kernel.org;
> > > robin.mur...@arm.com; j...@8bytes.org;
> iommu@lists.linux-foundation.org
> > > Cc: Zengtao (B) ; Linuxarm
> > > ; linux-arm-ker...@lists.infradead.org
> > > Subject: Re: [PATCH v2] iommu/arm-smmu-v3: disable MSI polling if SEV
> > > polling is faster
> > >
> > > On 31/07/2020 09:33, Barry Song wrote:
> > > > Different implementations may show different performance by using SEV
> > > > polling or MSI polling.
> > > > On the implementation of hi1620, tests show disabling MSI polling can
> > > > bring performance improvement.
> > > > Using 16 threads to run netperf on hns3 100G NIC with UDP packet size
> > > > in 32768bytes and set iommu to strict, TX throughput can improve from
> > > > 25Gbps to 27Gbps by this patch.
> > > > This patch adds a generic function to support implementation options
> > > > based on IIDR and disables MSI polling if IIDR matches the specific
> > > > implementation tested.
> > > Not sure if we should do checks like this on an implementation basis.
> > > I'm sure maintainers will decide.
> >
> > Yes, maintainers will decide. I guess Will won't object to an IIDR-based
> > solution according to previous discussion threads:
> > https://lore.kernel.org/patchwork/patch/783718/
> >
> > Am I right, Will?
> 
> Honestly, I object to the whole idea that we should turn off optional
> hardware features just because they're slow. Did nobody take time to look at
> the design and check that it offered some benefit, or were they in too much
> of a hurry to tick the checkbox to say they had the new feature? I really
> dislike the pick and mix nature that some of this IP is heading in, where
> the marketing folks want a slice of everything for the branding, instead of
> doing a few useful things well. Anyway, that's not your fault, so I'll stop
> moaning. *sigh*
> 
> Given that you've baked this thing now, then if we have to support it I
> would prefer the command-line option. At least that means that people can
> compare the performance with it on and off (and hopefully make sure the
> hardware doesn't suck). It also means it's not specific to ACPI.

Hi Will,
Thanks for your comment. I had a patch with a command line option, as below.
If this is what you prefer, I'll refine it and send it.

[PATCH] iommu/arm-smmu-v3: permit users to disable msi polling
---
 drivers/iommu/arm-smmu-v3.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..4fb1681308e4 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling = 1;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+	"Don't use MSI to poll the completion of CMD_SYNC if it is slower than SEV");
+
 enum pri_resp {
PRI_RESP_DENY = 0,
PRI_RESP_FAIL = 1,
@@ -992,7 +997,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
	 * payload, so the write will zero the entire command on that platform.
	 */
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
+	if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
				   q->ent_dwords * 8;
@@ -1332,7 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,

RE: [PATCH v2] iommu/arm-smmu-v3: disable MSI polling if SEV polling is faster

2020-07-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: John Garry
> Sent: Friday, July 31, 2020 10:21 PM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> robin.mur...@arm.com; j...@8bytes.org; iommu@lists.linux-foundation.org
> Cc: Zengtao (B) ; Linuxarm
> ; linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH v2] iommu/arm-smmu-v3: disable MSI polling if SEV
> polling is faster
> 
> On 31/07/2020 09:33, Barry Song wrote:
> > Different implementations may show different performance by using SEV
> > polling or MSI polling.
> > On the implementation of hi1620, tests show disabling MSI polling can
> > bring performance improvement.
> > Using 16 threads to run netperf on hns3 100G NIC with UDP packet size
> > in 32768bytes and set iommu to strict, TX throughput can improve from
> > 25Gbps to 27Gbps by this patch.
> > This patch adds a generic function to support implementation options
> > based on IIDR and disables MSI polling if IIDR matches the specific
> > implementation tested.
> Not sure if we should do checks like this on an implementation basis.
> I'm sure maintainers will decide.

Yes, maintainers will decide. I guess Will won't object to an IIDR-based
solution according to previous discussion threads:
https://lore.kernel.org/patchwork/patch/783718/

Am I right, Will?

> 
> >
> > Cc: Prime Zeng 
> > Signed-off-by: Barry Song 
> > ---
> >   -v2: rather than disabling msipolling globally, only disable it for
> >   specific implementation based on IIDR
> >
> >   drivers/iommu/arm-smmu-v3.c | 31 +--
> 
> this file has moved, check linux-next

Thanks for the reminder. Hopefully Will or Robin can give some feedback so that
v3 can fix this as well as any other issues they might point out.

> 
> >   1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index f578677a5c41..ed5a6774eb45 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -88,6 +88,12 @@
> >   #define IDR5_VAX  GENMASK(11, 10)
> >   #define IDR5_VAX_52_BIT   1
> >
> > +#define ARM_SMMU_IIDR  0x18
> > +#define IIDR_VARIANT   GENMASK(19, 16)
> > +#define IIDR_REVISION  GENMASK(15, 12)
> > +#define IIDR_IMPLEMENTER   GENMASK(11, 0)
> > +#define IMPLEMENTER_HISILICON  0x736
> > +
> >   #define ARM_SMMU_CR0  0x20
> >   #define CR0_ATSCHK(1 << 4)
> >   #define CR0_CMDQEN(1 << 3)
> > @@ -652,6 +658,7 @@ struct arm_smmu_device {
> >
> >   #define ARM_SMMU_OPT_SKIP_PREFETCH(1 << 0)
> >   #define ARM_SMMU_OPT_PAGE0_REGS_ONLY  (1 << 1)
> > +#define ARM_SMMU_OPT_DISABLE_MSIPOLL(1 << 2)
> > u32 options;
> >
> > 	struct arm_smmu_cmdq	cmdq;
> > @@ -992,7 +999,8 @@ static void arm_smmu_cmdq_build_sync_cmd(u64
> *cmd, struct arm_smmu_device *smmu,
> >  * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> >  * payload, so the write will zero the entire command on that platform.
> >  */
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > +   if (!(smmu->options & ARM_SMMU_OPT_DISABLE_MSIPOLL) &&
> > +   smmu->features & ARM_SMMU_FEAT_MSI &&
> 
> I don't know why you check MSIPOLL disabled and then MSI poll supported.
> Surely for native non-MSI poll (like hi1616), the ARM_SMMU_FEAT_MSI
> check first makes sense. This is fastpath, albeit fast to maybe wait..

I was thinking !(smmu->options & ARM_SMMU_OPT_DISABLE_MSIPOLL) would be the fast
path for the hi1620 as it can jump out quickly.

But yes, you are right, since there are many other SMMUs, not only the hi1620.
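
Something like the below keeps the feature bits as the first tests (an
illustrative reordering of the v2 hunk, not a tested patch):

	if (smmu->features & ARM_SMMU_FEAT_MSI &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY &&
	    !(smmu->options & ARM_SMMU_OPT_DISABLE_MSIPOLL)) {

That way SMMUs without MSI support, like the hi1616, bail out on the first check.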


> 
> > smmu->features & ARM_SMMU_FEAT_COHERENCY) {
> > 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
> >q->ent_dwords * 8;
> > @@ -1332,7 +1340,8 @@ static int
> __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
> >   static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device
> *smmu,
> >  struct arm_smmu_ll_queue *llq)
> >   {
> > -   if (smmu->features & ARM_SMMU_FEAT_MSI &&
> > +   if (!(smmu->options & ARM_SMMU_OPT_DISABLE_MSIPOLL) &&
> > +   smmu->features & ARM_SMMU_FEAT_MSI &&
> > 	    smmu->features & ARM_SMMU_FEAT_COHERENCY)

RE: [PATCH v4 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, July 29, 2020 12:23 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; m.szyprow...@samsung.com;
> robin.mur...@arm.com; w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v4 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On Tue, Jul 28, 2020 at 12:19:03PM +, Song Bao Hua (Barry Song) wrote:
> > I am sorry I haven't got your point yet. Do you mean something like the
> below?
> >
> > arch/arm64/Kconfig:
> > config CMDLINE
> > string "Default kernel command string"
> > -   default ""
> > +   default "pernuma_cma=16M"
> > help
> >   Provide a set of default command-line options at build time by
> >   entering them here. As a minimum, you should specify the the
> >   root device (e.g. root=/dev/nfs).
> 
> Yes.
> 
> > A background of the current code is that Linux distributions can usually use
> arch/arm64/configs/defconfig
> > directly to build the kernel. The cmdline can easily be ignored during the
> > generation of Linux distributions.
> 
> I've not actually heard of a distro shipping defconfig yet..
> 
> >
> > > if a way to expose this in the device tree might be useful, but people
> > > more familiar with the device tree and the arm code will have to chime
> > > in on that.
> >
> > Not sure if it is a useful use case, as we are using ACPI rather than device
> > tree since it is an ARM64 server with NUMA.
> 
> Well, than maybe ACPI experts need to chime in on this.
> 
> > > This seems to have lost the dma_contiguous_default_area NULL check.
> >
> > cma_alloc() is doing the check by returning NULL if cma is NULL.
> >
> > struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> >bool no_warn)
> > {
> > ...
> > if (!cma || !cma->count)
> > return NULL;
> > }
> >
> > But I agree here the code can check before calling cma_alloc_aligned.
> 
> Oh, indeed.  Please split the removal of the NULL check into a prep
> patch then.

Do you mean removing the NULL check in cma_alloc()? If so, it seems a lot of
places would need to be changed:

struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
				       unsigned int align, bool no_warn)
{
	if (align > CONFIG_CMA_ALIGNMENT)
		align = CONFIG_CMA_ALIGNMENT;
	+ code to check dev_get_cma_area(dev) is not NULL
	return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
}

bool dma_release_from_contiguous(struct device *dev, struct page *pages,
				 int count)
{
	+ code to check dev_get_cma_area(dev) is not NULL
	return cma_release(dev_get_cma_area(dev), pages, count);
}

bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
{
	unsigned long pfn;
	+ do we need to remove this !cma too if we remove it in cma_alloc()?
	if (!cma || !pages)
		return false;
	...
}
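
For illustration, one of those call sites with the check hoisted out of
cma_alloc() might look like this (a sketch, not from any posted patch):

struct page *dma_alloc_from_contiguous(struct device *dev, size_t count,
				       unsigned int align, bool no_warn)
{
	struct cma *cma = dev_get_cma_area(dev);

	/* the caller now handles the missing-CMA case itself */
	if (!cma)
		return NULL;
	if (align > CONFIG_CMA_ALIGNMENT)
		align = CONFIG_CMA_ALIGNMENT;
	return cma_alloc(cma, count, align, no_warn);
}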

And some other places where cma_alloc() and cma_release() are called:

arch/powerpc/kvm/book3s_hv_builtin.c
drivers/dma-buf/heaps/cma_heap.c
drivers/s390/char/vmcp.c
drivers/staging/android/ion/ion_cma_heap.c
mm/hugetlb.c

It seems much of this code was written with the assumption that
cma_alloc()/cma_release() will check whether cma is NULL, so callers don't
check it before calling them.

And I am not sure whether the kernel robot would report an error like "pointer
dereferenced before checking" if the !cma check were removed from cma_alloc().

Thanks
Barry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v4 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Tuesday, July 28, 2020 11:53 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v4 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Jul 24, 2020 at 01:13:43AM +1200, Barry Song wrote:
> > +config CMA_PERNUMA_SIZE_MBYTES
> > +   int "Size in Mega Bytes for per-numa CMA areas"
> > +   depends on NUMA
> > +   default 16 if ARM64
> > +   default 0
> > +   help
> > + Defines the size (in MiB) of the per-numa memory area for Contiguous
> > + Memory Allocator. Every numa node will get a separate CMA with this
> > + size. If the size of 0 is selected, per-numa CMA is disabled.
> 
> I'm still not a fan of the config option.  You can just hardcode the
> value in CONFIG_CMDLINE based on the kernel parameter.  Also I wonder

I am sorry I haven't got your point yet. Do you mean something like the below?

arch/arm64/Kconfig:
config CMDLINE
string "Default kernel command string"
-   default ""
+   default "pernuma_cma=16M"
help
  Provide a set of default command-line options at build time by
  entering them here. As a minimum, you should specify the the
  root device (e.g. root=/dev/nfs).

Some background on the current code: Linux distributions can usually use
arch/arm64/configs/defconfig directly to build the kernel. The cmdline can
easily be ignored during the generation of Linux distributions.

> if a way to expose this in the device tree might be useful, but people
> more familiar with the device tree and the arm code will have to chime
> in on that.

Not sure if it is a useful use case, as we are using ACPI rather than device
tree since it is an ARM64 server with NUMA.

> 
> >  struct cma *dma_contiguous_default_area;
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> >
> >  /*
> >   * Default global CMA area size can be defined in kernel's .config.
> > @@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
> >   */
> >  static const phys_addr_t size_bytes __initconst =
> > (phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
> > +static phys_addr_t pernuma_size_bytes __initdata =
> > +   (phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
> >  static phys_addr_t  size_cmdline __initdata = -1;
> >  static phys_addr_t base_cmdline __initdata;
> >  static phys_addr_t limit_cmdline __initdata;
> > @@ -69,6 +78,13 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +	pernuma_size_bytes = memparse(p, &p);
> > +   return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused
> cma_early_percent_memory(void)
> > @@ -96,6 +112,33 @@ static inline __maybe_unused phys_addr_t
> cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +   int nid;
> > +
> > +   if (!pernuma_size_bytes)
> > +   return;
> > +
> > +   for_each_node_state(nid, N_MEMORY) {
> > +   int ret;
> > +   char name[20];
> > +
> > +   snprintf(name, sizeof(name), "pernuma%d", nid);
> > +		ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> > +						 0, false, name,
> > +						 &dma_contiguous_pernuma_area[nid],
> > +						 nid);
> 
> This adds a > 80 char line.

Will refine.

> 
>  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> >  {
> > +   int nid = dev_to_node(dev);
> > +
> > /* CMA can be used only in the context which permits sleeping */
> > if (!gfpflags_allow_blocking(gfp))
> > return NULL;
> > if (dev->cma_area)
> > return cma_alloc_aligned(dev->cma_area, size, gfp);
> > -   if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> > +

RE: [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous

2020-07-27 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: iommu [mailto:iommu-boun...@lists.linux-foundation.org] On Behalf
> Of Christoph Hellwig
> Sent: Friday, July 24, 2020 12:02 AM
> To: iommu@lists.linux-foundation.org
> Cc: robin.mur...@arm.com
> Subject: [PATCH v2] dma-contiguous: cleanup dma_alloc_contiguous
> 
> Split out a cma_alloc_aligned helper to deal with the "interesting"
> calling conventions for cma_alloc, which then allows the main function to
> be written straightforwardly.  This also takes advantage of the fact that NULL
> dev arguments have been gone from the DMA API for a while.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Barry Song 

And I have rebased the per-numa CMA patchset on top of this one.
https://lore.kernel.org/linux-arm-kernel/20200723131344.41472-1-song.bao@hisilicon.com/

> ---
> 
> Changes since v1:
>  - actually pass on the select struct cma
>  - clean up cma_alloc_aligned a bit
> 
>  kernel/dma/contiguous.c | 31 ++-
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index
> 15bc5026c485f2..cff7e60968b9e1 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -215,6 +215,13 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
>   return cma_release(dev_get_cma_area(dev), pages, count);  }
> 
> +static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
> +{
> + unsigned int align = min(get_order(size), CONFIG_CMA_ALIGNMENT);
> +
> + return cma_alloc(cma, size >> PAGE_SHIFT, align, gfp & __GFP_NOWARN);
> +}
> +
>  /**
>   * dma_alloc_contiguous() - allocate contiguous pages
>   * @dev:   Pointer to device for which the allocation is performed.
> @@ -231,24 +238,14 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
>   */
>  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> - size_t count = size >> PAGE_SHIFT;
> - struct page *page = NULL;
> - struct cma *cma = NULL;
> -
> - if (dev && dev->cma_area)
> - cma = dev->cma_area;
> - else if (count > 1)
> - cma = dma_contiguous_default_area;
> -
>   /* CMA can be used only in the context which permits sleeping */
> - if (cma && gfpflags_allow_blocking(gfp)) {
> - size_t align = get_order(size);
> - size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
> -
> - page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
> - }
> -
> - return page;
> + if (!gfpflags_allow_blocking(gfp))
> + return NULL;
> + if (dev->cma_area)
> + return cma_alloc_aligned(dev->cma_area, size, gfp);
> + if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> + return NULL;
> + return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
>  }
> 
>  /**
> --
> 2.27.0
Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v3 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-23 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, July 24, 2020 12:01 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; m.szyprow...@samsung.com;
> robin.mur...@arm.com; w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport ;
> Zengtao (B) ; huangdaode
> 
> Subject: Re: [PATCH v3 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On Wed, Jul 22, 2020 at 09:41:50PM +, Song Bao Hua (Barry Song)
> wrote:
> > I got a kernel robot warning which said dev should be checked before
> > being accessed when I did a similar change in v1. Probably it was an
> > invalid warning if dev should never be null.
> 
> That usually shows up if a function is inconsistent about sometimes checking
> it and sometimes not.
> 
> > Yes, it looks much better.
> 
> Below is a prep patch to rebase on top of:

Thanks for letting me know.

Will rebase on top of your patch.

> 
> ---
> From b81a5e1da65fce9750f0a8b66dbb6f842cbfdd4d Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Wed, 22 Jul 2020 16:33:43 +0200
> Subject: dma-contiguous: cleanup dma_alloc_contiguous
> 
> Split out a cma_alloc_aligned helper to deal with the "interesting"
> calling conventions for cma_alloc, which then allows the main function to
> be written straightforwardly.  This also takes advantage of the fact that NULL
> dev arguments have been gone from the DMA API for a while.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/contiguous.c | 31 ++-
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index
> 15bc5026c485f2..cff7e60968b9e1 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -215,6 +215,13 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
>   return cma_release(dev_get_cma_area(dev), pages, count);  }
> 
> +static struct page *cma_alloc_aligned(struct cma *cma, size_t size, gfp_t gfp)
> +{
> + unsigned int align = min(get_order(size), CONFIG_CMA_ALIGNMENT);
> +
> + return cma_alloc(cma, size >> PAGE_SHIFT, align, gfp & __GFP_NOWARN);
> +}
> +
>  /**
>   * dma_alloc_contiguous() - allocate contiguous pages
>   * @dev:   Pointer to device for which the allocation is performed.
> @@ -231,24 +238,14 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
>   */
>  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> - size_t count = size >> PAGE_SHIFT;
> - struct page *page = NULL;
> - struct cma *cma = NULL;
> -
> - if (dev && dev->cma_area)
> - cma = dev->cma_area;
> - else if (count > 1)
> - cma = dma_contiguous_default_area;
> -
>   /* CMA can be used only in the context which permits sleeping */
> - if (cma && gfpflags_allow_blocking(gfp)) {
> - size_t align = get_order(size);
> - size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
> -
> - page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
> - }
> -
> - return page;
> + if (!gfpflags_allow_blocking(gfp))
> + return NULL;
> + if (dev->cma_area)
> + return cma_alloc_aligned(dev->cma_area, size, gfp);
> + if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> + return NULL;
> + return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
>  }
> 
>  /**
> --
> 2.27.0

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v3 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, July 23, 2020 2:30 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v3 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
+cc Prime and Daode who are interested in this patchset.

> On Sun, Jun 28, 2020 at 11:12:50PM +1200, Barry Song wrote:
> >  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t
> gfp)
> >  {
> > size_t count = size >> PAGE_SHIFT;
> > struct page *page = NULL;
> > struct cma *cma = NULL;
> > +   int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
> > +   bool alloc_from_pernuma = false;
> > +
> > +   if ((count <= 1) && !(dev && dev->cma_area))
> > +   return NULL;
> >
> > if (dev && dev->cma_area)
> > cma = dev->cma_area;
> > -   else if (count > 1)
> > +	else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
> > +			&& !(gfp & (GFP_DMA | GFP_DMA32))) {
> > +   cma = dma_contiguous_pernuma_area[nid];
> > +   alloc_from_pernuma = true;
> > +   } else {
> > cma = dma_contiguous_default_area;
> > +   }
> 
> I find the function rather confusing now.  What about something
> like (this relies on the fact that dev should never be NULL in the
> DMA API)
> 
> struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
>   size_t cma_align = min_t(size_t, get_order(size), CONFIG_CMA_ALIGNMENT);
>   size_t count = size >> PAGE_SHIFT;
> 
>   if (!gfpflags_allow_blocking(gfp))
>   return NULL;
>   gfp &= __GFP_NOWARN;
> 
>   if (dev->cma_area)

I got a kernel robot warning which said dev should be checked before being 
accessed
when I did a similar change in v1. Probably it was an invalid warning if dev 
should
never be null.

>   return cma_alloc(dev->cma_area, count, cma_align, gfp);
>   if (count <= 1)
>   return NULL;
> 
>   if (IS_ENABLED(CONFIG_PERNODE_CMA) && !(gfp & (GFP_DMA | GFP_DMA32))) {
>   int nid = dev_to_node(dev);
>   struct cma *cma = dma_contiguous_pernuma_area[nid];
>   struct page *page;
> 
>   if (cma) {
>   page = cma_alloc(cma, count, cma_align, gfp);
>   if (page)
>   return page;
>   }
>   }
> 
>   return cma_alloc(dma_contiguous_default_area, count, cma_align, gfp);
> }

Yes, it looks much better.

> 
> > +   /*
> > +* otherwise, page is from either per-numa cma or default cma
> > +*/
> > +   int nid = page_to_nid(page);
> > +
> > +   if (nid != NUMA_NO_NODE) {
> > +   if (cma_release(dma_contiguous_pernuma_area[nid], page,
> > +   PAGE_ALIGN(size) >> PAGE_SHIFT))
> > +   return;
> > +   }
> > +
> > +   if (cma_release(dma_contiguous_default_area, page,
> 
> How can page_to_nid ever return NUMA_NO_NODE?

I thought page_to_nid would return NUMA_NO_NODE if CONFIG_NUMA is
not enabled. Probably I was wrong. Will get it fixed in v4.
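
For what it's worth, with !CONFIG_NUMA the generic code boils down to something
like this (paraphrased, not the literal kernel source; every page reports node
0, so NUMA_NO_NODE can never come back):

static inline int page_to_nid(const struct page *page)
{
	return 0;	/* without CONFIG_NUMA, all memory is on node 0 */
}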

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v3 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, July 23, 2020 2:17 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; iommu@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v3 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 

+cc Prime and Daode who are interested in this patchset.

> On Sun, Jun 28, 2020 at 11:12:50PM +1200, Barry Song wrote:
> > This is useful for at least two scenarios:
> > 1. ARM64 smmu will get memory from local numa node, it can save its
> > command queues and page tables locally. Tests show it can decrease
> > dma_unmap latency at lot. For example, without this patch, smmu on
> > node2 will get memory from node0 by calling dma_alloc_coherent(),
> > typically, it has to wait for more than 560ns for the completion of
> > CMD_SYNC in an empty command queue; with this patch, it needs 240ns
> > only.
> > 2. when we set iommu passthrough, drivers will get memory from CMA,
> > local memory means much less latency.
> 
> I really don't like the config options.  With the boot parameters
> you can always hardcode that in CONFIG_CMDLINE anyway.

I understand your concern. Anyway, the primary purpose of this patchset is to
provide a general way for users like the IOMMU to get local coherent DMA buffers
to put their command queues and page tables in. The first use case is what
really made me begin to prepare this patchset.
The second case is probably a positive side effect of this patchset for those
users who care more about performance than DMA security; they may bypass the
IOMMU via:

	iommu.passthrough=
		[ARM64, X86] Configure DMA to bypass the IOMMU by default.
		Format: { "0" | "1" }
		0 - Use IOMMU translation for DMA.
		1 - Bypass the IOMMU for DMA.
		unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
In this case, they can get local memory and get better performance.
However, it is not the primary purpose of this patchset.
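
For example, a user wanting both behaviors could boot with something like (an
illustrative command line; pernuma_cma comes from this patchset):

	iommu.passthrough=1 pernuma_cma=16M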

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI polling for CMD SYNC

2020-07-20 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, July 17, 2020 9:06 PM
> To: 'Robin Murphy' ; w...@kernel.org;
> j...@8bytes.org
> Cc: linux-ker...@vger.kernel.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; iommu@lists.linux-foundation.org;
> Zengtao (B) 
> Subject: RE: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI
> polling for CMD SYNC
> 
> 
> 
> > -Original Message-
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Friday, July 17, 2020 8:55 PM
> > To: Song Bao Hua (Barry Song) ;
> w...@kernel.org;
> > j...@8bytes.org
> > Cc: linux-ker...@vger.kernel.org; Linuxarm ;
> > linux-arm-ker...@lists.infradead.org; iommu@lists.linux-foundation.org;
> > Zengtao (B) 
> > Subject: Re: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI
> > polling for CMD SYNC
> >
> > On 2020-07-17 00:07, Barry Song wrote:
> > > Before commit 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during
> > > command-queue insertion"), msi polling perhaps performed better since
> > > it could run outside the spin_lock_irqsave() while the code polling cons
> > > reg was running in the lock.
> > >
> > > But after the great reorganization of smmu queue, neither of these two
> > > polling methods are running in a spinlock. And real tests show polling
> > > cons reg via sev means smaller latency. It is probably because polling
> > > by msi will ask hardware to write memory but sev polling depends on the
> > > update of register only.
> > >
> > > Using 16 threads to run netperf on hns3 100G NIC with UDP packet size
> > > in 32768bytes and set iommu to strict, TX throughput can improve from
> > > 25227.74Mbps to 27145.59Mbps by this patch. In this case, SMMU is super
> > > busy as hns3 sends map/unmap requests extremely frequently.
> >
> > How many different systems and SMMU implementations are those numbers
> > representative of? Given that we may have cases where the SMMU can use
> > MSIs but can't use SEV, so would have to fall back to inefficient
> > busy-polling, I'd be wary of removing this entirely. Allowing particular
> > platforms or SMMU implementations to suppress MSI functionality if they
> > know for sure it makes sense seems like a safer bet.
> >
> Hello Robin,
> 
> Thanks for taking a look. Actually I was really struggling to find a good way
> to make every platform happy. And I don't have other platforms on which to
> test whether they run better with SEV polling. Even if two platforms have
> completely the same SMMU features, it is still possible that they behave
> differently. So I simply sent this patch to start the discussion and gather
> opinions.
> 
> At the very beginning, I wanted to have a module parameter for users to decide
> whether MSI polling should be disabled. But the module parameter might be
> totally ignored by a Linux distro.
> 
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
>  MODULE_PARM_DESC(disable_bypass,
> 	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
> 
> +static bool disable_msipolling = 1;
> +module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
> +MODULE_PARM_DESC(disable_msipolling,
> +	"Don't use MSI to poll the completion of CMD_SYNC if it is slower than SEV");
> +
>  enum pri_resp {
> 	PRI_RESP_DENY = 0,
> 	PRI_RESP_FAIL = 1,
> @@ -992,7 +997,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
> 	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
> 	 * payload, so the write will zero the entire command on that platform.
> 	 */
> -	if (smmu->features & ARM_SMMU_FEAT_MSI &&
> +	if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
> 	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
> 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
> 				   q->ent_dwords * 8;
> @@ -1332,7 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
>  static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
> 					 struct arm_smmu_ll_queue *llq)
>  {
> -	if (smmu->features & ARM_SMMU_FEAT_MSI &&

RE: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI polling for CMD SYNC

2020-07-17 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, July 17, 2020 8:55 PM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> j...@8bytes.org
> Cc: linux-ker...@vger.kernel.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; iommu@lists.linux-foundation.org;
> Zengtao (B) 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI
> polling for CMD SYNC
> 
> On 2020-07-17 00:07, Barry Song wrote:
> > Before commit 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during
> > command-queue insertion"), msi polling perhaps performed better since
> > it could run outside the spin_lock_irqsave() while the code polling cons
> > reg was running in the lock.
> >
> > But after the great reorganization of smmu queue, neither of these two
> > polling methods are running in a spinlock. And real tests show polling
> > cons reg via sev means smaller latency. It is probably because polling
> > by msi will ask hardware to write memory but sev polling depends on the
> > update of register only.
> >
> > Using 16 threads to run netperf on hns3 100G NIC with UDP packet size
> > in 32768bytes and set iommu to strict, TX throughput can improve from
> > 25227.74Mbps to 27145.59Mbps by this patch. In this case, SMMU is super
> > busy as hns3 sends map/unmap requests extremely frequently.
> 
> How many different systems and SMMU implementations are those numbers
> representative of? Given that we may have cases where the SMMU can use
> MSIs but can't use SEV, so would have to fall back to inefficient
> busy-polling, I'd be wary of removing this entirely. Allowing particular
> platforms or SMMU implementations to suppress MSI functionality if they
> know for sure it makes sense seems like a safer bet.
> 
Hello Robin,

Thanks for taking a look. Actually I was really struggling to find a good way to
make every platform happy. And I don't have other platforms on which to test
whether they run better with SEV polling. Even if two platforms have completely
the same SMMU features, it is still possible that they behave differently. So I
simply sent this patch to start the discussion and gather opinions.

At the very beginning, I wanted to have a module parameter for users to decide
whether MSI polling should be disabled. But the module parameter might be
totally ignored by a Linux distro.

--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling = 1;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+	"Don't use MSI to poll the completion of CMD_SYNC if it is slower than SEV");
+
 enum pri_resp {
	PRI_RESP_DENY = 0,
	PRI_RESP_FAIL = 1,
@@ -992,7 +997,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
	 * payload, so the write will zero the entire command on that platform.
	 */
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
+	if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
				   q->ent_dwords * 8;
@@ -1332,7 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
					 struct arm_smmu_ll_queue *llq)
 {
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
+	if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
		return __arm_smmu_cmdq_poll_until_msi(smmu, llq);


Another option is not to use a module parameter at all; alternatively, we check
the vendor/chip ID, and if the chip has better performance with SEV polling, we
set disable_msipolling to true.
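
A minimal sketch of that option, reusing the IIDR macros from the v2 patch (the
helper name and call site are illustrative):

static void arm_smmu_check_impl(struct arm_smmu_device *smmu)
{
	u32 iidr = readl_relaxed(smmu->base + ARM_SMMU_IIDR);

	/* disable MSI polling on implementations known to poll faster by SEV */
	if (FIELD_GET(IIDR_IMPLEMENTER, iidr) == IMPLEMENTER_HISILICON)
		smmu->options |= ARM_SMMU_OPT_DISABLE_MSIPOLL;
}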

You are very welcome to give your suggestions.

Thanks
Barry
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v3 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-07-12 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Sunday, June 28, 2020 11:13 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Song Bao
> Hua (Barry Song) 
> Subject: [PATCH v3 0/2] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> Ganapatrao Kulkarni has put some effort on making arm-smmu-v3 use local
> memory to save command queues[1]. I also did similar job in patch
> "iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
> [2] while not realizing Ganapatrao has done that before.
> 
> But it seems it is much better to make dma_alloc_coherent() to be inherently
> NUMA-aware on NUMA-capable systems.
> 
> Right now, smmu is using dma_alloc_coherent() to get memory to save queues
> and tables. Typically, on ARM64 server, there is a default CMA located at
> node0, which could be far away from node2, node3 etc.
> Saving queues and tables remotely will increase the latency of ARM SMMU
> significantly. For example, when SMMU is at node2 and the default global
> CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
> have to wait more than 550ns for the completion of the command
> CMD_SYNC.
> However, if we save them locally, we only need to wait for 240ns.
> 
> with per-numa CMA, smmu will get memory from local numa node to save
> command queues and page tables. that means dma_unmap latency will be
> shrunk much.
> 
> Meanwhile, when iommu.passthrough is on, device drivers which call dma_
> alloc_coherent() will also get local memory and avoid the travel between
> numa nodes.
> 
> [1]
> https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.html
> [2] https://www.spinics.net/lists/iommu/msg44767.html
> 
> -v3:
>   * move to use page_to_nid() while freeing cma with respect to Robin's
>   comment, but this will only work after applying my below patch:
>   "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
>   https://marc.info/?l=linux-mm&m=159333034726647&w=2
> 
>   * handle the case count <= 1 more properly according to Robin's
>   comment;
> 
>   * add pernuma_cma parameter to support dynamic setting of per-numa
>   cma size;
>   ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
>   "cma=" kernel parameter and avoid a new paramter separately for per-
>   numa cma. Practically, it is really too complicated considering the
>   below problems:
>   (1) if we leverage the size of default numa for per-numa, we have to
>   avoid creating two cma with same size in node0 since default cma is
>   probably on node0.
>   (2) default cma can consider the address limitation for old devices
>   while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
>   allocations with limitation flags will fallback to default one.
>   (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
>   decide if the percentage should apply to the whole memory size
>   or only apply to the memory size of a specific numa node.
>   (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
>   makes things even more complicated to per-numa cma.
> 
>   I haven't figured out a good way to leverage the size of default cma
>   for per-numa cma. it seems a separate parameter for per-numa could
>   make life easier.
> 
>   * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
>   reuse the comment before hugetlb_cma_reserve() with respect to
>   Robin's comment
> 
> -v2:
>   * fix some issues reported by kernel test robot
>   * fallback to default cma while allocation fails in per-numa cma
>  free memory properly
> 
> Barry Song (2):
>   dma-direct: provide the ability to reserve per-numa CMA
>   arm64: mm: reserve per-numa CMA to localize coherent dma buffers
> 
>  .../admin-guide/kernel-parameters.txt |  9 ++
>  arch/arm64/mm/init.c  |  2 +
>  include/linux/dma-contiguous.h|  4 +
>  kernel/dma/Kconfig| 10 ++
>  kernel/dma/contiguous.c   | 98
> +--
>  5 files changed, 114 insertions(+), 9 deletions(-)

Gentle ping :-)

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH net] xsk: remove cheap_dma optimization

2020-07-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Christoph Hellwig
> Sent: Wednesday, July 8, 2020 6:50 PM
> To: Robin Murphy 
> Cc: Björn Töpel ; Christoph Hellwig ;
> Daniel Borkmann ; maxi...@mellanox.com;
> konrad.w...@oracle.com; jonathan.le...@gmail.com;
> linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> net...@vger.kernel.org; b...@vger.kernel.org; da...@davemloft.net;
> magnus.karls...@intel.com
> Subject: Re: [PATCH net] xsk: remove cheap_dma optimization
> 
> On Mon, Jun 29, 2020 at 04:41:16PM +0100, Robin Murphy wrote:
> > On 2020-06-28 18:16, Björn Töpel wrote:
> >>
> >> On 2020-06-27 09:04, Christoph Hellwig wrote:
> >>> On Sat, Jun 27, 2020 at 01:00:19AM +0200, Daniel Borkmann wrote:
>  Given there is roughly a ~5 weeks window at max where this removal
> could
>  still be applied in the worst case, could we come up with a fix /
>  proposal
>  first that moves this into the DMA mapping core? If there is something
>  that
>  can be agreed upon by all parties, then we could avoid re-adding the 9%
>  slowdown. :/
> >>>
> >>> I'd rather turn it upside down - this abuse of the internals blocks work
> >>> that has basically just missed the previous window and I'm not going
> >>> to wait weeks to sort out the API misuse.  But we can add optimizations
> >>> back later if we find a sane way.
> >>>
> >>
> >> I'm not super excited about the performance loss, but I do get
> >> Christoph's frustration about gutting the DMA API making it harder for
> >> DMA people to get work done. Lets try to solve this properly using
> >> proper DMA APIs.
> >>
> >>
> >>> That being said I really can't see how this would make so much of a
> >>> difference.  What architecture and what dma_ops are you using for
> >>> those measurements?  What is the workload?
> >>>
> >>
> >> The 9% is for an AF_XDP (Fast raw Ethernet socket. Think AF_PACKET, but
> >> faster.) benchmark: receive the packet from the NIC, and drop it. The DMA
> >> syncs stand out in the perf top:
> >>
> >>    28.63%  [kernel]                   [k] i40e_clean_rx_irq_zc
> >>    17.12%  [kernel]                   [k] xp_alloc
> >>     8.80%  [kernel]                   [k] __xsk_rcv_zc
> >>     7.69%  [kernel]                   [k] xdp_do_redirect
> >>     5.35%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
> >>     4.77%  [kernel]                   [k] xsk_rcv.part.0
> >>     4.07%  [kernel]                   [k] __xsk_map_redirect
> >>     3.80%  [kernel]                   [k] dma_direct_sync_single_for_cpu
> >>     3.03%  [kernel]                   [k] dma_direct_sync_single_for_device
> >>     2.76%  [kernel]                   [k] i40e_alloc_rx_buffers_zc
> >>     1.83%  [kernel]                   [k] xsk_flush
> >> ...
> >>
> >> For this benchmark the dma_ops are NULL (dma_is_direct() == true), and
> >> the main issue is that SWIOTLB is now unconditionally enabled [1] for
> >> x86, and for each sync we have to check if is_swiotlb_buffer(),
> >> which involves some costly indirection.
> >>
> >> That was pretty much what my hack avoided. Instead we did all the checks
> >> upfront, since AF_XDP has long-term DMA mappings, and just set a flag
> >> for that.
> >>
> >> Avoiding the whole "is this address swiotlb" in
> >> dma_direct_sync_single_for_{cpu, device]() per-packet
> >> would help a lot.
> >
> > I'm pretty sure that's one of the things we hope to achieve with the
> > generic bypass flag :)
> >
> >> Somewhat related to the DMA API; It would have performance benefits for
> >> AF_XDP if the DMA range of the mapped memory was linear, i.e. by IOMMU
> >> utilization. I've started hacking a thing a little bit, but it would be
> >> nice if such API was part of the mapping core.
> >>
> >> Input: array of pages Output: array of dma addrs (and obviously dev,
> >> flags and such)
> >>
> >> For non-IOMMU len(array of pages) == len(array of dma addrs)
> >> For best-case IOMMU len(array of dma addrs) == 1 (large linear space)
> >>
> >> But that's for later. :-)
> >
> > FWIW you will typically get that behaviour from IOMMU-based
> implementations
> > of dma_map_sg() right now, although it's not strictly guaranteed. If you
> > can weather some additional setup cost of calling
> > sg_alloc_table_from_pages() plus walking the list after mapping to test
> > whether you did get a contiguous result, you could start taking advantage
> > of it as some of the dma-buf code in DRM and v4l2 does already (although
> > those cases actually treat it as a strict dependency rather than an
> > optimisation).
> 
> Yikes.
> 
> > I'm inclined to agree that if we're going to see more of these cases, a new
> > API call that did formally guarantee a DMA-contiguous mapping (either via
> > IOMMU or bounce buffering) or failure might indeed be handy.
> 
> I was planning on adding a dma-level API to add more pages to an
> IOMMU batch, but was waiting for at 

RE: [PATCH] iommu/arm-smmu-v3: allocate the memory of queues in local numa node

2020-07-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Saturday, July 4, 2020 4:22 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> linux-arm-ker...@lists.infradead.org; iommu@lists.linux-foundation.org;
> Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: allocate the memory of queues in
> local numa node
> 
> On Mon, Jun 01, 2020 at 11:31:41PM +1200, Barry Song wrote:
> > dmam_alloc_coherent() will usually allocate memory from the default CMA. For
> > a common arm64 defconfig without reserved memory in device tree, there is
> > only one CMA close to address 0.
> > dma_alloc_contiguous() will allocate memory without any idea of NUMA and smmu
> > has no customized per-numa cma_area.
> > struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> > {
> > size_t count = size >> PAGE_SHIFT;
> > struct page *page = NULL;
> > struct cma *cma = NULL;
> >
> > if (dev && dev->cma_area)
> > cma = dev->cma_area;
> > else if (count > 1)
> > cma = dma_contiguous_default_area;
> >
> > ...
> > return page;
> > }
> >
> > if there are N numa nodes, N-1 nodes will put command/evt queues etc in a
> > remote node the default CMA belongs to, probably node 0. Tests show, after
> > sending CMD_SYNC in an empty command queue,
> > on Node2, without this patch, it takes 550ns to wait for the completion
> > of CMD_SYNC; with this patch, it takes 250ns to wait for the completion
> > of CMD_SYNC.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 63
> -
> >  1 file changed, 48 insertions(+), 15 deletions(-)
> 
> I would prefer that the coherent DMA allocator learned about NUMA, rather
> than we bodge drivers to use the streaming API where it doesn't really
> make sense.
> 
> I see that you've posted other patches to do that (thanks!), so I'll
> disregard this series.

Thanks for taking a look, Will. Sure, I am using the per-numa CMA patchset to
replace this patch, so it is OK to ignore this one.


> 
> Cheers,
> 
> Will

Thanks
Barry

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to users in sysfs

2020-07-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Saturday, July 4, 2020 4:22 AM
> To: Song Bao Hua (Barry Song) 
> Cc: robin.mur...@arm.com; h...@lst.de; m.szyprow...@samsung.com;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> Linuxarm 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to
> users in sysfs
> 
> On Sat, May 30, 2020 at 09:15:05PM +1200, Barry Song wrote:
> > As tests show the latency of dma_unmap can increase dramatically while
> > calling them cross NUMA nodes, especially cross CPU packages, eg.
> > 300ns vs 800ns while waiting for the completion of CMD_SYNC in an
> > empty command queue. The large latency causing by remote node will
> > in turn make contention of the command queue more serious, and enlarge
> > the latency of DMA users within local NUMA nodes.
> >
> > Users might intend to enforce NUMA locality with the consideration of
> > the position of SMMU. The patch provides minor benefit by presenting
> > this information to users directly, as they might want to know it without
> > checking hardware spec at all.
> 
> I don't think that's a very good reason to expose things to userspace.
> I know sysfs shouldn't be treated as ABI, but the grim reality is that
> once somebody relies on this stuff then we can't change it, so I'd
> rather avoid exposing it unless it's absolutely necessary.

Will, thanks for taking a look!

I am not sure it is absolutely necessary, but it is useful. The whole story
started with some users who wanted to see the hardware topology clearly by
reading a sysfs node, just as they can for PCI devices. The intention is that
users can learn the hardware topology of various devices easily from Linux,
since they may not know all the hardware details.

For PCI devices, the kernel has already done that. And some drivers outside
PCI expose numa_node as well. It seems hard to argue it is absolutely
necessary for them either, since sysfs shouldn't be treated as ABI.

I got some input from Linux users who also wanted to know the NUMA node of
devices which are not PCI, for example, platform devices, and I thought the
requirement was reasonable. So I also sent another patch to support this kind
of requirement generally; with the patch below, this smmu patch is not
necessary any more:
https://lkml.org/lkml/2020/6/18/1257

For platform devices created by ARM ACPI/IORT and the generic
acpi_create_platform_device():
drivers/acpi/scan.c:
static void acpi_default_enumeration(struct acpi_device *device)
{
...
if (!device->flags.enumeration_by_parent) {
acpi_create_platform_device(device, NULL);
acpi_device_set_enumerated(device);
}
}

struct platform_device *acpi_create_platform_device(struct acpi_device *adev,
struct property_entry *properties)
{
...

pdev = platform_device_register_full(&pdevinfo);
if (IS_ERR(pdev))
...
else {
set_dev_node(&pdev->dev, acpi_get_node(adev->handle));
...
}
...
}
numa_node is set for these devices.
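
Once such an attribute exists, userspace can read it directly. A minimal
sketch of a reader (the sysfs path is hypothetical; the actual
platform-device name varies by system):

#include <stdio.h>

int main(void)
{
	/* hypothetical path; adjust to the real device name on your system */
	FILE *f = fopen("/sys/devices/platform/arm-smmu-v3.0.auto/numa_node", "r");
	int node;

	if (!f)
		return 1;
	if (fscanf(f, "%d", &node) == 1)
		printf("device numa node: %d\n", node);
	fclose(f);
	return 0;
}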

Anyway, I just wanted to explain the background: some people want to learn the
hardware topology from Linux in the same simple way, and that seems a
reasonable requirement to me :-)

> 
> Thanks,
> 
> Will

Thanks
barry


RE: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-06-26 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Robin Murphy
> Sent: Thursday, June 25, 2020 11:11 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Jonathan
> Cameron ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On 2020-06-25 08:43, Barry Song wrote:
> > This is useful for at least two scenarios:
> > 1. ARM64 smmu will get memory from local numa node, it can save its
> > command queues and page tables locally. Tests show it can decrease
> > dma_unmap latency a lot. For example, without this patch, smmu on
> > node2 will get memory from node0 by calling dma_alloc_coherent(),
> > typically, it has to wait for more than 560ns for the completion of
> > CMD_SYNC in an empty command queue; with this patch, it needs 240ns
> > only.
> > 2. when we set iommu passthrough, drivers will get memory from CMA,
> > local memory means much less latency.
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >   include/linux/dma-contiguous.h |  4 ++
> >   kernel/dma/Kconfig | 10 
> >   kernel/dma/contiguous.c| 99 ++
> >   3 files changed, 104 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/dma-contiguous.h
> b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..278a80a40456 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct
> cma *cma)
> >
> >   void dma_contiguous_reserve(phys_addr_t addr_limit);
> >
> > +void dma_pernuma_cma_reserve(void);
> > +
> >   int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t
> base,
> >phys_addr_t limit, struct cma **res_cma,
> >bool fixed);
> > @@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct
> cma *cma) { }
> >
> >   static inline void dma_contiguous_reserve(phys_addr_t limit) { }
> >
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +
> >   static inline int dma_contiguous_reserve_area(phys_addr_t size,
> phys_addr_t base,
> >phys_addr_t limit, struct cma **res_cma,
> >bool fixed)
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index d006668c0027..aeb976b1d21c 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -104,6 +104,16 @@ config DMA_CMA
> >   if  DMA_CMA
> >   comment "Default contiguous memory area size:"
> >
> > +config CMA_PERNUMA_SIZE_MBYTES
> > +   int "Size in Mega Bytes for per-numa CMA areas"
> > +   depends on NUMA
> > +   default 16 if ARM64
> > +   default 0
> > +   help
> > + Defines the size (in MiB) of the per-numa memory area for Contiguous
> > + Memory Allocator. Every numa node will get a separate CMA with this
> > + size. If the size of 0 is selected, per-numa CMA is disabled.
> > +
> 
> I think this needs to be cleverer than just a static config option.
> Pretty much everything else CMA-related is runtime-configurable to some
> degree, and doing any per-node setup when booting on a single-node
> system would be wasted effort.

I agree some dynamic configuration should be supported to set the size of the
CMA. It could be a kernel boot parameter, or it could leverage an existing one.

For a system with NUMA enabled but only one node, or an actually non-NUMA
system, the current dma_pernuma_cma_reserve() won't do anything:

void __init dma_pernuma_cma_reserve(void)
{
int nid;

if (!pernuma_size_bytes || nr_online_nodes <= 1)
return;
}
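
One shape the dynamic configuration could take is an early_param hook, so the
per-numa size comes from the command line rather than Kconfig alone. A minimal
sketch, assuming a boot parameter named "cma_pernuma":

#include <linux/init.h>
#include <linux/kernel.h>

static phys_addr_t pernuma_size_bytes __initdata;

/* parse "cma_pernuma=<size>" from the kernel command line, e.g.
 * cma_pernuma=16M; memparse() understands the K/M/G suffixes */
static int __init early_cma_pernuma(char *p)
{
	pernuma_size_bytes = memparse(p, &p);
	return 0;
}
early_param("cma_pernuma", early_cma_pernuma);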

> 
> Since this is conceptually very similar to the existing hugetlb_cma
> implementation I'm also wondering about inconsistency with respect to
> specifying per-node vs.

RE: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init

2020-06-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Thursday, June 25, 2020 11:16 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: iommu@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Nicolas
> Saenz Julienne ; Steve Capper
> ; Andrew Morton ;
> Mike Rapoport 
> Subject: Re: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after
> numa_init
> 
> On 2020-06-25 08:43, Barry Song wrote:
> > Right now, smmu is using dma_alloc_coherent() to get memory to save queues
> > and tables. Typically, on an ARM64 server, there is a default CMA located at
> > node0, which could be far away from node2, node3 etc.
> > With this patch, smmu will get memory from the local numa node to save
> > command queues and page tables. That means dma_unmap latency will shrink
> > much.
> > Meanwhile, when iommu.passthrough is on, device drivers which call
> > dma_alloc_coherent() will also get local memory and avoid the travel
> > between numa nodes.
> >
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >   arch/arm64/mm/init.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1e93cfc7c47a..07d4d1fe7983 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -420,6 +420,8 @@ void __init bootmem_init(void)
> >
> > arm64_numa_init();
> >
> > +   dma_pernuma_cma_reserve();
> > +
> 
> It might be worth putting this after the hugetlb_cma_reserve() call for
> clarity, since the comment below applies equally to this call too.

Yep, that looks even better, though dma_pernuma_cma_reserve() is
self-documenting by name.

> 
> Robin.
> 
> > /*
> >  * must be done after arm64_numa_init() which calls numa_init() to
> >  * initialize node_online_map that gets used in hugetlb_cma_reserve()
> >
Thanks
Barry



RE: [PATCH 2/3] arm64: mm: reserve hugetlb CMA after numa_init

2020-06-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Matthias Brugger [mailto:matthias@gmail.com]
> Sent: Monday, June 8, 2020 8:15 AM
> To: Roman Gushchin ; Song Bao Hua (Barry Song)
> 
> Cc: catalin.mari...@arm.com; John Garry ;
> linux-ker...@vger.kernel.org; Linuxarm ;
> iommu@lists.linux-foundation.org; Zengtao (B) ;
> Jonathan Cameron ;
> robin.mur...@arm.com; h...@lst.de; linux-arm-ker...@lists.infradead.org;
> m.szyprow...@samsung.com
> Subject: Re: [PATCH 2/3] arm64: mm: reserve hugetlb CMA after numa_init
> 
> 
> 
> On 03/06/2020 05:22, Roman Gushchin wrote:
> > On Wed, Jun 03, 2020 at 02:42:30PM +1200, Barry Song wrote:
> >> hugetlb_cma_reserve() is called at the wrong place. numa_init has not been
> >> done yet. so all reserved memory will be located at node0.
> >>
> >> Cc: Roman Gushchin 
> >> Signed-off-by: Barry Song 
> >
> > Acked-by: Roman Gushchin 
> >
> 
> When did this break or was it broken since the beginning?
> In any case, could you provide a "Fixes" tag for it, so that it can easily be
> backported to older releases.

I guess it was broken from the very beginning.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cf11e85fc08cc

Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")

Do you think it would be better for me to send a v2 of this patch separately
with this tag, and take it out of my original per-numa CMA patch set?
Please give your suggestion.

Best Regards
Barry

> 
> Regards,
> Matthias
> 
> > Thanks!
> >
> >> ---
> >>  arch/arm64/mm/init.c | 10 +-
> >>  1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index e42727e3568e..8f0e70ebb49d 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -458,11 +458,6 @@ void __init arm64_memblock_init(void)
> >>high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> >>
> >>dma_contiguous_reserve(arm64_dma32_phys_limit);
> >> -
> >> -#ifdef CONFIG_ARM64_4K_PAGES
> >> -  hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> >> -#endif
> >> -
> >>  }
> >>
> >>  void __init bootmem_init(void)
> >> @@ -478,6 +473,11 @@ void __init bootmem_init(void)
> >>min_low_pfn = min;
> >>
> >>arm64_numa_init();
> >> +
> >> +#ifdef CONFIG_ARM64_4K_PAGES
> >> +  hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> >> +#endif
> >> +
> >>/*
> >> * Sparsemem tries to allocate bootmem in memory_present(), so must
> be
> >> * done after the fixed reservations.
> >> --
> >> 2.23.0



RE: [kbuild-all] Re: [PATCH 1/3] dma-direct: provide the ability to reserve per-numa CMA

2020-06-06 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Philip Li [mailto:philip...@intel.com]
> Sent: Saturday, June 6, 2020 3:47 PM
> To: Dan Carpenter 
> Cc: Song Bao Hua (Barry Song) ;
> kbu...@lists.01.org; h...@lst.de; m.szyprow...@samsung.com;
> robin.mur...@arm.com; catalin.mari...@arm.com; l...@intel.com; Dan
> Carpenter ; kbuild-...@lists.01.org;
> iommu@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Linuxarm ; Jonathan
> Cameron ; John Garry
> 
> Subject: Re: [kbuild-all] Re: [PATCH 1/3] dma-direct: provide the ability to
> reserve per-numa CMA
> 
> On Fri, Jun 05, 2020 at 11:57:51AM +0300, Dan Carpenter wrote:
> > On Fri, Jun 05, 2020 at 06:04:31AM +, Song Bao Hua (Barry Song)
> wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> > > > Sent: Thursday, June 4, 2020 11:37 PM
> > > > To: kbu...@lists.01.org; Song Bao Hua (Barry Song)
> > > > ; h...@lst.de;
> > > > m.szyprow...@samsung.com; robin.mur...@arm.com;
> > > > catalin.mari...@arm.com
> > > > Cc: l...@intel.com; Dan Carpenter ;
> > > > kbuild-...@lists.01.org; iommu@lists.linux-foundation.org;
> > > > linux-arm-ker...@lists.infradead.org;
> > > > linux-ker...@vger.kernel.org; Linuxarm ;
> > > > Jonathan Cameron ; John Garry
> > > > 
> > > > Subject: Re: [PATCH 1/3] dma-direct: provide the ability to
> > > > reserve per-numa CMA
> > > >
> > > > Hi Barry,
> > > >
> > > > url:
> > > > https://github.com/0day-ci/linux/commits/Barry-Song/support-per-nu
> > > > ma-CM
> > > > A-for-ARM-server/20200603-104821
> > > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
> > > > for-next/core
> > > > config: x86_64-randconfig-m001-20200603 (attached as .config)
> > > > compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
> > > >
> > > > If you fix the issue, kindly add following tag as appropriate
> > > > Reported-by: kernel test robot 
> > > > Reported-by: Dan Carpenter 
> > >
> > > Dan, thanks! Good catch!
> > > as this patch has not been in mainline yet, is it correct to add these
> "reported-by" in patch v2?
Hi Barry, we provide the suggestion here, but you may decide whether to add
it, as appropriate for your situation. For a patch still under development,
it is not that necessary to add, I think.

Hi Philip, Dan,
Thanks for clarification.
> 
> >
> > These are just an automated email from the zero day bot.  I look over
> > the Smatch warnings and then forward them on.
> >
> > regards,
> > dan carpenter

Best regards
Barry



RE: [PATCH 1/3] dma-direct: provide the ability to reserve per-numa CMA

2020-06-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Thursday, June 4, 2020 11:37 PM
> To: kbu...@lists.01.org; Song Bao Hua (Barry Song)
> ; h...@lst.de; m.szyprow...@samsung.com;
> robin.mur...@arm.com; catalin.mari...@arm.com
> Cc: l...@intel.com; Dan Carpenter ;
> kbuild-...@lists.01.org; iommu@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Linuxarm
> ; Jonathan Cameron
> ; John Garry 
> Subject: Re: [PATCH 1/3] dma-direct: provide the ability to reserve per-numa
> CMA
> 
> Hi Barry,
> 
> url:
> https://github.com/0day-ci/linux/commits/Barry-Song/support-per-numa-CM
> A-for-ARM-server/20200603-104821
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
> for-next/core
> config: x86_64-randconfig-m001-20200603 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 

Dan, thanks! Good catch!
As this patch is not in mainline yet, is it correct to add these
"Reported-by" tags in patch v2?

Barry

> 
> smatch warnings:
> kernel/dma/contiguous.c:274 dma_alloc_contiguous() warn: variable dereferenced before check 'dev' (see line 272)
> 
> # https://github.com/0day-ci/linux/commit/adb919e972c1cac3d8b11905d5258d23d3aac6a4
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout adb919e972c1cac3d8b11905d5258d23d3aac6a4
> vim +/dev +274 kernel/dma/contiguous.c
> 
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  267  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  268  {
> 90ae409f9eb3bc kernel/dma/contiguous.c Christoph Hellwig 2019-08-20  269  	size_t count = size >> PAGE_SHIFT;
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  270  	struct page *page = NULL;
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  271  	struct cma *cma = NULL;
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03 @272  	int nid = dev_to_node(dev);
> 
>                 ^^^ Dereferenced inside function.
> 
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  273  
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23 @274  	if (dev && dev->cma_area)
> 
>                 ^^^ Too late.
> 
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  275  		cma = dev->cma_area;
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03  276  	else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03  277  		 && !(gfp & (GFP_DMA | GFP_DMA32)))
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03  278  		cma = dma_contiguous_pernuma_area[nid];
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  279  	else if (count > 1)
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  280  		cma = dma_contiguous_default_area;
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  281  
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  282  	/* CMA can be used only in the context which permits sleeping */
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  283  	if (cma && gfpflags_allow_blocking(gfp)) {
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
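
For context: the warning means dev_to_node(dev) dereferences dev at line 272
before the NULL check at line 274. A minimal sketch of one possible fix, not
necessarily what v2 ended up doing:

	/* resolve the node only when dev is known to be non-NULL */
	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;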


RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-02 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Tuesday, June 2, 2020 6:11 PM
> To: Song Bao Hua (Barry Song) 
> Cc: raf...@kernel.org; iommu@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Linuxarm
> ; Zengtao (B) ; Robin
> Murphy 
> Subject: Re: [PATCH] driver core: platform: expose numa_node to users in sysfs
> 
> On Tue, Jun 02, 2020 at 05:09:57AM +, Song Bao Hua (Barry Song) wrote:
> > > >
> > > > Platform devices are NUMA?  That's crazy, and feels like a total
> > > > abuse of platform devices and drivers that really should belong on a
> "real"
> > > > bus.
> > >
> > > I am not sure if it is an abuse of platform device. But smmu is a
> > > platform device, drivers/iommu/arm-smmu-v3.c is a platform driver.
> > > In a typical ARM server, there are maybe multiple SMMU devices which
> > > can support IO virtual address and page tables for other devices on
> > > PCI-like busses.
> > > Each different SMMU device might be close to different NUMA node.
> > > There is really a hardware topology.
> > >
> > > If you have multiple CPU packages in a NUMA server, some platform
> > > devices might Belong to CPU0, some other might belong to CPU1.
> >
> > Those devices are populated by acpi_iort for an ARM server:
> >
> > drivers/acpi/arm64/iort.c:
> >
> > static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
> > .name = "arm-smmu-v3",
> > .dev_dma_configure = arm_smmu_v3_dma_configure,
> > .dev_count_resources = arm_smmu_v3_count_resources,
> > .dev_init_resources = arm_smmu_v3_init_resources,
> > 	.dev_set_proximity = arm_smmu_v3_set_proximity,
> > };
> >
> > void __init acpi_iort_init(void)
> > {
> > acpi_status status;
> >
> > 	status = acpi_get_table(ACPI_SIG_IORT, 0, &iort_table);
> > ...
> > iort_check_id_count_workaround(iort_table);
> > iort_init_platform_devices();
> > }
> >
> > static void __init iort_init_platform_devices(void)
> > {
> > ...
> >
> > for (i = 0; i < iort->node_count; i++) {
> > if (iort_node >= iort_end) {
> > 			pr_err("iort node pointer overflows, bad table\n");
> > return;
> > }
> >
> > iort_enable_acs(iort_node);
> >
> > ops = iort_get_dev_cfg(iort_node);
> > if (ops) {
> > fwnode = acpi_alloc_fwnode_static();
> > if (!fwnode)
> > return;
> >
> > iort_set_fwnode(iort_node, fwnode);
> >
> > ret = iort_add_platform_device(iort_node, ops);
> > if (ret) {
> > iort_delete_fwnode(iort_node);
> > acpi_free_fwnode_static(fwnode);
> > return;
> > }
> > }
> >
> > ...
> > }
> > ...
> > }
> >
> > NUMA node is got from ACPI:
> >
> > static int  __init arm_smmu_v3_set_proximity(struct device *dev,
> > 					     struct acpi_iort_node *node)
> > {
> > struct acpi_iort_smmu_v3 *smmu;
> >
> > smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
> > if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
> > int dev_node = acpi_map_pxm_to_node(smmu->pxm);
> >
> > ...
> >
> > set_dev_node(dev, dev_node);
> > ...
> > }
> > return 0;
> > }
> >
> > Barry
> 
> That's fine, but those are "real" devices, not platform devices, right?
> 

Most platform devices are "real" memory-mapped hardware devices. On an
embedded system, almost all "simple-bus" devices are populated from device
trees as platform devices. Only some platform devices are not "real" hardware.

The smmu is a memory-mapped device, just like most other platform devices,
living in a memory space mapped into the CPU's address space. It uses
ioremap() to map its registers and readl()/writel() to access them.
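
For illustration, a skeleton of such a memory-mapped platform driver might
look like this (the driver name and register layout are made up):

#include <linux/err.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/platform_device.h>

static int demo_probe(struct platform_device *pdev)
{
	struct resource *res;
	void __iomem *base;

	/* map the device's MMIO window, much as the SMMU maps its registers */
	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	base = devm_ioremap_resource(&pdev->dev, res);
	if (IS_ERR(base))
		return PTR_ERR(base);

	/* registers are then accessed with readl()/writel() */
	dev_info(&pdev->dev, "reg0: 0x%x, numa node: %d\n",
		 readl(base), dev_to_node(&pdev->dev));
	return 0;
}

static struct platform_driver demo_driver = {
	.probe = demo_probe,
	.driver = { .name = "demo-mmio" },
};
module_platform_driver(demo_driver);
MODULE_LICENSE("GPL");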

> What platform device has this issue?  What one will show up this way with
> the new patch?

If platform devices could never be real hardware, then no platform device
would have a hardware topology. But platform devices are "real" hardware most
of the time. The smmu is a "real" device, yet it is a platform device in
Linux.

> 
> thanks,
> 
> greg k-h

-barry



RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-01 Thread Song Bao Hua (Barry Song)
> >
> > Platform devices are NUMA?  That's crazy, and feels like a total abuse
> > of platform devices and drivers that really should belong on a "real"
> > bus.
> 
> I am not sure if it is an abuse of platform device. But smmu is a platform
> device,
> drivers/iommu/arm-smmu-v3.c is a platform driver.
> In a typical ARM server, there are maybe multiple SMMU devices which can
> support
> IO virtual address and page tables for other devices on PCI-like busses.
> Each different SMMU device might be close to different NUMA node. There is
> really a hardware topology.
> 
> If you have multiple CPU packages in a NUMA server, some platform devices
> might
> Belong to CPU0, some other might belong to CPU1.

Those devices are populated by acpi_iort for an ARM server:

drivers/acpi/arm64/iort.c:

static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
.name = "arm-smmu-v3",
.dev_dma_configure = arm_smmu_v3_dma_configure,
.dev_count_resources = arm_smmu_v3_count_resources,
.dev_init_resources = arm_smmu_v3_init_resources,
.dev_set_proximity = arm_smmu_v3_set_proximity,
};

void __init acpi_iort_init(void)
{
acpi_status status;

status = acpi_get_table(ACPI_SIG_IORT, 0, &iort_table);
...
iort_check_id_count_workaround(iort_table);
iort_init_platform_devices();
}

static void __init iort_init_platform_devices(void)
{
...

for (i = 0; i < iort->node_count; i++) {
if (iort_node >= iort_end) {
pr_err("iort node pointer overflows, bad table\n");
return;
}

iort_enable_acs(iort_node);

ops = iort_get_dev_cfg(iort_node);
if (ops) {
fwnode = acpi_alloc_fwnode_static();
if (!fwnode)
return;

iort_set_fwnode(iort_node, fwnode);

ret = iort_add_platform_device(iort_node, ops);
if (ret) {
iort_delete_fwnode(iort_node);
acpi_free_fwnode_static(fwnode);
return;
}
}

...
}
...
}

NUMA node is got from ACPI:

static int  __init arm_smmu_v3_set_proximity(struct device *dev,
  struct acpi_iort_node *node)
{
struct acpi_iort_smmu_v3 *smmu;

smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
int dev_node = acpi_map_pxm_to_node(smmu->pxm);

...

set_dev_node(dev, dev_node);
...
}
return 0;
}

Barry




RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Tuesday, June 2, 2020 4:24 PM
> To: Song Bao Hua (Barry Song) 
> Cc: raf...@kernel.org; iommu@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; Linuxarm
> ; Zengtao (B) ; Robin
> Murphy 
> Subject: Re: [PATCH] driver core: platform: expose numa_node to users in sysfs
> 
> On Tue, Jun 02, 2020 at 03:01:39PM +1200, Barry Song wrote:
> > For some platform devices like iommu, particually ARM smmu, users may
> > care about the numa locality. for example, if threads and drivers run
> > near iommu, they may get much better performance on dma_unmap_sg.
> > For other platform devices, users may still want to know the hardware
> > topology.
> >
> > Cc: Prime Zeng 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> >  drivers/base/platform.c | 26 +-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > index b27d0f6c18c9..7794b9a38d82 100644
> > --- a/drivers/base/platform.c
> > +++ b/drivers/base/platform.c
> > @@ -1062,13 +1062,37 @@ static ssize_t driver_override_show(struct
> device *dev,
> >  }
> >  static DEVICE_ATTR_RW(driver_override);
> >
> > +static ssize_t numa_node_show(struct device *dev,
> > +   struct device_attribute *attr, char *buf)
> > +{
> > +   return sprintf(buf, "%d\n", dev_to_node(dev));
> > +}
> > +static DEVICE_ATTR_RO(numa_node);
> > +
> > +static umode_t platform_dev_attrs_visible(struct kobject *kobj, struct
> attribute *a,
> > +   int n)
> > +{
> > +   struct device *dev = container_of(kobj, typeof(*dev), kobj);
> > +
> > +   if (a == _attr_numa_node.attr &&
> > +   dev_to_node(dev) == NUMA_NO_NODE)
> > +   return 0;
> > +
> > +   return a->mode;
> > +}
> >
> >  static struct attribute *platform_dev_attrs[] = {
> > _attr_modalias.attr,
> > +   _attr_numa_node.attr,
> > _attr_driver_override.attr,
> > NULL,
> >  };
> > -ATTRIBUTE_GROUPS(platform_dev);
> > +
> > +static struct attribute_group platform_dev_group = {
> > +   .attrs = platform_dev_attrs,
> > +   .is_visible = platform_dev_attrs_visible,
> > +};
> > +__ATTRIBUTE_GROUPS(platform_dev);
> >
> >  static int platform_uevent(struct device *dev, struct kobj_uevent_env *env)
> >  {
> 
> Platform devices are NUMA?  That's crazy, and feels like a total abuse
> of platform devices and drivers that really should belong on a "real"
> bus.

I am not sure it is an abuse of platform devices. But the smmu is a platform
device, and drivers/iommu/arm-smmu-v3.c is a platform driver.
In a typical ARM server, there may be multiple SMMU devices which provide IO
virtual addresses and page tables for other devices on PCI-like busses.
Each SMMU device might be close to a different NUMA node. There really is a
hardware topology.

If you have multiple CPU packages in a NUMA server, some platform devices
might belong to CPU0, and some others might belong to CPU1.

-barry

> 
> What drivers in the kernel today have this issue?
> 
> thanks,
> 
> greg k-h




RE: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to users in sysfs

2020-06-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Tuesday, June 2, 2020 1:14 AM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> h...@lst.de; m.szyprow...@samsung.com; iommu@lists.linux-foundation.org
> Cc: Linuxarm ; linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH] iommu/arm-smmu-v3: expose numa_node attribute to
> users in sysfs
> 
> On 2020-05-30 10:15, Barry Song wrote:
> > As tests show the latency of dma_unmap can increase dramatically while
> > calling them cross NUMA nodes, especially cross CPU packages, eg.
> > 300ns vs 800ns while waiting for the completion of CMD_SYNC in an
> > empty command queue. The large latency causing by remote node will
> > in turn make contention of the command queue more serious, and enlarge
> > the latency of DMA users within local NUMA nodes.
> >
> > Users might intend to enforce NUMA locality with the consideration of
> > the position of SMMU. The patch provides minor benefit by presenting
> > this information to users directly, as they might want to know it without
> > checking hardware spec at all.
> 
> Hmm, given that dev_to_node() is a standard driver model thing, is there
> not already some generic device property that can expose it - and if
> not, should there be? Presumably if userspace cares enough to want to
> know whereabouts in the system an IOMMU is, it probably also cares where
> the actual endpoint devices are too.
> 
> At the very least, it doesn't seem right for it to be specific to one
> single IOMMU driver.

Right now, PCI devices generally get numa_node in sysfs from
drivers/pci/pci-sysfs.c:

static ssize_t numa_node_store(struct device *dev,
   struct device_attribute *attr, const char *buf,
   size_t count)
{
...

add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
	pci_alert(pdev, FW_BUG "Overriding NUMA node to %d.  Contact your vendor for updates.",
		  node);

dev->numa_node = node;
return count;
}

static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
  char *buf)
{
return sprintf(buf, "%d\n", dev->numa_node);
}
static DEVICE_ATTR_RW(numa_node);

For other devices that care about NUMA information, the specific drivers are
doing it themselves, for example:

drivers/dax/bus.c:  if (a == _attr_numa_node.attr && 
!IS_ENABLED(CONFIG_NUMA))
drivers/dax/bus.c:  _attr_numa_node.attr,
drivers/dma/idxd/sysfs.c:   _attr_numa_node.attr,
drivers/hv/vmbus_drv.c: _attr_numa_node.attr,
drivers/nvdimm/bus.c:   _attr_numa_node.attr,
drivers/nvme/host/core.c:   _attr_numa_node.attr,

The smmu is usually a platform device; we can expose numa_node for
platform_device, or globally expose numa_node for the generic "device" if
people don't oppose.

Barry

> 
> Robin.
> 
> > Signed-off-by: Barry Song 
> > ---
> >   drivers/iommu/arm-smmu-v3.c | 40 -
> >   1 file changed, 39 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 82508730feb7..754c4d59498b 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -4021,6 +4021,44 @@ err_reset_pci_ops: __maybe_unused;
> > return err;
> >   }
> >
> > +static ssize_t numa_node_show(struct device *dev,
> > +   struct device_attribute *attr, char *buf)
> > +{
> > +   return sprintf(buf, "%d\n", dev_to_node(dev));
> > +}
> > +static DEVICE_ATTR_RO(numa_node);
> > +
> > +static umode_t arm_smmu_numa_attr_visible(struct kobject *kobj, struct
> attribute *a,
> > +   int n)
> > +{
> > +   struct device *dev = container_of(kobj, typeof(*dev), kobj);
> > +
> > +   if (!IS_ENABLED(CONFIG_NUMA))
> > +   return 0;
> > +
> > +   if (a == _attr_numa_node.attr &&
> > +   dev_to_node(dev) == NUMA_NO_NODE)
> > +   return 0;
> > +
> > +   return a->mode;
> > +}
> > +
> > +static struct attribute *arm_smmu_dev_attrs[] = {
> > +   _attr_numa_node.attr,
> > +   NULL
> > +};
> > +
> > +static struct attribute_group arm_smmu_dev_attrs_group = {
> > +   .attrs  = arm_smmu_dev_attrs,
> > +   .is_visible = arm_smmu_numa_attr_visible,
> > +};
> > +
> > +
> > +static const struct attribute_group *arm_smmu_dev_attrs_groups[] = {
> > +   _smmu_dev_attrs_group,
> > +   NULL,
> >

[PATCH] iommu/arm-smmu-v3: allocate the memory of queues in local numa node

2020-06-01 Thread Song Bao Hua (Barry Song)
> From: Song Bao Hua (Barry Song)
> Sent: Monday, June 1, 2020 11:32 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; w...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org; iommu@lists.linux-foundation.org; Linuxarm
> Subject: [PATCH] iommu/arm-smmu-v3: allocate the memory of queues in local numa node
>
> dmam_alloc_coherent() will usually allocate memory from the default CMA. For
> a common arm64 defconfig without reserved memory in device tree, there is
> only one CMA close to address 0.
> dma_alloc_contiguous() will allocate memory without any idea of NUMA and smmu
> has no customized per-numa cma_area.
> struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> 	size_t count = size >> PAGE_SHIFT;
> 	struct page *page = NULL;
> 	struct cma *cma = NULL;
>
> 	if (dev && dev->cma_area)
> 		cma = dev->cma_area;
> 	else if (count > 1)
> 		cma = dma_contiguous_default_area;
>
> 	...
> 	return page;
> }
>
> if there are N numa nodes, N-1 nodes will put command/evt queues etc in a
> remote node the default CMA belongs to, probably node 0. Tests show, after
> sending CMD_SYNC in an empty command queue, on Node2, without this patch,
> it takes 550ns to wait for the completion of CMD_SYNC; with this patch, it
> takes 250ns to wait for the completion of CMD_SYNC.

Sorry for missing the RFC in the subject.
For the tested platform, the hardware helps with the sync between cpu and
smmu, so there are no sync operations. Please consider this patch as an RFC;
it is only for the concept.

The other option to fix this is creating a per-numa CMA area for smmu and
assigning the per-numa cma_area to the SMMU.
Maybe cma_declare_contiguous_nid(), used by mm/hugetlb.c, can be used by
AARCH64/SMMU.

Or we can completely change CMA to create default per-numa CMA like this:

struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
{
	size_t count = size >> PAGE_SHIFT;
	struct page *page = NULL;
	struct cma *cma = NULL;
+	int nid = dev_to_node(dev);

	if (dev && dev->cma_area)
		cma = dev->cma_area;
	else if (count > 1)
-		cma = dma_contiguous_default_area;
+		cma = dma_contiguous_default_areas[nid];

	...
	return page;
}

Then there is no necessity to assign a per-numa cma_area to
smmu->dev.cma_area.

-barry

> Signed-off-by: Barry Song 
> ---
>  drivers/iommu/arm-smmu-v3.c | 63 -
>  1 file changed, 48 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 82508730feb7..58295423e1d7 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -3157,21 +3157,23 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
>     size_t dwords, const char *name)
> {
> 	size_t qsz;
> +	struct page *page;
>
> -	do {
> -		qsz = ((1 << q->llq.max_n_shift) * dwords) << 3;
> -		q->base = dmam_alloc_coh

RE: arm-smmu-v3 high cpu usage for NVMe

2020-05-24 Thread Song Bao Hua (Barry Song)
> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
> 
> On 20/03/2020 10:41, John Garry wrote:
> 
> + Barry, Alexandru
> 
> >    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> > --
> >
> >   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> >   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >    6.35%  [kernel]  [k] _raw_spin_unlock_irq
> >    2.65%  [kernel]  [k] get_user_pages_fast
> >    2.03%  [kernel]  [k] __slab_free
> >    1.55%  [kernel]  [k] tick_nohz_idle_exit
> >    1.47%  [kernel]  [k] arm_lpae_map
> >    1.39%  [kernel]  [k] __fget
> >    1.14%  [kernel]  [k] __lock_text_start
> >    1.09%  [kernel]  [k] _raw_spin_lock
> >    1.08%  [kernel]  [k] bio_release_pages.part.42
> >    1.03%  [kernel]  [k] __sbitmap_get_word
> >    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> >    0.91%  [kernel]  [k] fput_many
> >    0.88%  [kernel]  [k] __arm_lpae_map
> >
> 
> Hi Will, Robin,
> 
> I'm just getting around to look at this topic again. Here's the current
> picture for my NVMe test:
> 
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
> 
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
> 
> Colleague Barry has similar findings for some other scenarios.

I wrote a test module that uses the parameter "ways" to simulate how busy the
SMMU is, and compared the latency under different degrees of contention:
static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}
43.   
static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		/* spawn "ways" contending kthreads running testthread; the
		 * exact kthread_run() arguments here are an assumption, as
		 * the archive mangled this part of the loop */
		tsk = kthread_run(testthread, NULL, "map_test_%d", i);
		if (IS_ERR(tsk))
			return PTR_ERR(tsk);
	}

	return 0;
}

when ways=16, a large share of the time is already spent on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

when ways=64, more than 60% of the time will be on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

here is a table for dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC with
different ways:

         whole unmap(ns)  arm_smmu_cmdq_issue_cmdlist()(ns)  wait CMD_SYNC(ns)
Ways=1   1956             1328                               883
Ways=16  8891             7474                               4000
Ways=32  22043            19519                              6879
Ways=64  60842            55895                              16746
Ways=96  101880           93649                              24429

As you can see, even with ways=1 we still need about 2us to unmap;
arm_smmu_cmdq_issue_cmdlist() takes more than 60% of the dma_unmap time, and
waiting for CMD_SYNC takes more than 60% of arm_smmu_cmdq_issue_cmdlist().

When the SMMU is very busy, dma_unmap latency can become very large due to
contention, more than 100us.


Thanks
Barry

> 
> So we tried the latest perf NMI