RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song) [mailto:song.bao@hisilicon.com]
> Sent: Saturday, October 31, 2020 10:45 PM
> To: Robin Murphy ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> 
> 
> > -Original Message-
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Saturday, October 31, 2020 4:48 AM
> > To: Song Bao Hua (Barry Song) ;
> > iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> > Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> > ; linux-kselft...@vger.kernel.org
> > Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for
> streaming
> > DMA APIs
> >
> > On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> > [...]
> > >>> +struct map_benchmark {
> > >>> +   __u64 map_nsec;
> > >>> +   __u64 unmap_nsec;
> > >>> +   __u32 threads; /* how many threads will do map/unmap in parallel
> > */
> > >>> +   __u32 seconds; /* how long the test will last */
> > >>> +   int node; /* which numa node this benchmark will run on */
> > >>> +   __u64 expansion[10];/* For future use */
> > >>> +};
> > >>
> > >> I'm no expert on userspace ABIs (and what little experience I do have
> > >> is mostly of Win32...), so hopefully someone else will comment if
> > >> there's anything of concern here. One thing I wonder is that there's
> > >> a fair likelihood of functionality evolving here over time, so might
> > >> it be appropriate to have some sort of explicit versioning parameter
> > >> for robustness?
> > >
> > > I copied that from gup_benchmark. There is no this kind of code to
> > > compare version.
> > > I believe there is a likelihood that kernel module is changed but
> > > users are still using old userspace tool, this might lead to the
> > > incompatible data structure.
> > > But not sure if it is a big problem :-)
> >
> > Yeah, like I say I don't really have a good feeling for what would be best 
> > here,
> > I'm just thinking of what I do know and wary of the potential for a "640 
> > bits
> > ought to be enough for anyone" issue ;)
> >
> > >>> +struct map_benchmark_data {
> > >>> +   struct map_benchmark bparam;
> > >>> +   struct device *dev;
> > >>> +   struct dentry  *debugfs;
> > >>> +   atomic64_t total_map_nsecs;
> > >>> +   atomic64_t total_map_loops;
> > >>> +   atomic64_t total_unmap_nsecs;
> > >>> +   atomic64_t total_unmap_loops;
> > >>> +};
> > >>> +
> > >>> +static int map_benchmark_thread(void *data) {
> > >>> +   struct page *page;
> > >>> +   dma_addr_t dma_addr;
> > >>> +   struct map_benchmark_data *map = data;
> > >>> +   int ret = 0;
> > >>> +
> > >>> +   page = alloc_page(GFP_KERNEL);
> > >>> +   if (!page)
> > >>> +   return -ENOMEM;
> > >>> +
> > >>> +   while (!kthread_should_stop())  {
> > >>> +   ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> > >>> +
> > >>> +   map_stime = ktime_get();
> > >>> +   dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> > >> DMA_BIDIRECTIONAL);
> > >>
> > >> Note that for a non-coherent device, this will give an underestimate
> > >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> > >> since the page will never be dirty in the cache (except possibly the
> > >> very first time through).
> > >
> > > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > > after we have this basic framework.
> >
> > That wasn't so much about the direction itself, just that if it's anything 
> > other
> > than FROM_DEVICE, we should probably do something to dirty the buffer by
> a
> > reasonable amount before each map. Otherwise the measured performance
> is
> > going to be unrealistic on many systems.
> 
> Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?
> 
> >
> > [...]
> > >>> +   atomic64_add((long 
> > >>> long)ktime_to_ns(ktime_sub(unmap_etime,
> > >> unmap_stime)),
> > >>> +   >total_unmap_nsecs);
> > >>> +   atomic64_inc(>total_map_loops);
> > >>> +   atomic64_inc(>total_unmap_loops);
> > >>
> > >> I think it would be worth keeping track of the variances as well - it
> > >> can be hard to tell if a reasonable-looking average is hiding
> > >> terrible worst-case behaviour.
> > >
> > > This is a sensible requirement. I believe it is better to be handled
> > > by the existing kernel tracing method.
> > >
> > > Maybe we need a histogram like:
> > > Delay   sample count
> > > 1-2us   1000  ***
> > > 2-3us   2000  ***
> > > 3-4us   100   *
> > > .
> > > This will be more precise than the maximum latency in the worst case.
> > >
> > > I'd believe this can be 

Re: [GIT PULL] dma-mapping fix for 5.10

2020-10-31 Thread pr-tracker-bot
The pull request you sent on Sat, 31 Oct 2020 10:38:23 +0100:

> git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-5.10-2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/bb3540be73ca1e483aa977d859960895fe85372d

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [GIT PULL] dma-mapping fix for 5.10

2020-10-31 Thread Linus Torvalds
On Sat, Oct 31, 2020 at 2:40 AM Christoph Hellwig  wrote:
>
> dma-mapping fix for 5.10:
>
>  - fix an integer overflow on 32-bit platforms in the new DMA range code
>(Geert Uytterhoeven)

So this is just a stylistic nit, and has no impact on this pull (which
I've done). But looking at the patch, it triggers one of my "this is
wrong" patterns.

In particular, this:

u64 dma_start = 0;
...
for (dma_start = ~0ULL; r->size; r++) {

is actually completely bogus in theory, and it's a horribly horribly
bad pattern to have.

The thing that I hate about that parttern is "~0ULL", which is simply wrong.

The correct pattern for "all bits set" is ~0. NOTHING ELSE. No extra
letters at the end.

Why? Because using an unsigned type is wrong, and will not extend the
bits up to a potentially bigger size.

So adding that "ULL" is not just three extra characters to type, it
actually _detracts_ from the code and makes it more fragile and
potentially wrong.

It so happens, that yes, in the kernel, "ull" us 64-bit, and you get
the right results. So the code _works_. But it's wrong, and it now
requires that the types match exactly (ie it would not be broken if
somebody ever were to say "I want to use use 128-bit dma addresses and
u128").

Another example is using "~0ul", which would give different results on
a 32-bit kernel and a 64-bit kernel. Again: DON'T DO THAT.

I repeat: the right thing to do for "all bits set" is just a plain ~0
or -1. Either of those are fine (technically assumes a 2's complement
machine, but let's just be honest: that's a perfectly fine assumption,
and -1 might be preferred by some because it makes that sign extension
behavior of the integer constant more obvious).

Don't try to do anything clever or anything else, because it's going
to be strictly worse.

The old code that that patch removed was "technically correct", but
just pointless, and actually shows the problem:

for (dma_start = ~(dma_addr_t)0; r->size; r++) {

the above is indeed a correct way to say "I want all bits set in a
dma_addr_t", but while correct, it is - once again - strictly inferior
to just using "~0".

Why? Because "~0" works regardless of type. IOW, exactly *because*
people used the wrong pattern for "all bits set", that patch was now
(a) bigger than necessary and (b) much more ilkely to cause bugs (ie I
could have imagined people changing just the type of the variable
without changing the initialization).

So in that tiny three-line patch there were actually several examples
of why "~0" is the right pattern to use for "all bits set". Because it
JustWorks(tm) in ways other patterns do not.

And if you have a compiler that complains about assigning -1 or ~0 to
an unsigned variable, get rid of that piece of garbage. You're almost
certainly either using some warning flag that you shouldn't be using,
or the compiler writer didn't know what they were doing.

Linus
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 1/3] dt-bindings: memory: mediatek: Convert SMI to DT schema

2020-10-31 Thread Krzysztof Kozlowski
On Fri, Oct 30, 2020 at 05:12:52PM +0800, Yong Wu wrote:
> Convert MediaTek SMI to DT schema.
> 
> CC: Fabien Parent 
> CC: Ming-Fan Chen 
> CC: Matthias Brugger 
> Signed-off-by: Yong Wu 
> ---
>  .../mediatek,smi-common.txt   |  50 ---
>  .../mediatek,smi-common.yaml  | 140 ++
>  .../memory-controllers/mediatek,smi-larb.txt  |  50 ---
>  .../memory-controllers/mediatek,smi-larb.yaml | 129 
>  4 files changed, 269 insertions(+), 100 deletions(-)
>  delete mode 100644 
> Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt

+Cc Honghui Zhang,

Your Ack is needed as you contributed descriptions to the bindings and
work is being relicensed to GPL-2.0-only OR BSD-2-Clause.


Best regards,
Krzysztof




>  create mode 100644 
> Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
>  delete mode 100644 
> Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.txt
>  create mode 100644 
> Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
> 
> diff --git 
> a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
>  
> b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
> deleted file mode 100644
> index dbafffe3f41e..
> --- 
> a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
> +++ /dev/null
> @@ -1,50 +0,0 @@
> -SMI (Smart Multimedia Interface) Common
> -
> -The hardware block diagram please check bindings/iommu/mediatek,iommu.txt
> -
> -Mediatek SMI have two generations of HW architecture, here is the list
> -which generation the SoCs use:
> -generation 1: mt2701 and mt7623.
> -generation 2: mt2712, mt6779, mt8167, mt8173 and mt8183.
> -
> -There's slight differences between the two SMI, for generation 2, the
> -register which control the iommu port is at each larb's register base. But
> -for generation 1, the register is at smi ao base(smi always on register
> -base). Besides that, the smi async clock should be prepared and enabled for
> -SMI generation 1 to transform the smi clock into emi clock domain, but that 
> is
> -not needed for SMI generation 2.
> -
> -Required properties:
> -- compatible : must be one of :
> - "mediatek,mt2701-smi-common"
> - "mediatek,mt2712-smi-common"
> - "mediatek,mt6779-smi-common"
> - "mediatek,mt7623-smi-common", "mediatek,mt2701-smi-common"
> - "mediatek,mt8167-smi-common"
> - "mediatek,mt8173-smi-common"
> - "mediatek,mt8183-smi-common"
> -- reg : the register and size of the SMI block.
> -- power-domains : a phandle to the power domain of this local arbiter.
> -- clocks : Must contain an entry for each entry in clock-names.
> -- clock-names : must contain 3 entries for generation 1 smi HW and 2 entries
> -  for generation 2 smi HW as follows:
> -  - "apb" : Advanced Peripheral Bus clock, It's the clock for setting
> - the register.
> -  - "smi" : It's the clock for transfer data and command.
> - They may be the same if both source clocks are the same.
> -  - "async" : asynchronous clock, it help transform the smi clock into the 
> emi
> -   clock domain, this clock is only needed by generation 1 smi HW.
> -  and these 2 option clocks for generation 2 smi HW:
> -  - "gals0": the path0 clock of GALS(Global Async Local Sync).
> -  - "gals1": the path1 clock of GALS(Global Async Local Sync).
> -  Here is the list which has this GALS: mt6779 and mt8183.
> -
> -Example:
> - smi_common: smi@14022000 {
> - compatible = "mediatek,mt8173-smi-common";
> - reg = <0 0x14022000 0 0x1000>;
> - power-domains = < MT8173_POWER_DOMAIN_MM>;
> - clocks = < CLK_MM_SMI_COMMON>,
> -  < CLK_MM_SMI_COMMON>;
> - clock-names = "apb", "smi";
> - };
> diff --git 
> a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
>  
> b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
> new file mode 100644
> index ..e050a0c2aed6
> --- /dev/null
> +++ 
> b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
> @@ -0,0 +1,140 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +# Copyright (c) 2020 MediaTek Inc.
> +%YAML 1.2
> +---
> +$id: 
> http://devicetree.org/schemas/memory-controllers/mediatek,smi-common.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: SMI (Smart Multimedia Interface) Common
> +
> +maintainers:
> +  - Yong Wu 
> +
> +description: |+
> +  The hardware block diagram please check bindings/iommu/mediatek,iommu.yaml
> +
> +  MediaTek SMI have two generations of HW architecture, here is the list
> +  which generation the SoCs use:
> +  generation 1: mt2701 and mt7623.
> +  generation 2: mt2712, mt6779, mt8167, mt8173 and mt8183.
> +
> +  There's slight differences between the 

RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs

2020-10-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Saturday, October 31, 2020 4:48 AM
> To: Song Bao Hua (Barry Song) ;
> iommu@lists.linux-foundation.org; h...@lst.de; m.szyprow...@samsung.com
> Cc: j...@8bytes.org; w...@kernel.org; sh...@kernel.org; Linuxarm
> ; linux-kselft...@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> [...]
> >>> +struct map_benchmark {
> >>> + __u64 map_nsec;
> >>> + __u64 unmap_nsec;
> >>> + __u32 threads; /* how many threads will do map/unmap in parallel
> */
> >>> + __u32 seconds; /* how long the test will last */
> >>> + int node; /* which numa node this benchmark will run on */
> >>> + __u64 expansion[10];/* For future use */
> >>> +};
> >>
> >> I'm no expert on userspace ABIs (and what little experience I do have
> >> is mostly of Win32...), so hopefully someone else will comment if
> >> there's anything of concern here. One thing I wonder is that there's
> >> a fair likelihood of functionality evolving here over time, so might
> >> it be appropriate to have some sort of explicit versioning parameter
> >> for robustness?
> >
> > I copied that from gup_benchmark. There is no this kind of code to
> > compare version.
> > I believe there is a likelihood that kernel module is changed but
> > users are still using old userspace tool, this might lead to the
> > incompatible data structure.
> > But not sure if it is a big problem :-)
> 
> Yeah, like I say I don't really have a good feeling for what would be best 
> here,
> I'm just thinking of what I do know and wary of the potential for a "640 bits
> ought to be enough for anyone" issue ;)
> 
> >>> +struct map_benchmark_data {
> >>> + struct map_benchmark bparam;
> >>> + struct device *dev;
> >>> + struct dentry  *debugfs;
> >>> + atomic64_t total_map_nsecs;
> >>> + atomic64_t total_map_loops;
> >>> + atomic64_t total_unmap_nsecs;
> >>> + atomic64_t total_unmap_loops;
> >>> +};
> >>> +
> >>> +static int map_benchmark_thread(void *data) {
> >>> + struct page *page;
> >>> + dma_addr_t dma_addr;
> >>> + struct map_benchmark_data *map = data;
> >>> + int ret = 0;
> >>> +
> >>> + page = alloc_page(GFP_KERNEL);
> >>> + if (!page)
> >>> + return -ENOMEM;
> >>> +
> >>> + while (!kthread_should_stop())  {
> >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> >>> +
> >>> + map_stime = ktime_get();
> >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> >> DMA_BIDIRECTIONAL);
> >>
> >> Note that for a non-coherent device, this will give an underestimate
> >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> >> since the page will never be dirty in the cache (except possibly the
> >> very first time through).
> >
> > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > after we have this basic framework.
> 
> That wasn't so much about the direction itself, just that if it's anything 
> other
> than FROM_DEVICE, we should probably do something to dirty the buffer by a
> reasonable amount before each map. Otherwise the measured performance is
> going to be unrealistic on many systems.

Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?

> 
> [...]
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
> >> unmap_stime)),
> >>> + >total_unmap_nsecs);
> >>> + atomic64_inc(>total_map_loops);
> >>> + atomic64_inc(>total_unmap_loops);
> >>
> >> I think it would be worth keeping track of the variances as well - it
> >> can be hard to tell if a reasonable-looking average is hiding
> >> terrible worst-case behaviour.
> >
> > This is a sensible requirement. I believe it is better to be handled
> > by the existing kernel tracing method.
> >
> > Maybe we need a histogram like:
> > Delay   sample count
> > 1-2us   1000  ***
> > 2-3us   2000  ***
> > 3-4us   100   *
> > .
> > This will be more precise than the maximum latency in the worst case.
> >
> > I'd believe this can be handled by:
> > tracepoint  A
> > Map
> > Tracepoint  B
> >
> > Tracepoint   C
> > Unmap
> > Tracepoint   D
> >
> > Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
> >
> > So I am planning to put this requirement into todo list and write an
> > userspace ebpf/bcc script for histogram and put in tools/ directory.
> >
> > Please give your comments on this.
> 
> Right, I wasn't suggesting trying to homebrew a full data collection system 
> here
> - I agree there are better tools for that already - just that it's basically 
> free to
> track a sum of squares alongside a sum, so that we can trivially calculate a
> useful variance (or standard
> deviation) figure alongside the mean at the end.

For this case, I am not sure if it is true. Unless we expose more data such as
min, max etc. to userspace, it 

[GIT PULL] dma-mapping fix for 5.10

2020-10-31 Thread Christoph Hellwig
The following changes since commit ed8780e3f2ecc82645342d070c6b4e530532e680:

  Merge tag 'x86-urgent-2020-10-27' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2020-10-27 14:39:29 
-0700)

are available in the Git repository at:

  git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-5.10-2

for you to fetch changes up to 48ab6d5d1f096d6fac5b59f94af0aa394115a001:

  dma-mapping: fix 32-bit overflow with CONFIG_ARM_LPAE=n (2020-10-29 16:59:34 
+0100)


dma-mapping fix for 5.10:

 - fix an integer overflow on 32-bit platforms in the new DMA range code
   (Geert Uytterhoeven)


Geert Uytterhoeven (1):
  dma-mapping: fix 32-bit overflow with CONFIG_ARM_LPAE=n

 drivers/of/device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu