[PATCH RESEND] lib/scatterlist: Fix NULL pointer deference

2021-04-06 Thread Ricardo Ribalda
When sg_alloc_table_from_pages is called with n_pages = 0, we write in a
non-allocated page. Fix it by checking early the error condition.

[7.666801] BUG: kernel NULL pointer dereference, address: 0010
[7.667487] #PF: supervisor read access in kernel mode
[7.667970] #PF: error_code(0x) - not-present page
[7.668448] PGD 0 P4D 0
[7.668690] Oops:  [#1] SMP NOPTI
[7.669037] CPU: 0 PID: 184 Comm: modprobe Not tainted 5.11.0+ #2
[7.669606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.14.0-2 04/01/2014
[7.670378] RIP: 0010:__sg_alloc_table_from_pages+0x2c5/0x4a0
[7.670924] Code: c9 01 48 c7 40 08 00 00 00 00 48 89 08 8b 47 0c 41 8d 44 
00 ff 89 47 0c 48 81 fa 00 f0 ff ff 0f 87 d4 01 00 00 49 8b 16 89 d8 <4a> 8b 74 
fd 00 4c 89 d1 44 29 f8 c1 e0 0c 44 29 d8 4c 39 d0 48 0f
[7.672643] RSP: 0018:ba1e8028fb30 EFLAGS: 00010287
[7.673133] RAX: 0001 RBX: 0001 RCX: 0002
[7.673791] RDX: 0002 RSI: ada6d0ba RDI: 9afe01fff820
[7.674448] RBP: 0010 R08: 0001 R09: 0001
[7.675100] R10:  R11:  R12: 
[7.675754] R13: f000 R14: 9afe01fff800 R15: 
[7.676409] FS:  7fb0f448f540() GS:9afe07a0() 
knlGS:
[7.677151] CS:  0010 DS:  ES:  CR0: 80050033
[7.677681] CR2: 0010 CR3: 02184001 CR4: 00370ef0
[7.678342] DR0:  DR1:  DR2: 
[7.679019] DR3:  DR6: fffe0ff0 DR7: 0400
[7.680349] Call Trace:
[7.680605]  ? device_add+0x146/0x810
[7.681021]  sg_alloc_table_from_pages+0x11/0x30
[7.681511]  vb2_dma_sg_alloc+0x162/0x280 [videobuf2_dma_sg]

Cc: sta...@vger.kernel.org
Fixes: efc42bc98058 ("scatterlist: add sg_alloc_table_from_pages function")
Signed-off-by: Ricardo Ribalda 
---
 lib/scatterlist.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a59778946404..1e83b6a3d930 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -435,6 +435,9 @@ struct scatterlist *__sg_alloc_table_from_pages(struct 
sg_table *sgt,
unsigned int added_nents = 0;
struct scatterlist *s = prv;
 
+   if (n_pages == 0)
+   return ERR_PTR(-EINVAL);
+
/*
 * The algorithm below requires max_segment to be aligned to PAGE_SIZE
 * otherwise it can overshoot.
-- 
2.31.0.208.g409f899ff0-goog

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-19 Thread Ricardo Ribalda
Hi Christoph

On Fri, Mar 19, 2021 at 2:10 PM Christoph Hellwig  wrote:
>
> On Fri, Mar 19, 2021 at 02:05:21PM +0100, Ricardo Ribalda wrote:
> > > +   uvc_urb->sgt,
> > > +   uvc_stream_dir(uvc_urb->stream));
> > > +   return usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
> > > +}
> >
> > We should have mem_flags instead of GFP_KERNEL here
> >
> >
> > Is it too late to fix it in your tree? Do I need to send a patch to fix it?
>
> As far as I know we don't have anything that has pulled in the tree yet,
> so I can just fold the fix in.

Ohh good :). Thanks! and sorry again.



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-19 Thread Ricardo Ribalda
Hi Christoph

While backporting the patch I realised of a bug.

On Sat, Mar 13, 2021 at 12:55 AM Ricardo Ribalda  wrote:
>
> On architectures where there is no coherent caching such as ARM use the
> dma_alloc_noncontiguous API and handle manually the cache flushing using
> dma_sync_sgtable().
>
> If the architechture has coherent cache, the API falls back to
> alloc_dma_pages, so we can remove the coherent caching code-path from the
> driver, making it simpler.
>
> With this patch on the affected architectures we can measure up to 20x
> performance improvement in uvc_video_copy_data_work().
>
> Eg: aarch64 with an external usb camera
>
> NON_CONTIGUOUS
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 67034480 : duration 33303
> FPS: 29.99
> URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
> header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
> latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
> decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
> raw decode speed: 9.931 Gbits/s
> raw URB handling speed: 1.025 Gbits/s
> throughput: 16.102 Mbits/s
> URB decode CPU usage 0.162600 %
>
> COHERENT
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 54683536 : duration 33302
> FPS: 29.99
> URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
> header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
> latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
> decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max 
> (uS)
> raw decode speed: 365.470 Mbits/s
> raw URB handling speed: 295.986 Mbits/s
> throughput: 13.136 Mbits/s
> URB decode CPU usage 3.594500 %
>
> In non-affected architectures we see no significant impact.
>
> Eg: x86 with an external usb camera
>
> NON_CONTIGUOUS
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 70179056 : duration 33301
> FPS: 29.99
> URB: 288901/4897 uS/qty: 58.995 avg 26.022 std 4.319 min 253.853 max (uS)
> header: 54792/4897 uS/qty: 11.189 avg 6.218 std 0.620 min 61.750 max (uS)
> latency: 236602/4897 uS/qty: 48.315 avg 24.244 std 1.764 min 240.924 max (uS)
> decode: 52298/4897 uS/qty: 10.679 avg 8.299 std 1.638 min 108.861 max (uS)
> raw decode speed: 10.796 Gbits/s
> raw URB handling speed: 1.949 Gbits/s
> throughput: 16.859 Mbits/s
> URB decode CPU usage 0.157000 %
>
> COHERENT
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 71818320 : duration 33301
> FPS: 29.99
> URB: 321021/5000 uS/qty: 64.204 avg 23.001 std 10.430 min 268.837 max (uS)
> header: 54308/5000 uS/qty: 10.861 avg 5.104 std 0.778 min 54.736 max (uS)
> latency: 268799/5000 uS/qty: 53.759 avg 21.827 std 6.095 min 255.153 max (uS)
> decode: 5/5000 uS/qty: 10.444 avg 7.137 std 1.874 min 71.103 max (uS)
> raw decode speed: 11.048 Gbits/s
> raw URB handling speed: 1.789 Gbits/s
> throughput: 17.253 Mbits/s
> URB decode CPU usage 0.156800 %
>
> Signed-off-by: Ricardo Ribalda 
> Reviewed-by: Laurent Pinchart 
> Reviewed-by: Tomasz Figa 
> Signed-off-by: Christoph Hellwig 
> ---
>
> Changelog from v3 (Thanks Laurent!):
>
> - Rename stream_dir and stream_to_dmadev to avoid collisions
> - Improve commit message
>
>  drivers/media/usb/uvc/uvc_video.c | 94 +++
>  drivers/media/usb/uvc/uvcvideo.h  |  5 +-
>  2 files changed, 73 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_video.c 
> b/drivers/media/usb/uvc/uvc_video.c
> index f2f565281e63..cdd8eb500bb7 100644
> --- a/drivers/media/usb/uvc/uvc_video.c
> +++ b/drivers/media/usb/uvc/uvc_video.c
> @@ -6,11 +6,14 @@
>   *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
>   */
>
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1096,6 +1099,29 @@ static int uvc_video_decode_start(struct uvc_streaming 
> *stream,
> return data[0];
>  }
>
> 

Re: [PATCH v4 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-15 Thread Ricardo Ribalda
On Mon, Mar 15, 2021 at 8:34 AM Christoph Hellwig  wrote:
>
> On Mon, Mar 15, 2021 at 08:30:57AM +0100, Ricardo Ribalda wrote:
> > Hi Christoph
> >
> > I guess you can merge this patch from your tree. I hope it is not too
> > late in this release cycle.
>
> The timing is perfectly fine, I haven't even started the dma-mapping tree
> for v5.13 yet.

Awesome!

Thanks Laurent for the review!


-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-15 Thread Ricardo Ribalda
Hi Christoph

I guess you can merge this patch from your tree. I hope it is not too
late in this release cycle.

Have a great week!

On Sat, Mar 13, 2021 at 12:55 AM Ricardo Ribalda  wrote:
>
> On architectures where there is no coherent caching such as ARM use the
> dma_alloc_noncontiguous API and handle manually the cache flushing using
> dma_sync_sgtable().
>
> If the architechture has coherent cache, the API falls back to
> alloc_dma_pages, so we can remove the coherent caching code-path from the
> driver, making it simpler.
>
> With this patch on the affected architectures we can measure up to 20x
> performance improvement in uvc_video_copy_data_work().
>
> Eg: aarch64 with an external usb camera
>
> NON_CONTIGUOUS
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 67034480 : duration 33303
> FPS: 29.99
> URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
> header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
> latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
> decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
> raw decode speed: 9.931 Gbits/s
> raw URB handling speed: 1.025 Gbits/s
> throughput: 16.102 Mbits/s
> URB decode CPU usage 0.162600 %
>
> COHERENT
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 54683536 : duration 33302
> FPS: 29.99
> URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
> header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
> latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
> decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max 
> (uS)
> raw decode speed: 365.470 Mbits/s
> raw URB handling speed: 295.986 Mbits/s
> throughput: 13.136 Mbits/s
> URB decode CPU usage 3.594500 %
>
> In non-affected architectures we see no significant impact.
>
> Eg: x86 with an external usb camera
>
> NON_CONTIGUOUS
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 70179056 : duration 33301
> FPS: 29.99
> URB: 288901/4897 uS/qty: 58.995 avg 26.022 std 4.319 min 253.853 max (uS)
> header: 54792/4897 uS/qty: 11.189 avg 6.218 std 0.620 min 61.750 max (uS)
> latency: 236602/4897 uS/qty: 48.315 avg 24.244 std 1.764 min 240.924 max (uS)
> decode: 52298/4897 uS/qty: 10.679 avg 8.299 std 1.638 min 108.861 max (uS)
> raw decode speed: 10.796 Gbits/s
> raw URB handling speed: 1.949 Gbits/s
> throughput: 16.859 Mbits/s
> URB decode CPU usage 0.157000 %
>
> COHERENT
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 71818320 : duration 33301
> FPS: 29.99
> URB: 321021/5000 uS/qty: 64.204 avg 23.001 std 10.430 min 268.837 max (uS)
> header: 54308/5000 uS/qty: 10.861 avg 5.104 std 0.778 min 54.736 max (uS)
> latency: 268799/5000 uS/qty: 53.759 avg 21.827 std 6.095 min 255.153 max (uS)
> decode: 5/5000 uS/qty: 10.444 avg 7.137 std 1.874 min 71.103 max (uS)
> raw decode speed: 11.048 Gbits/s
> raw URB handling speed: 1.789 Gbits/s
> throughput: 17.253 Mbits/s
> URB decode CPU usage 0.156800 %
>
> Signed-off-by: Ricardo Ribalda 
> Reviewed-by: Laurent Pinchart 
> Reviewed-by: Tomasz Figa 
> Signed-off-by: Christoph Hellwig 
> ---
>
> Changelog from v3 (Thanks Laurent!):
>
> - Rename stream_dir and stream_to_dmadev to avoid collisions
> - Improve commit message
>
>  drivers/media/usb/uvc/uvc_video.c | 94 +++
>  drivers/media/usb/uvc/uvcvideo.h  |  5 +-
>  2 files changed, 73 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_video.c 
> b/drivers/media/usb/uvc/uvc_video.c
> index f2f565281e63..cdd8eb500bb7 100644
> --- a/drivers/media/usb/uvc/uvc_video.c
> +++ b/drivers/media/usb/uvc/uvc_video.c
> @@ -6,11 +6,14 @@
>   *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
>   */
>
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1096,6 +1099,29 @@ static int uvc_video_decode_start(struct uvc_s

[PATCH v4 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-12 Thread Ricardo Ribalda
On architectures where there is no coherent caching such as ARM use the
dma_alloc_noncontiguous API and handle manually the cache flushing using
dma_sync_sgtable().

If the architechture has coherent cache, the API falls back to
alloc_dma_pages, so we can remove the coherent caching code-path from the
driver, making it simpler.

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Eg: aarch64 with an external usb camera

NON_CONTIGUOUS
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 67034480 : duration 33303
FPS: 29.99
URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
raw decode speed: 9.931 Gbits/s
raw URB handling speed: 1.025 Gbits/s
throughput: 16.102 Mbits/s
URB decode CPU usage 0.162600 %

COHERENT
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 54683536 : duration 33302
FPS: 29.99
URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max (uS)
raw decode speed: 365.470 Mbits/s
raw URB handling speed: 295.986 Mbits/s
throughput: 13.136 Mbits/s
URB decode CPU usage 3.594500 %

In non-affected architectures we see no significant impact.

Eg: x86 with an external usb camera

NON_CONTIGUOUS
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 70179056 : duration 33301
FPS: 29.99
URB: 288901/4897 uS/qty: 58.995 avg 26.022 std 4.319 min 253.853 max (uS)
header: 54792/4897 uS/qty: 11.189 avg 6.218 std 0.620 min 61.750 max (uS)
latency: 236602/4897 uS/qty: 48.315 avg 24.244 std 1.764 min 240.924 max (uS)
decode: 52298/4897 uS/qty: 10.679 avg 8.299 std 1.638 min 108.861 max (uS)
raw decode speed: 10.796 Gbits/s
raw URB handling speed: 1.949 Gbits/s
throughput: 16.859 Mbits/s
URB decode CPU usage 0.157000 %

COHERENT
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 71818320 : duration 33301
FPS: 29.99
URB: 321021/5000 uS/qty: 64.204 avg 23.001 std 10.430 min 268.837 max (uS)
header: 54308/5000 uS/qty: 10.861 avg 5.104 std 0.778 min 54.736 max (uS)
latency: 268799/5000 uS/qty: 53.759 avg 21.827 std 6.095 min 255.153 max (uS)
decode: 5/5000 uS/qty: 10.444 avg 7.137 std 1.874 min 71.103 max (uS)
raw decode speed: 11.048 Gbits/s
raw URB handling speed: 1.789 Gbits/s
throughput: 17.253 Mbits/s
URB decode CPU usage 0.156800 %

Signed-off-by: Ricardo Ribalda 
Reviewed-by: Laurent Pinchart 
Reviewed-by: Tomasz Figa 
Signed-off-by: Christoph Hellwig 
---

Changelog from v3 (Thanks Laurent!):

- Rename stream_dir and stream_to_dmadev to avoid collisions
- Improve commit message

 drivers/media/usb/uvc/uvc_video.c | 94 +++
 drivers/media/usb/uvc/uvcvideo.h  |  5 +-
 2 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index f2f565281e63..cdd8eb500bb7 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -6,11 +6,14 @@
  *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1096,6 +1099,29 @@ static int uvc_video_decode_start(struct uvc_streaming 
*stream,
return data[0];
 }
 
+static inline enum dma_data_direction uvc_stream_dir(
+   struct uvc_streaming *stream)
+{
+   if (stream->type == V4L2_BUF_TYPE_VIDEO_CAPTURE)
+   return DMA_FROM_DEVICE;
+   else
+   return DMA_TO_DEVICE;
+}
+
+static inline struct device *uvc_stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
+}
+
+static int uvc_submit_urb(struct uvc_urb *uvc_urb, gfp_t mem_flags)
+{
+   /* Sync DMA. */
+   dma_sync_sgtable_for_device(uvc_stream_to_dmadev(uvc_urb->stream),
+   uvc_urb->sgt,
+   uvc_stream_dir(uvc_urb->stream));
+   return usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
+}

Re: [PATCH v2 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-12 Thread Ricardo Ribalda Delgado
Hi Laurent

Thanks a lot for the review

On Fri, Mar 12, 2021 at 10:19 PM Laurent Pinchart
 wrote:
>
> Hi Ricardo,
>
> Thank you for the patch.
>
> On Fri, Mar 12, 2021 at 01:57:09PM +0100, Ricardo Ribalda wrote:
> > On architectures where there is no coherent caching such as ARM use the
> > dma_alloc_noncontiguous API and handle manually the cache flushing using
> > dma_sync_sgtable().
>
> You're actually switching to dma_alloc_noncontiguous() unconditionally,
> not only on those architectures :-) Do I assume correctly that
> dma_alloc_noncontiguous() will do the right thing on x86 too ?

Yes dma_alloc_noncontiguous does all the magic. It falls back to
alloc_dma_pages. Christoph has done a great job.

I tried it in my notebook and although the images are nothing but
pretty it is not the APIs fault. It is because the barbers are closed
due to the pandemic ;).

>
> > With this patch on the affected architectures we can measure up to 20x
> > performance improvement in uvc_video_copy_data_work().
>
> Have you measured performances on x86 to ensure there's no regression ?

Yes in the early stages. I am adding the latest x86 measurements in
the commit message in v3 to make it more clear.

>
> > Eg: aarch64 with an external usb camera
> >
> > NON_CONTIGUOUS
> > frames:  999
> > packets: 999
> > empty:   0 (0 %)
> > errors:  0
> > invalid: 0
> > pts: 0 early, 0 initial, 999 ok
> > scr: 0 count ok, 0 diff ok
> > sof: 2048 <= sof <= 0, freq 0.000 kHz
> > bytes 67034480 : duration 33303
> > FPS: 29.99
> > URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
> > header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
> > latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max 
> > (uS)
> > decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
> > raw decode speed: 9.931 Gbits/s
> > raw URB handling speed: 1.025 Gbits/s
> > throughput: 16.102 Mbits/s
> > URB decode CPU usage 0.162600 %
> >
> > COHERENT
> > frames:  999
> > packets: 999
> > empty:   0 (0 %)
> > errors:  0
> > invalid: 0
> > pts: 0 early, 0 initial, 999 ok
> > scr: 0 count ok, 0 diff ok
> > sof: 2048 <= sof <= 0, freq 0.000 kHz
> > bytes 54683536 : duration 33302
> > FPS: 29.99
> > URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max 
> > (uS)
> > header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
> > latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max 
> > (uS)
> > decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max 
> > (uS)
> > raw decode speed: 365.470 Mbits/s
> > raw URB handling speed: 295.986 Mbits/s
> > throughput: 13.136 Mbits/s
> > URB decode CPU usage 3.594500 %
> >
> > Signed-off-by: Ricardo Ribalda 
> > Reviewed-by: Tomasz Figa 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >
> > Changelog from v2: (Thanks Laurent)
> >
> > - Fix typos
> > - Use the right dma direction if not capturing
> > - Clear sgt during free
> >
> >  drivers/media/usb/uvc/uvc_video.c | 92 +++
> >  drivers/media/usb/uvc/uvcvideo.h  |  5 +-
> >  2 files changed, 74 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/media/usb/uvc/uvc_video.c 
> > b/drivers/media/usb/uvc/uvc_video.c
> > index f2f565281e63..8e60f81e2257 100644
> > --- a/drivers/media/usb/uvc/uvc_video.c
> > +++ b/drivers/media/usb/uvc/uvc_video.c
> > @@ -6,11 +6,14 @@
> >   *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
> >   */
> >
> > +#include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -1096,6 +1099,34 @@ static int uvc_video_decode_start(struct 
> > uvc_streaming *stream,
> >   return data[0];
> >  }
> >
> > +static inline enum dma_data_direction stream_dir(struct uvc_streaming 
> > *stream)
> > +{
> > + if (stream->type == V4L2_BUF_TYPE_VIDEO_CAPTURE)
> > + return DMA_FROM_DEVICE;
> > + else
> > + return DMA_TO_DEVICE;
> > +}
> > +
> > +static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
> > +{
> > + return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
> > +}
> > +
> > +static void uvc_urb_dma_sync

[PATCH v3 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-12 Thread Ricardo Ribalda
On architectures where there is no coherent caching such as ARM use the
dma_alloc_noncontiguous API and handle manually the cache flushing using
dma_sync_sgtable().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Eg: aarch64 with an external usb camera

NON_CONTIGUOUS
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 67034480 : duration 33303
FPS: 29.99
URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
raw decode speed: 9.931 Gbits/s
raw URB handling speed: 1.025 Gbits/s
throughput: 16.102 Mbits/s
URB decode CPU usage 0.162600 %

COHERENT
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 54683536 : duration 33302
FPS: 29.99
URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max (uS)
raw decode speed: 365.470 Mbits/s
raw URB handling speed: 295.986 Mbits/s
throughput: 13.136 Mbits/s
URB decode CPU usage 3.594500 %

In non-affected architectures we see no significant impact.

Eg: x86 with an external usb camera

NON_CONTIGUOUS
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 70179056 : duration 33301
FPS: 29.99
URB: 288901/4897 uS/qty: 58.995 avg 26.022 std 4.319 min 253.853 max (uS)
header: 54792/4897 uS/qty: 11.189 avg 6.218 std 0.620 min 61.750 max (uS)
latency: 236602/4897 uS/qty: 48.315 avg 24.244 std 1.764 min 240.924 max (uS)
decode: 52298/4897 uS/qty: 10.679 avg 8.299 std 1.638 min 108.861 max (uS)
raw decode speed: 10.796 Gbits/s
raw URB handling speed: 1.949 Gbits/s
throughput: 16.859 Mbits/s
URB decode CPU usage 0.157000 %

COHERENT
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 71818320 : duration 33301
FPS: 29.99
URB: 321021/5000 uS/qty: 64.204 avg 23.001 std 10.430 min 268.837 max (uS)
header: 54308/5000 uS/qty: 10.861 avg 5.104 std 0.778 min 54.736 max (uS)
latency: 268799/5000 uS/qty: 53.759 avg 21.827 std 6.095 min 255.153 max (uS)
decode: 5/5000 uS/qty: 10.444 avg 7.137 std 1.874 min 71.103 max (uS)
raw decode speed: 11.048 Gbits/s
raw URB handling speed: 1.789 Gbits/s
throughput: 17.253 Mbits/s
URB decode CPU usage 0.156800 %

Signed-off-by: Ricardo Ribalda 
Reviewed-by: Laurent Pinchart 
Reviewed-by: Tomasz Figa 
Signed-off-by: Christoph Hellwig 
---

Changelog from v2 (Thanks Laurent!):

- Replace uvc_urb_dma_sync with uvc_submit_urb
- Add x86 stats in commit message

 drivers/media/usb/uvc/uvc_video.c | 92 ++-
 drivers/media/usb/uvc/uvcvideo.h  |  5 +-
 2 files changed, 71 insertions(+), 26 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index f2f565281e63..37ee39412b83 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -6,11 +6,14 @@
  *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1096,6 +1099,28 @@ static int uvc_video_decode_start(struct uvc_streaming 
*stream,
return data[0];
 }
 
+static inline enum dma_data_direction stream_dir(struct uvc_streaming *stream)
+{
+   if (stream->type == V4L2_BUF_TYPE_VIDEO_CAPTURE)
+   return DMA_FROM_DEVICE;
+   else
+   return DMA_TO_DEVICE;
+}
+
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
+}
+
+static int uvc_submit_urb(struct uvc_urb *uvc_urb, gfp_t mem_flags)
+{
+   /* Sync DMA. */
+   dma_sync_sgtable_for_device(stream_to_dmadev(uvc_urb->stream),
+   uvc_urb->sgt,
+   stream_dir(uvc_urb->stream));
+   return usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
+}
+
 /*
  * uvc_video_decode_data_work: Asynchronous memcpy processing
  *
@@ -1117,7 +1142,7 @@ static void uvc_video_copy_data_work(struct work_struct 
*work)
uvc_queue_buffer_release(op->buf);

[PATCH v2 6/6] media: uvcvideo: Use dma_alloc_noncontiguous API

2021-03-12 Thread Ricardo Ribalda
On architectures where there is no coherent caching such as ARM use the
dma_alloc_noncontiguous API and handle manually the cache flushing using
dma_sync_sgtable().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Eg: aarch64 with an external usb camera

NON_CONTIGUOUS
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 67034480 : duration 33303
FPS: 29.99
URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
raw decode speed: 9.931 Gbits/s
raw URB handling speed: 1.025 Gbits/s
throughput: 16.102 Mbits/s
URB decode CPU usage 0.162600 %

COHERENT
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 54683536 : duration 33302
FPS: 29.99
URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max (uS)
raw decode speed: 365.470 Mbits/s
raw URB handling speed: 295.986 Mbits/s
throughput: 13.136 Mbits/s
URB decode CPU usage 3.594500 %

Signed-off-by: Ricardo Ribalda 
Reviewed-by: Tomasz Figa 
Signed-off-by: Christoph Hellwig 
---

Changelog from v2: (Thanks Laurent)

- Fix typos
- Use the right dma direction if not capturing
- Clear sgt during free

 drivers/media/usb/uvc/uvc_video.c | 92 +++
 drivers/media/usb/uvc/uvcvideo.h  |  5 +-
 2 files changed, 74 insertions(+), 23 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index f2f565281e63..8e60f81e2257 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -6,11 +6,14 @@
  *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1096,6 +1099,34 @@ static int uvc_video_decode_start(struct uvc_streaming 
*stream,
return data[0];
 }
 
+static inline enum dma_data_direction stream_dir(struct uvc_streaming *stream)
+{
+   if (stream->type == V4L2_BUF_TYPE_VIDEO_CAPTURE)
+   return DMA_FROM_DEVICE;
+   else
+   return DMA_TO_DEVICE;
+}
+
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
+}
+
+static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
+{
+   struct device *dma_dev = stream_to_dmadev(uvc_urb->stream);
+
+   if (for_device) {
+   dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
+   stream_dir(uvc_urb->stream));
+   } else {
+   dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
+stream_dir(uvc_urb->stream));
+   invalidate_kernel_vmap_range(uvc_urb->buffer,
+uvc_urb->stream->urb_size);
+   }
+}
+
 /*
  * uvc_video_decode_data_work: Asynchronous memcpy processing
  *
@@ -1117,6 +1148,8 @@ static void uvc_video_copy_data_work(struct work_struct 
*work)
uvc_queue_buffer_release(op->buf);
}
 
+   uvc_urb_dma_sync(uvc_urb, true);
+
ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
if (ret < 0)
dev_err(_urb->stream->intf->dev,
@@ -1541,10 +1574,12 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   uvc_urb_dma_sync(uvc_urb, false);
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
if (!uvc_urb->async_operations) {
+   uvc_urb_dma_sync(uvc_urb, true);
ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
if (ret < 0)
dev_err(>intf->dev,
@@ -1560,24 +1595,49 @@ static void uvc_video_complete(struct urb *urb)
  */
 static void uvc_free_urb_buffers(struct uvc_streaming *stream)
 {
+   struct device *dma_dev = stream_to_dmadev(stream);
struct uvc_urb *uvc_urb;
 
for_each_uvc_urb(uvc_urb, stream) {
if (!uvc_urb->buffer)
conti

Re: add a new dma_alloc_noncontiguous API v3

2021-03-11 Thread Ricardo Ribalda
Hi Christoph

I tried to run it in an arm device and it worked fine.


On Thu, Mar 11, 2021 at 5:52 PM Christoph Hellwig  wrote:
>
> Any comments?  Especially on the uvcvideo conversion?
>
> On Mon, Mar 01, 2021 at 09:52:30AM +0100, Christoph Hellwig wrote:
> > Hi all,
> >
> > this series adds the new noncontiguous DMA allocation API requested by
> > various media driver maintainers.
> >
> > Changes since v2:
> >  - rebased to Linux 5.12-rc1
> >  - dropped one already merged patch
> >  - pass an attrs argument to dma_alloc_noncontigous
> >  - clarify the dma_vmap_noncontiguous documentation a bit
> >  - fix double assignments in uvcvideo
> >
> > Changes since v1:
> >  - document that flush_kernel_vmap_range and invalidate_kernel_vmap_range
> >must be called once an allocation is mapped into KVA
> >  - add dma-debug support
> >  - remove the separate dma_handle argument, and instead create fully formed
> >DMA mapped scatterlists
> >  - use a directional allocation in uvcvideo
> >  - call invalidate_kernel_vmap_range from uvcvideo
> > ___
> > iommu mailing list
> > iommu@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
> ---end quoted text---



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: add a new dma_alloc_noncontiguous API v2

2021-02-11 Thread Ricardo Ribalda
HI Christoph

On Thu, Feb 11, 2021 at 2:06 PM Christoph Hellwig  wrote:
>
> On Thu, Feb 11, 2021 at 10:08:18AM +0100, Ricardo Ribalda wrote:
> > Hi Christoph
> >
> > What are your merge plans for the uvc change?
> > http://git.infradead.org/users/hch/dma-mapping.git/commit/3dc47131f8aacc2093f68a9971d24c754e435520
> >
> > Are you going to remove the patch on your Merge request and then send
> > it for review to Laurent? or merge it through your tree with a S-o-B
> > him?
>
> I though I had all the ACKs to queue it up.  Is that not what was
> intended?  Queueing up the API without a user is generally a bad idea.
>

I am pretty sure we need the ack from Laurent. He maintains uvc

@Laurent Pinchart what are your thoughts?

Thanks!


-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: add a new dma_alloc_noncontiguous API v2

2021-02-11 Thread Ricardo Ribalda
Hi Christoph

What are your merge plans for the uvc change?
http://git.infradead.org/users/hch/dma-mapping.git/commit/3dc47131f8aacc2093f68a9971d24c754e435520

Are you going to remove the patch on your Merge request and then send
it for review to Laurent? or merge it through your tree with a S-o-B
him?

Thanks

On Tue, Feb 9, 2021 at 6:02 PM Christoph Hellwig  wrote:
>
> On Tue, Feb 09, 2021 at 03:46:13PM +0100, Ricardo Ribalda wrote:
> > Hi Christoph
> >
> > I have tested it in both arm and x86, since there are not significant
> > changes with the previous version I did not do a performance test.
>
> I'll take this as a Tested-by.



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: add a new dma_alloc_noncontiguous API v2

2021-02-09 Thread Ricardo Ribalda
Hi Christoph

I have tested it in both arm and x86, since there are not significant
changes with the previous version I did not do a performance test.

Thanks!


On Tue, Feb 9, 2021 at 9:29 AM Ricardo Ribalda  wrote:
>
> Hi Christoph
>
> On Tue, Feb 9, 2021 at 9:22 AM Christoph Hellwig  wrote:
> >
> > On Mon, Feb 08, 2021 at 08:33:50PM +0900, Tomasz Figa wrote:
> > > Sorry for the delay. The whole series looks very good to me. Thanks a lot.
> > >
> > > Reviewed-by: Tomasz Figa 
> >
> > Thanks.
> >
> > Ricardo, do the uvcvideo changes look good to you?  I'd like to queue
> > the series up for this merge window.

Tested-by: Ricardo Ribalda 

>
> Let me test them in real hardware today.
>
> Thanks!
>
>
> --
> Ricardo Ribalda



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: add a new dma_alloc_noncontiguous API v2

2021-02-09 Thread Ricardo Ribalda
Hi Christoph

On Tue, Feb 9, 2021 at 9:22 AM Christoph Hellwig  wrote:
>
> On Mon, Feb 08, 2021 at 08:33:50PM +0900, Tomasz Figa wrote:
> > Sorry for the delay. The whole series looks very good to me. Thanks a lot.
> >
> > Reviewed-by: Tomasz Figa 
>
> Thanks.
>
> Ricardo, do the uvcvideo changes look good to you?  I'd like to queue
> the series up for this merge window.

Let me test them in real hardware today.

Thanks!


-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 6/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-28 Thread Ricardo Ribalda
On architectures where the is no coherent caching such as ARM use the
dma_alloc_noncontiguos API and handle manually the cache flushing using
dma_sync_sgtable().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Eg: aarch64 with an external usb camera

NON_CONTIGUOUS
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 67034480 : duration 33303
FPS: 29.99
URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
raw decode speed: 9.931 Gbits/s
raw URB handling speed: 1.025 Gbits/s
throughput: 16.102 Mbits/s
URB decode CPU usage 0.162600 %

COHERENT
frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 54683536 : duration 33302
FPS: 29.99
URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max (uS)
raw decode speed: 365.470 Mbits/s
raw URB handling speed: 295.986 Mbits/s
throughput: 13.136 Mbits/s
URB decode CPU usage 3.594500 %

Signed-off-by: Ricardo Ribalda 
Signed-off-by: Christoph Hellwig 
---

v2: 
- Replace DMA_BIDIRECTIONAL with DMA_FROM_DEVICE
- Add invalidate_kernel_vmap_range
- Tested in an X86 notebook and an ARM Chromebook

 drivers/media/usb/uvc/uvc_video.c | 80 ++-
 drivers/media/usb/uvc/uvcvideo.h  |  4 +-
 2 files changed, 61 insertions(+), 23 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index a6a441d92b94..378fb5f29920 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -6,11 +6,13 @@
  *  Laurent Pinchart (laurent.pinch...@ideasonboard.com)
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1097,6 +1099,26 @@ static int uvc_video_decode_start(struct uvc_streaming 
*stream,
return data[0];
 }
 
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
+}
+
+static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
+{
+   struct device *dma_dev = dma_dev = stream_to_dmadev(uvc_urb->stream);
+
+   if (for_device) {
+   dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
+   DMA_FROM_DEVICE);
+   } else {
+   dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
+DMA_FROM_DEVICE);
+   invalidate_kernel_vmap_range(uvc_urb->buffer,
+uvc_urb->stream->urb_size);
+   }
+}
+
 /*
  * uvc_video_decode_data_work: Asynchronous memcpy processing
  *
@@ -1118,6 +1140,8 @@ static void uvc_video_copy_data_work(struct work_struct 
*work)
uvc_queue_buffer_release(op->buf);
}
 
+   uvc_urb_dma_sync(uvc_urb, true);
+
ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
if (ret < 0)
uvc_printk(KERN_ERR, "Failed to resubmit video URB (%d).\n",
@@ -1539,10 +1563,12 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   uvc_urb_dma_sync(uvc_urb, false);
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
if (!uvc_urb->async_operations) {
+   uvc_urb_dma_sync(uvc_urb, true);
ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
if (ret < 0)
uvc_printk(KERN_ERR,
@@ -1559,24 +1585,47 @@ static void uvc_video_complete(struct urb *urb)
  */
 static void uvc_free_urb_buffers(struct uvc_streaming *stream)
 {
+   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
struct uvc_urb *uvc_urb;
 
for_each_uvc_urb(uvc_urb, stream) {
if (!uvc_urb->buffer)
continue;
 
-#ifndef CONFIG_DMA_NONCOHERENT
-   usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
-#else
-   kfree(

Re: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-28 Thread Ricardo Ribalda
HI Christoph

Thanks for your comments

On Thu, Jan 28, 2021 at 4:09 PM Christoph Hellwig  wrote:
>
> I just included this patch as-is, but here are a few comments:
>
> On Thu, Jan 28, 2021 at 03:58:37PM +0100, Christoph Hellwig wrote:
> > +static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
> > +{
> > + struct device *dma_dev = dma_dev = stream_to_dmadev(uvc_urb->stream);
> > +
> > + if (for_device)
> > + dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
> > + DMA_FROM_DEVICE);
> > + else
> > + dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
> > +  DMA_FROM_DEVICE);
> > +}
>
> Given that we vmap the addresses this also needs
> flush_kernel_vmap_range / invalidate_kernel_vmap_range calls for
> VIVT architectures.

We only read from the device to the cpu. Then can we run only
invalidate_kernel_vmap_range() ?

something like ?
else {
  dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt, DMA_FROM_DEVICE);
   invalidate_kernel_vmap_range(uvc_urb->buffer,
uvc_urb->stream->urb_size );
}

Thanks!






--
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 6/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-28 Thread Ricardo Ribalda
On Thu, Jan 28, 2021 at 4:03 PM Christoph Hellwig  wrote:
>
> From: Ricardo Ribalda 
>
> On architectures where the is no coherent caching such as ARM use the
> dma_alloc_noncontiguos API and handle manually the cache flushing using
> dma_sync_sgtable().
>
> With this patch on the affected architectures we can measure up to 20x
> performance improvement in uvc_video_copy_data_work().
>
> Eg: aarch64 with an external usb camera
>
> NON_CONTIGUOUS
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 67034480 : duration 33303
> FPS: 29.99
> URB: 523446/4993 uS/qty: 104.836 avg 132.532 std 13.230 min 831.094 max (uS)
> header: 76564/4993 uS/qty: 15.334 avg 15.229 std 3.438 min 186.875 max (uS)
> latency: 468945/4992 uS/qty: 93.939 avg 132.577 std 9.531 min 824.010 max (uS)
> decode: 54161/4993 uS/qty: 10.847 avg 6.313 std 1.614 min 111.458 max (uS)
> raw decode speed: 9.931 Gbits/s
> raw URB handling speed: 1.025 Gbits/s
> throughput: 16.102 Mbits/s
> URB decode CPU usage 0.162600 %
>
> COHERENT
> frames:  999
> packets: 999
> empty:   0 (0 %)
> errors:  0
> invalid: 0
> pts: 0 early, 0 initial, 999 ok
> scr: 0 count ok, 0 diff ok
> sof: 2048 <= sof <= 0, freq 0.000 kHz
> bytes 54683536 : duration 33302
> FPS: 29.99
> URB: 1478135/4000 uS/qty: 369.533 avg 390.357 std 22.968 min 3337.865 max (uS)
> header: 79761/4000 uS/qty: 19.940 avg 18.495 std 1.875 min 336.719 max (uS)
> latency: 281077/4000 uS/qty: 70.269 avg 83.102 std 5.104 min 735.000 max (uS)
> decode: 1197057/4000 uS/qty: 299.264 avg 318.080 std 1.615 min 2806.667 max 
> (uS)
> raw decode speed: 365.470 Mbits/s
> raw URB handling speed: 295.986 Mbits/s
> throughput: 13.136 Mbits/s
> URB decode CPU usage 3.594500 %
>
> Signed-off-by: Ricardo Ribalda 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/media/usb/uvc/uvc_video.c | 76 ++-
>  drivers/media/usb/uvc/uvcvideo.h  |  4 +-
>  2 files changed, 57 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_video.c 
> b/drivers/media/usb/uvc/uvc_video.c
> index a6a441d92b9488..9c051b55dc7bc6 100644
> --- a/drivers/media/usb/uvc/uvc_video.c
> +++ b/drivers/media/usb/uvc/uvc_video.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1097,6 +1098,23 @@ static int uvc_video_decode_start(struct uvc_streaming 
> *stream,
> return data[0];
>  }
>
> +static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
> +{
> +   return bus_to_hcd(stream->dev->udev->bus)->self.sysdev;
> +}
> +
> +static void uvc_urb_dma_sync(struct uvc_urb *uvc_urb, bool for_device)
> +{
> +   struct device *dma_dev = dma_dev = stream_to_dmadev(uvc_urb->stream);
> +
> +   if (for_device)
> +   dma_sync_sgtable_for_device(dma_dev, uvc_urb->sgt,
> +   DMA_FROM_DEVICE);
> +   else
> +   dma_sync_sgtable_for_cpu(dma_dev, uvc_urb->sgt,
> +DMA_FROM_DEVICE);
> +}
> +
>  /*
>   * uvc_video_decode_data_work: Asynchronous memcpy processing
>   *
> @@ -1118,6 +1136,8 @@ static void uvc_video_copy_data_work(struct work_struct 
> *work)
> uvc_queue_buffer_release(op->buf);
> }
>
> +   uvc_urb_dma_sync(uvc_urb, true);
> +
> ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
> if (ret < 0)
> uvc_printk(KERN_ERR, "Failed to resubmit video URB (%d).\n",
> @@ -1539,10 +1559,12 @@ static void uvc_video_complete(struct urb *urb)
>  * Process the URB headers, and optionally queue expensive memcpy 
> tasks
>  * to be deferred to a work queue.
>  */
> +   uvc_urb_dma_sync(uvc_urb, false);
> stream->decode(uvc_urb, buf, buf_meta);
>
> /* If no async work is needed, resubmit the URB immediately. */
> if (!uvc_urb->async_operations) {
> +   uvc_urb_dma_sync(uvc_urb, true);
> ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
> if (ret < 0)
> uvc_printk(KERN_ERR,
> @@ -1559,24 +1581,47 @@ static void uvc_video_complete(struct urb *urb)
>   */
>  static void uvc_free_urb_buffers(struct uvc_streaming *stream)
>  {
> +   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
> struct uvc_urb *uvc_urb;
>
> for_each_uvc_urb(uvc_urb, stream

Re: [PATCH v3 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-27 Thread Ricardo Ribalda
Hi Christoph

On Wed, Jan 27, 2021 at 4:56 PM . Christoph Hellwig  wrote:
>
> On Wed, Jan 27, 2021 at 12:29:08AM +0100, Ricardo Ribalda wrote:
> > - Is there any platform where dma_alloc_noncontiguos can fail?
> > This is, !ops->alloc_noncontiguous and !dev->coherent_dma_mask
> > If yes then we need to add a function to let the driver know in
> > advance that it has to use the coherent allocator (usb_alloc_coherent
> > for uvc)
>
> dev->coherent_dma_mask is set by the driver.  So the only reason why
> dma_alloc_noncontiguos will fail is because is because it can't
> allocate any memory.
>
> > - In dma_alloc_noncontiguos, on the dma_alloc_pages fallback. If we
> > have a device where the dma happens in only one direction, could not
> > get more performance with DMA_FROM/TO_DEVICE instead of
> > DMA_BIDIRECTIONAL ?
>
> Yes, we could probably do that.
>
> >
> >
> > Then I have tried to use the API, and I have encountered a problem: on
> > uvcvideo the device passed to the memory allocator is different for
> > DMA_PAGES and NON_CONTIGUOUS:
> > https://github.com/ribalda/linux/blob/042cd497739f71c8d4a83a67ee970369e2baca4a/drivers/media/usb/uvc/uvc_video.c#L1236
> >
> > I need to dig a bit tomorrow to figure out why this is, I have
> > hardware to test both paths, so it should not be too difficult.
>
> I always found the USB dma alloc API a little weird, but we might have
> to follow the scheme of the usb coherent wrappers there.

I have used the current API here:

https://git.kernel.org/pub/scm/linux/kernel/git/ribalda/linux.git/log/?h=uvc-noncontiguous

And I think the result is very clean. Great work!

I have tested it in X86 and in arm64, with similar performance as the
previous patchset.

Maybe you want to cherry pick that commit into your series I can also
send the patch to the list for review if you prefer so.

At least in 5.11 rc5 I the same dmadev worked in arm64 and x86.

Best regards!


-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-26 Thread Ricardo Ribalda
Hi Christoph

Thanks for the series!

I have a couple of questions:

- Is there any platform where dma_alloc_noncontiguos can fail?
This is, !ops->alloc_noncontiguous and !dev->coherent_dma_mask
If yes then we need to add a function to let the driver know in
advance that it has to use the coherent allocator (usb_alloc_coherent
for uvc)

- In dma_alloc_noncontiguos, on the dma_alloc_pages fallback. If we
have a device where the dma happens in only one direction, could not
get more performance with DMA_FROM/TO_DEVICE instead of
DMA_BIDIRECTIONAL ?


Then I have tried to use the API, and I have encountered a problem: on
uvcvideo the device passed to the memory allocator is different for
DMA_PAGES and NON_CONTIGUOUS:
https://github.com/ribalda/linux/blob/042cd497739f71c8d4a83a67ee970369e2baca4a/drivers/media/usb/uvc/uvc_video.c#L1236

I need to dig a bit tomorrow to figure out why this is, I have
hardware to test both paths, so it should not be too difficult.


Thanks again






On Tue, Jan 26, 2021 at 6:07 PM . Christoph Hellwig  wrote:
>
> Please take a quick look at this branch:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma_alloc_noncontiguous
>
> Warning: hot off the press, and only with the v4l conversion as that
> seemed at little easier than uvcvideo.



--
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-15 Thread Ricardo Ribalda
On Mon, Jan 11, 2021 at 9:36 AM . Christoph Hellwig  wrote:
>
> On Thu, Jan 07, 2021 at 03:14:08PM +0100, Ricardo Ribalda wrote:
> > > So I think we do want these branches for coherent vs non-coherent as they
> > > have very different semantics and I do not think that hiding them under
> > > the same API helps people to understand those vastly different semantics.
> > >
> > > OTOH we should look into a fallback for DMA API instances that do not
> > > support the discontigous allocations.
> > >
> > > I think between your comments and those from Ricardo I have a good idea
> > > for a somewhat updated API.  I'll try to post something in the next days.
> >
> > Did you have time to look into this?
> >
> > No hurry, I just want to make sure that I didn't miss anything ;)
>
> Haven't managed to get to it, sorry.

No worries!, is there something we can do to help you with this?

>
> >
> > Best regards!
> >
> >
> >
> > --
> > Ricardo Ribalda
> ---end quoted text---



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2021-01-07 Thread Ricardo Ribalda
Hi Christoph

Happy new year!

On Wed, Dec 9, 2020 at 12:16 PM . Christoph Hellwig  wrote:
>
> On Tue, Dec 08, 2020 at 04:13:20PM +0900, Sergey Senozhatsky wrote:
> > On (20/12/08 13:54), Tomasz Figa wrote:
> > >
> > > In any case, Sergey is going to share a preliminary patch on how the
> > > current API would be used in the V4L2 videobuf2 framework. That should
> > > give us more input on how such a helper could look.
> >
> > HUGE apologies for the previous screw up! I replied in the
> > gmail web-interface and that did not work out as expected
> > (at all, big times).
>
> Actually the previous mail was a mime multipart one, and the plain text
> version displayed just fine here.  My the gmail engineers finally learned
> something after all.
>
> > Another thing to notice is that the new API requires us to have two 
> > execution branches
> > in allocators - one for the current API; and one for the new API (if it's 
> > supported and
> > if user-space requested non-coherent allocation).
>
> So I think we do want these branches for coherent vs non-coherent as they
> have very different semantics and I do not think that hiding them under
> the same API helps people to understand those vastly different semantics.
>
> OTOH we should look into a fallback for DMA API instances that do not
> support the discontigous allocations.
>
> I think between your comments and those from Ricardo I have a good idea
> for a somewhat updated API.  I'll try to post something in the next days.

Did you have time to look into this?

No hurry, I just want to make sure that I didn't miss anything ;)

Best regards!



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-30 Thread Ricardo Ribalda
Hi Christoph

On Mon, Nov 30, 2020 at 9:34 AM Christoph Hellwig  wrote:
>
> > +#ifndef CONFIG_DMA_NONCOHERENT
>
> I think you need to drop this ifdef.  This code should work just fine
> on noncoherent mips and sh platforms.
>
> > + uvc_urb->pages = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
> > +  _urb->dma,
> > +  gfp_flags | __GFP_NOWARN, 0);
> > + if (!uvc_urb->pages)
> > + return false;
> > +
> > + uvc_urb->buffer = vmap(uvc_urb->pages,
> > +PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT,
> > +VM_DMA_COHERENT, PAGE_KERNEL);
> > + if (!uvc_urb->buffer) {
> > + dma_free_noncontiguous(dma_dev, stream->urb_size,
> > +uvc_urb->pages, uvc_urb->dma);
> > + return false;
> > + }
> > +
> > + if (sg_alloc_table_from_pages(_urb->sgt, uvc_urb->pages,
> > + PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT, 0,
> > + stream->urb_size, GFP_KERNEL)) {
> > + vunmap(uvc_urb->buffer);
> > + dma_free_noncontiguous(dma_dev, stream->urb_size,
> > +uvc_urb->pages, uvc_urb->dma);
> > + return false;
> > + }
> > +
> > + return true;
> > +}
>
> I wonder if we should lift this into a helper.  On the one hand I had
> proliferating struct scatterlist usage, on the other hand it is all over
> the media and drm code anyway, and duplicating this doesn't help anyone.
>
> Possibly including the fallback to the coherent allocating.

Probably Sergey has best opinion of this than mine. I only had to look
into one driver, he has been working with the vb2, which uses the API
much more.



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 4/4] TEST-ONLY: media: uvcvideo: Add statistics for measuring performance

2020-11-30 Thread Ricardo Ribalda
From: Shik Chen 

Majorly based on [1], with the following tweaks:

* Use div_u64 for u64 divisions
* Calculate standard deviation
* Fix an uninitialized |min| field for header
* Apply clang-format

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/kbingham/rcar.git/commit/?h=uvc/async-ml=cebbd1b629bbe5f856ec5dc7591478c003f5a944

Signed-off-by: Shik Chen 
---
 drivers/media/usb/uvc/uvc_video.c | 163 +-
 drivers/media/usb/uvc/uvcvideo.h  |  21 
 2 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index 59ade244abfb..34a27191ef0d 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -906,12 +906,61 @@ static void uvc_video_stats_update(struct uvc_streaming 
*stream)
memset(>stats.frame, 0, sizeof(stream->stats.frame));
 }
 
+size_t uvc_video_dump_time_stats(char *buf, size_t size,
+struct uvc_stats_time *stat, const char *pfx)
+{
+   unsigned int avg = 0;
+   unsigned int std = 0;
+
+   if (stat->qty) {
+   avg = div_u64(stat->duration, stat->qty);
+   std = int_sqrt64(div_u64(stat->duration2, stat->qty) -
+avg * avg);
+   }
+
+   /* Stat durations are in nanoseconds, we present in micro-seconds */
+   return scnprintf(
+   buf, size,
+   "%s: %llu/%u uS/qty: %u.%03u avg %u.%03u std %u.%03u min 
%u.%03u max (uS)\n",
+   pfx, div_u64(stat->duration, 1000), stat->qty, avg / 1000,
+   avg % 1000, std / 1000, std % 1000, stat->min / 1000,
+   stat->min % 1000, stat->max / 1000, stat->max % 1000);
+}
+
+size_t uvc_video_dump_speed(char *buf, size_t size, const char *pfx, u64 bytes,
+   u64 milliseconds)
+{
+   unsigned int rate = 0;
+   bool gbit = false;
+
+   if (milliseconds)
+   rate = div_u64(bytes * 8, milliseconds);
+
+   if (rate >= 100) {
+   gbit = true;
+   rate /= 1000;
+   }
+
+   /*
+* bits/milliseconds == kilobits/seconds,
+* presented here as Mbits/s (or Gbit/s) with 3 decimal places
+*/
+   return scnprintf(buf, size, "%s: %d.%03d %sbits/s\n", pfx, rate / 1000,
+rate % 1000, gbit ? "G" : "M");
+}
+
 size_t uvc_video_stats_dump(struct uvc_streaming *stream, char *buf,
size_t size)
 {
+   u64 bytes = stream->stats.stream.bytes; /* Single sample */
+   unsigned int empty_ratio = 0;
unsigned int scr_sof_freq;
unsigned int duration;
+   unsigned int fps = 0;
size_t count = 0;
+   u64 cpu = 0;
+   u64 cpu_q = 0;
+   u32 cpu_r = 0;
 
/* Compute the SCR.SOF frequency estimate. At the nominal 1kHz SOF
 * frequency this will not overflow before more than 1h.
@@ -924,12 +973,19 @@ size_t uvc_video_stats_dump(struct uvc_streaming *stream, 
char *buf,
else
scr_sof_freq = 0;
 
+   if (stream->stats.stream.nb_packets)
+   empty_ratio = stream->stats.stream.nb_empty * 100 /
+ stream->stats.stream.nb_packets;
+
count += scnprintf(buf + count, size - count,
-  "frames:  %u\npackets: %u\nempty:   %u\n"
-  "errors:  %u\ninvalid: %u\n",
+  "frames:  %u\n"
+  "packets: %u\n"
+  "empty:   %u (%u %%)\n"
+  "errors:  %u\n"
+  "invalid: %u\n",
   stream->stats.stream.nb_frames,
   stream->stats.stream.nb_packets,
-  stream->stats.stream.nb_empty,
+  stream->stats.stream.nb_empty, empty_ratio,
   stream->stats.stream.nb_errors,
   stream->stats.stream.nb_invalid);
count += scnprintf(buf + count, size - count,
@@ -946,6 +1002,55 @@ size_t uvc_video_stats_dump(struct uvc_streaming *stream, 
char *buf,
   stream->stats.stream.min_sof,
   stream->stats.stream.max_sof,
   scr_sof_freq / 1000, scr_sof_freq % 1000);
+   count += scnprintf(buf + count, size - count,
+  "bytes %lld : duration %d\n", bytes, duration);
+
+   if (duration != 0) {
+   /* Duration is in milliseconds, * 100 to gain 2 dp precision */
+   fps = stream->stats.stream.nb_frames * 1000 * 100 / duration;
+   /* CPU usage as a % with 6 decimal places */
+   cpu = div_u64(stream->stats.urbstat.decode.duration, duration) *
+ 100;
+   }
+
+   count += scnprintf(buf + count, size - count, "FPS: %u.%02u\n",
+  fps / 100, 

[PATCH v4 2/4] WIP: add a dma_alloc_contiguous API

2020-11-30 Thread Ricardo Ribalda
From: Christoph Hellwig 

Add a new API that returns a virtually non-contigous array of pages
and dma address.  This API is only implemented for dma-iommu and will
not be implemented for non-iommu DMA API instances that have to allocate
contiguous memory.  It is up to the caller to check if the API is
available.

The intent is that media drivers can use this API if either:

 - no kernel mapping or only temporary kernel mappings are required.
   That is as a better replacement for DMA_ATTR_NO_KERNEL_MAPPING
 - a kernel mapping is required for cached and DMA mapped pages, but
   the driver also needs the pages to e.g. map them to userspace.
   In that sense it is a replacement for some aspects of the recently
   removed and never fully implemented DMA_ATTR_NON_CONSISTENT

Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/dma-iommu.c   | 84 +
 include/linux/dma-map-ops.h |  4 ++
 include/linux/dma-mapping.h |  5 +++
 kernel/dma/mapping.c| 35 
 4 files changed, 91 insertions(+), 37 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 73249732afd3..2e72fe1b9c3b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -564,23 +564,12 @@ static struct page **__iommu_dma_alloc_pages(struct 
device *dev,
return pages;
 }
 
-/**
- * iommu_dma_alloc_remap - Allocate and map a buffer contiguous in IOVA space
- * @dev: Device to allocate memory for. Must be a real device
- *  attached to an iommu_dma_domain
- * @size: Size of buffer in bytes
- * @dma_handle: Out argument for allocated DMA handle
- * @gfp: Allocation flags
- * @prot: pgprot_t to use for the remapped mapping
- * @attrs: DMA attributes for this allocation
- *
- * If @size is less than PAGE_SIZE, then a full CPU page will be allocated,
+/*
+ * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
  * but an IOMMU which supports smaller pages might not map the whole thing.
- *
- * Return: Mapped virtual address, or NULL on failure.
  */
-static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
-   dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
+static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
+   size_t size, dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
unsigned long attrs)
 {
struct iommu_domain *domain = iommu_get_dma_domain(dev);
@@ -592,7 +581,6 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
struct page **pages;
struct sg_table sgt;
dma_addr_t iova;
-   void *vaddr;
 
*dma_handle = DMA_MAPPING_ERROR;
 
@@ -635,17 +623,10 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
< size)
goto out_free_sg;
 
-   vaddr = dma_common_pages_remap(pages, size, prot,
-   __builtin_return_address(0));
-   if (!vaddr)
-   goto out_unmap;
-
*dma_handle = iova;
sg_free_table();
-   return vaddr;
+   return pages;
 
-out_unmap:
-   __iommu_dma_unmap(dev, iova, size);
 out_free_sg:
sg_free_table();
 out_free_iova:
@@ -655,20 +636,45 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
return NULL;
 }
 
-/**
- * __iommu_dma_mmap - Map a buffer into provided user VMA
- * @pages: Array representing buffer from __iommu_dma_alloc()
- * @size: Size of buffer in bytes
- * @vma: VMA describing requested userspace mapping
- *
- * Maps the pages of the buffer in @pages into @vma. The caller is responsible
- * for verifying the correct size and protection of @vma beforehand.
- */
-static int __iommu_dma_mmap(struct page **pages, size_t size,
-   struct vm_area_struct *vma)
+static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
+   dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
+   unsigned long attrs)
 {
-   return vm_map_pages(vma, pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+   struct page **pages;
+   void *vaddr;
+
+   pages = __iommu_dma_alloc_noncontiguous(dev, size, dma_handle, gfp,
+   prot, attrs);
+   if (!pages)
+   return NULL;
+   vaddr = dma_common_pages_remap(pages, size, prot,
+   __builtin_return_address(0));
+   if (!vaddr)
+   goto out_unmap;
+   return vaddr;
+
+out_unmap:
+   __iommu_dma_unmap(dev, *dma_handle, size);
+   __iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+   return NULL;
+}
+
+#ifdef CONFIG_DMA_REMAP
+static struct page **iommu_dma_alloc_noncontiguous(struct device *dev,
+   size_t size, dma_addr_t *dma_handle, gfp_t gfp,
+   unsigned long attrs)
+{
+   return __iommu_dma_alloc_noncontiguous(dev, size, dma_handle, gfp,
+  PAGE_KERNEL, 

[PATCH v4 3/4] media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-30 Thread Ricardo Ribalda
On architectures where the is no coherent caching such as ARM use the
dma_alloc_noncontiguos API and handle manually the cache flushing using
dma_sync_sgtable().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Signed-off-by: Ricardo Ribalda 
---
 drivers/media/usb/uvc/uvc_video.c | 92 +++
 drivers/media/usb/uvc/uvcvideo.h  |  2 +
 2 files changed, 72 insertions(+), 22 deletions(-)

v4: Thanks to Crhistoph and Sergei

- Remove the  CONFIG_DMA_NONCOHERENT path
- Do not pass  __GFP_NOWARN

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index a6a441d92b94..59ade244abfb 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -1097,6 +1097,11 @@ static int uvc_video_decode_start(struct uvc_streaming 
*stream,
return data[0];
 }
 
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return stream->dev->udev->bus->controller->parent;
+}
+
 /*
  * uvc_video_decode_data_work: Asynchronous memcpy processing
  *
@@ -1118,6 +1123,9 @@ static void uvc_video_copy_data_work(struct work_struct 
*work)
uvc_queue_buffer_release(op->buf);
}
 
+   if (uvc_urb->pages)
+   dma_sync_sgtable_for_device(stream_to_dmadev(uvc_urb->stream),
+   _urb->sgt, DMA_FROM_DEVICE);
ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
if (ret < 0)
uvc_printk(KERN_ERR, "Failed to resubmit video URB (%d).\n",
@@ -1539,10 +1547,17 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   if (uvc_urb->pages)
+   dma_sync_sgtable_for_cpu(stream_to_dmadev(stream),
+_urb->sgt, DMA_FROM_DEVICE);
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
if (!uvc_urb->async_operations) {
+   if (uvc_urb->pages)
+   dma_sync_sgtable_for_device(stream_to_dmadev(stream),
+   _urb->sgt,
+   DMA_FROM_DEVICE);
ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
if (ret < 0)
uvc_printk(KERN_ERR,
@@ -1565,18 +1580,61 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
if (!uvc_urb->buffer)
continue;
 
-#ifndef CONFIG_DMA_NONCOHERENT
-   usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
-#else
-   kfree(uvc_urb->buffer);
-#endif
+   if (uvc_urb->pages) {
+   sg_free_table(_urb->sgt);
+   vunmap(uvc_urb->buffer);
+   dma_free_noncontiguous(stream_to_dmadev(stream),
+  stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   } else {
+   usb_free_coherent(stream->dev->udev, stream->urb_size,
+ uvc_urb->buffer, uvc_urb->dma);
+   }
uvc_urb->buffer = NULL;
}
 
stream->urb_size = 0;
 }
 
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
+
+   if (!dma_can_alloc_noncontiguous(dma_dev)) {
+   uvc_urb->buffer = usb_alloc_coherent(stream->dev->udev,
+stream->urb_size,
+gfp_flags | __GFP_NOWARN,
+_urb->dma);
+   return uvc_urb->buffer != NULL;
+   }
+
+   uvc_urb->pages = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
+_urb->dma, gfp_flags, 0);
+   if (!uvc_urb->pages)
+   return false;
+
+   uvc_urb->buffer = vmap(uvc_urb->pages,
+  PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT,
+  VM_MAP, PAGE_KERNEL);
+   if (!uvc_urb->buffer) {
+   dma_free_noncontiguous(dma_dev, stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   return false;
+   }
+
+   if (sg_alloc

[PATCH v4 1/4] dma-mapping: remove the {alloc, free}_noncoherent methods

2020-11-30 Thread Ricardo Ribalda
From: Christoph Hellwig 

It turns out allowing non-contigous allocations here was a rather bad
idea, as we'll now need to define ways to get the pages for mmaping
or dma_buf sharing.  Revert this change and stick to the original
concept.  A different API for the use case of non-contigous allocations
will be added back later.

Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/dma-iommu.c   | 30 --
 include/linux/dma-map-ops.h |  5 -
 kernel/dma/mapping.c| 33 ++---
 3 files changed, 6 insertions(+), 62 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 0cbcd3fc3e7e..73249732afd3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1054,34 +1054,6 @@ static void *iommu_dma_alloc(struct device *dev, size_t 
size,
return cpu_addr;
 }
 
-#ifdef CONFIG_DMA_REMAP
-static void *iommu_dma_alloc_noncoherent(struct device *dev, size_t size,
-   dma_addr_t *handle, enum dma_data_direction dir, gfp_t gfp)
-{
-   if (!gfpflags_allow_blocking(gfp)) {
-   struct page *page;
-
-   page = dma_common_alloc_pages(dev, size, handle, dir, gfp);
-   if (!page)
-   return NULL;
-   return page_address(page);
-   }
-
-   return iommu_dma_alloc_remap(dev, size, handle, gfp | __GFP_ZERO,
-PAGE_KERNEL, 0);
-}
-
-static void iommu_dma_free_noncoherent(struct device *dev, size_t size,
-   void *cpu_addr, dma_addr_t handle, enum dma_data_direction dir)
-{
-   __iommu_dma_unmap(dev, handle, size);
-   __iommu_dma_free(dev, size, cpu_addr);
-}
-#else
-#define iommu_dma_alloc_noncoherentNULL
-#define iommu_dma_free_noncoherent NULL
-#endif /* CONFIG_DMA_REMAP */
-
 static int iommu_dma_mmap(struct device *dev, struct vm_area_struct *vma,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs)
@@ -1152,8 +1124,6 @@ static const struct dma_map_ops iommu_dma_ops = {
.free   = iommu_dma_free,
.alloc_pages= dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
-   .alloc_noncoherent  = iommu_dma_alloc_noncoherent,
-   .free_noncoherent   = iommu_dma_free_noncoherent,
.mmap   = iommu_dma_mmap,
.get_sgtable= iommu_dma_get_sgtable,
.map_page   = iommu_dma_map_page,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index a5f89fc4d6df..3d1f91464bcf 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -22,11 +22,6 @@ struct dma_map_ops {
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
dma_addr_t dma_handle, enum dma_data_direction dir);
-   void *(*alloc_noncoherent)(struct device *dev, size_t size,
-   dma_addr_t *dma_handle, enum dma_data_direction dir,
-   gfp_t gfp);
-   void (*free_noncoherent)(struct device *dev, size_t size, void *vaddr,
-   dma_addr_t dma_handle, enum dma_data_direction dir);
int (*mmap)(struct device *, struct vm_area_struct *,
void *, dma_addr_t, size_t, unsigned long attrs);
 
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..d3032513c54b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -514,40 +514,19 @@ EXPORT_SYMBOL_GPL(dma_free_pages);
 void *dma_alloc_noncoherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
 {
-   const struct dma_map_ops *ops = get_dma_ops(dev);
-   void *vaddr;
-
-   if (!ops || !ops->alloc_noncoherent) {
-   struct page *page;
-
-   page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
-   if (!page)
-   return NULL;
-   return page_address(page);
-   }
+   struct page *page;
 
-   size = PAGE_ALIGN(size);
-   vaddr = ops->alloc_noncoherent(dev, size, dma_handle, dir, gfp);
-   if (vaddr)
-   debug_dma_map_page(dev, virt_to_page(vaddr), 0, size, dir,
-  *dma_handle);
-   return vaddr;
+   page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
+   if (!page)
+   return NULL;
+   return page_address(page);
 }
 EXPORT_SYMBOL_GPL(dma_alloc_noncoherent);
 
 void dma_free_noncoherent(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, enum dma_data_direction dir)
 {
-   const struct dma_map_ops *ops = get_dma_ops(dev);
-
-   if (!ops || !ops->free_noncoherent) {
-   dma_free_pages(dev, size, virt_to_page(vaddr), dma_handle, dir);
- 

[PATCH v3 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-25 Thread Ricardo Ribalda
On architectures where the is no coherent caching such as ARM use the
dma_alloc_noncontiguos API and handle manually the cache flushing using
dma_sync_sgtable().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Signed-off-by: Ricardo Ribalda 
---

v3: Thanks to Marek Szyprowski

Use dma_sync_sgtable and _for_cpu()

 drivers/media/usb/uvc/uvc_video.c | 93 +++
 drivers/media/usb/uvc/uvcvideo.h  |  2 +
 2 files changed, 83 insertions(+), 12 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index a6a441d92b94..06ebd7a3877b 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -1097,6 +1097,11 @@ static int uvc_video_decode_start(struct uvc_streaming 
*stream,
return data[0];
 }
 
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return stream->dev->udev->bus->controller->parent;
+}
+
 /*
  * uvc_video_decode_data_work: Asynchronous memcpy processing
  *
@@ -1118,6 +1123,9 @@ static void uvc_video_copy_data_work(struct work_struct 
*work)
uvc_queue_buffer_release(op->buf);
}
 
+   if (uvc_urb->pages)
+   dma_sync_sgtable_for_device(stream_to_dmadev(uvc_urb->stream),
+   _urb->sgt, DMA_FROM_DEVICE);
ret = usb_submit_urb(uvc_urb->urb, GFP_KERNEL);
if (ret < 0)
uvc_printk(KERN_ERR, "Failed to resubmit video URB (%d).\n",
@@ -1539,10 +1547,17 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   if (uvc_urb->pages)
+   dma_sync_sgtable_for_cpu(stream_to_dmadev(stream),
+_urb->sgt, DMA_FROM_DEVICE);
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
if (!uvc_urb->async_operations) {
+   if (uvc_urb->pages)
+   dma_sync_sgtable_for_device(stream_to_dmadev(stream),
+   _urb->sgt,
+   DMA_FROM_DEVICE);
ret = usb_submit_urb(uvc_urb->urb, GFP_ATOMIC);
if (ret < 0)
uvc_printk(KERN_ERR,
@@ -1566,8 +1581,16 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
continue;
 
 #ifndef CONFIG_DMA_NONCOHERENT
-   usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
+   if (uvc_urb->pages) {
+   sg_free_table(_urb->sgt);
+   vunmap(uvc_urb->buffer);
+   dma_free_noncontiguous(stream_to_dmadev(stream),
+  stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   } else {
+   usb_free_coherent(stream->dev->udev, stream->urb_size,
+ uvc_urb->buffer, uvc_urb->dma);
+   }
 #else
kfree(uvc_urb->buffer);
 #endif
@@ -1577,6 +1600,56 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
stream->urb_size = 0;
 }
 
+#ifndef CONFIG_DMA_NONCOHERENT
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
+
+   if (!dma_can_alloc_noncontiguous(dma_dev)) {
+   uvc_urb->buffer = usb_alloc_coherent(stream->dev->udev,
+stream->urb_size,
+gfp_flags | __GFP_NOWARN,
+_urb->dma);
+   return uvc_urb->buffer != NULL;
+   }
+
+   uvc_urb->pages = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
+_urb->dma,
+gfp_flags | __GFP_NOWARN, 0);
+   if (!uvc_urb->pages)
+   return false;
+
+   uvc_urb->buffer = vmap(uvc_urb->pages,
+  PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT,
+  VM_DMA_COHERENT, PAGE_KERNEL);
+   if (!uvc_urb->buffer) {
+   dma_free_noncontiguous(dma_dev, stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+ 

[PATCH v2 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-25 Thread Ricardo Ribalda
On architectures where the is no coherent caching such as ARM use the
dma_alloc_noncontiguos API and handle manually the cache flushing using
dma_sync_sg().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Signed-off-by: Ricardo Ribalda 
---

v2: Thanks to Robin!

Use dma_sync_sg instead of dma_sync_single


 drivers/media/usb/uvc/uvc_video.c | 83 ++-
 drivers/media/usb/uvc/uvcvideo.h  |  2 +
 2 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index a6a441d92b94..b2e6a9522999 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -1490,6 +1490,11 @@ static void uvc_video_encode_bulk(struct uvc_urb 
*uvc_urb,
urb->transfer_buffer_length = stream->urb_size - len;
 }
 
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return stream->dev->udev->bus->controller->parent;
+}
+
 static void uvc_video_complete(struct urb *urb)
 {
struct uvc_urb *uvc_urb = urb->context;
@@ -1539,6 +1544,10 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   if (uvc_urb->pages) {
+   dma_sync_sg_for_cpu(stream_to_dmadev(stream), uvc_urb->sgt.sgl,
+   uvc_urb->sgt.nents, DMA_FROM_DEVICE);
+   }
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
@@ -1566,8 +1575,16 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
continue;
 
 #ifndef CONFIG_DMA_NONCOHERENT
-   usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
+   if (uvc_urb->pages) {
+   sg_free_table(_urb->sgt);
+   vunmap(uvc_urb->buffer);
+   dma_free_noncontiguous(stream_to_dmadev(stream),
+  stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   } else {
+   usb_free_coherent(stream->dev->udev, stream->urb_size,
+ uvc_urb->buffer, uvc_urb->dma);
+   }
 #else
kfree(uvc_urb->buffer);
 #endif
@@ -1577,6 +1594,56 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
stream->urb_size = 0;
 }
 
+#ifndef CONFIG_DMA_NONCOHERENT
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
+
+   if (!dma_can_alloc_noncontiguous(dma_dev)) {
+   uvc_urb->buffer = usb_alloc_coherent(stream->dev->udev,
+stream->urb_size,
+gfp_flags | __GFP_NOWARN,
+_urb->dma);
+   return uvc_urb->buffer != NULL;
+   }
+
+   uvc_urb->pages = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
+_urb->dma,
+gfp_flags | __GFP_NOWARN, 0);
+   if (!uvc_urb->pages)
+   return false;
+
+   uvc_urb->buffer = vmap(uvc_urb->pages,
+  PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT,
+  VM_DMA_COHERENT, PAGE_KERNEL);
+   if (!uvc_urb->buffer) {
+   dma_free_noncontiguous(dma_dev, stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   return false;
+   }
+
+   if (sg_alloc_table_from_pages(_urb->sgt, uvc_urb->pages,
+   PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT, 0,
+   stream->urb_size, GFP_KERNEL)) {
+   vunmap(uvc_urb->buffer);
+   dma_free_noncontiguous(dma_dev, stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   return false;
+   }
+
+   return true;
+}
+#else
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+   uvc_urb->buffer = kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN);
+
+   return uvc_urb->buffer != NULL;
+}
+#endif
+
 /*
  * Allocate transfer buffers. This function can be called with 

Re: [PATCH 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-24 Thread Ricardo Ribalda
Hi Robin

On Tue, Nov 24, 2020 at 5:29 PM Robin Murphy  wrote:
>
> On 2020-11-24 15:38, Ricardo Ribalda wrote:
> > On architectures where the is no coherent caching such as ARM use the
> > dma_alloc_noncontiguos API and handle manually the cache flushing using
> > dma_sync_single().
> >
> > With this patch on the affected architectures we can measure up to 20x
> > performance improvement in uvc_video_copy_data_work().
> >
> > Signed-off-by: Ricardo Ribalda 
> > ---
> >   drivers/media/usb/uvc/uvc_video.c | 74 ++-
> >   drivers/media/usb/uvc/uvcvideo.h  |  1 +
> >   2 files changed, 63 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/media/usb/uvc/uvc_video.c 
> > b/drivers/media/usb/uvc/uvc_video.c
> > index a6a441d92b94..9e90b261428a 100644
> > --- a/drivers/media/usb/uvc/uvc_video.c
> > +++ b/drivers/media/usb/uvc/uvc_video.c
> > @@ -1490,6 +1490,11 @@ static void uvc_video_encode_bulk(struct uvc_urb 
> > *uvc_urb,
> >   urb->transfer_buffer_length = stream->urb_size - len;
> >   }
> >
> > +static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
> > +{
> > + return stream->dev->udev->bus->controller->parent;
> > +}
> > +
> >   static void uvc_video_complete(struct urb *urb)
> >   {
> >   struct uvc_urb *uvc_urb = urb->context;
> > @@ -1539,6 +1544,11 @@ static void uvc_video_complete(struct urb *urb)
> >* Process the URB headers, and optionally queue expensive memcpy 
> > tasks
> >* to be deferred to a work queue.
> >*/
> > + if (uvc_urb->pages)
> > + dma_sync_single_for_cpu(stream_to_dmadev(stream),
> > + urb->transfer_dma,
> > + urb->transfer_buffer_length,
> > + DMA_FROM_DEVICE);
>
> This doesn't work. Even in iommu-dma, the streaming API still expects to
> work on physically-contiguous memory that could have been passed to
> dma_map_single() in the first place. As-is, this will invalidate
> transfer_buffer_length bytes from the start of the *first* physical
> page, and thus destroy random other data if lines from subsequent
> unrelated pages are dirty in caches.
>
> The only feasible way to do a DMA sync on disjoint pages in a single
> call is with a scatterlist.

Thanks for pointing this out. I guess I was lucky on my hardware and
the areas were always  contiguous.

Will rework and send back to the list.

Thanks again.

>
> Robin.
>
> >   stream->decode(uvc_urb, buf, buf_meta);
> >
> >   /* If no async work is needed, resubmit the URB immediately. */
> > @@ -1566,8 +1576,15 @@ static void uvc_free_urb_buffers(struct 
> > uvc_streaming *stream)
> >   continue;
> >
> >   #ifndef CONFIG_DMA_NONCOHERENT
> > - usb_free_coherent(stream->dev->udev, stream->urb_size,
> > -   uvc_urb->buffer, uvc_urb->dma);
> > + if (uvc_urb->pages) {
> > + vunmap(uvc_urb->buffer);
> > + dma_free_noncontiguous(stream_to_dmadev(stream),
> > +stream->urb_size,
> > +uvc_urb->pages, uvc_urb->dma);
> > + } else {
> > + usb_free_coherent(stream->dev->udev, stream->urb_size,
> > +   uvc_urb->buffer, uvc_urb->dma);
> > + }
> >   #else
> >   kfree(uvc_urb->buffer);
> >   #endif
> > @@ -1577,6 +1594,47 @@ static void uvc_free_urb_buffers(struct 
> > uvc_streaming *stream)
> >   stream->urb_size = 0;
> >   }
> >
> > +#ifndef CONFIG_DMA_NONCOHERENT
> > +static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
> > +  struct uvc_urb *uvc_urb, gfp_t gfp_flags)
> > +{
> > + struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
> > +
> > + if (!dma_can_alloc_noncontiguous(dma_dev)) {
> > + uvc_urb->buffer = usb_alloc_coherent(stream->dev->udev,
> > +  stream->urb_size,
> > +  gfp_flags | __GFP_NOWARN,
> > +  _urb->dma);
> > + return uvc_urb->buffer != NULL;
> > + }
> > +
> >

[PATCH 6/6] TEST-ONLY: media: uvcvideo: Add statistics for measuring performance

2020-11-24 Thread Ricardo Ribalda
From: Shik Chen 

Majorly based on [1], with the following tweaks:

* Use div_u64 for u64 divisions
* Calculate standard deviation
* Fix an uninitialized |min| field for header
* Apply clang-format

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/kbingham/rcar.git/commit/?h=uvc/async-ml=cebbd1b629bbe5f856ec5dc7591478c003f5a944

Signed-off-by: Shik Chen 
---
 drivers/media/usb/uvc/uvc_video.c | 163 +-
 drivers/media/usb/uvc/uvcvideo.h  |  21 
 2 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index 9e90b261428a..d3a515015003 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -906,12 +906,61 @@ static void uvc_video_stats_update(struct uvc_streaming 
*stream)
memset(>stats.frame, 0, sizeof(stream->stats.frame));
 }
 
+size_t uvc_video_dump_time_stats(char *buf, size_t size,
+struct uvc_stats_time *stat, const char *pfx)
+{
+   unsigned int avg = 0;
+   unsigned int std = 0;
+
+   if (stat->qty) {
+   avg = div_u64(stat->duration, stat->qty);
+   std = int_sqrt64(div_u64(stat->duration2, stat->qty) -
+avg * avg);
+   }
+
+   /* Stat durations are in nanoseconds, we present in micro-seconds */
+   return scnprintf(
+   buf, size,
+   "%s: %llu/%u uS/qty: %u.%03u avg %u.%03u std %u.%03u min 
%u.%03u max (uS)\n",
+   pfx, div_u64(stat->duration, 1000), stat->qty, avg / 1000,
+   avg % 1000, std / 1000, std % 1000, stat->min / 1000,
+   stat->min % 1000, stat->max / 1000, stat->max % 1000);
+}
+
+size_t uvc_video_dump_speed(char *buf, size_t size, const char *pfx, u64 bytes,
+   u64 milliseconds)
+{
+   unsigned int rate = 0;
+   bool gbit = false;
+
+   if (milliseconds)
+   rate = div_u64(bytes * 8, milliseconds);
+
+   if (rate >= 100) {
+   gbit = true;
+   rate /= 1000;
+   }
+
+   /*
+* bits/milliseconds == kilobits/seconds,
+* presented here as Mbits/s (or Gbit/s) with 3 decimal places
+*/
+   return scnprintf(buf, size, "%s: %d.%03d %sbits/s\n", pfx, rate / 1000,
+rate % 1000, gbit ? "G" : "M");
+}
+
 size_t uvc_video_stats_dump(struct uvc_streaming *stream, char *buf,
size_t size)
 {
+   u64 bytes = stream->stats.stream.bytes; /* Single sample */
+   unsigned int empty_ratio = 0;
unsigned int scr_sof_freq;
unsigned int duration;
+   unsigned int fps = 0;
size_t count = 0;
+   u64 cpu = 0;
+   u64 cpu_q = 0;
+   u32 cpu_r = 0;
 
/* Compute the SCR.SOF frequency estimate. At the nominal 1kHz SOF
 * frequency this will not overflow before more than 1h.
@@ -924,12 +973,19 @@ size_t uvc_video_stats_dump(struct uvc_streaming *stream, 
char *buf,
else
scr_sof_freq = 0;
 
+   if (stream->stats.stream.nb_packets)
+   empty_ratio = stream->stats.stream.nb_empty * 100 /
+ stream->stats.stream.nb_packets;
+
count += scnprintf(buf + count, size - count,
-  "frames:  %u\npackets: %u\nempty:   %u\n"
-  "errors:  %u\ninvalid: %u\n",
+  "frames:  %u\n"
+  "packets: %u\n"
+  "empty:   %u (%u %%)\n"
+  "errors:  %u\n"
+  "invalid: %u\n",
   stream->stats.stream.nb_frames,
   stream->stats.stream.nb_packets,
-  stream->stats.stream.nb_empty,
+  stream->stats.stream.nb_empty, empty_ratio,
   stream->stats.stream.nb_errors,
   stream->stats.stream.nb_invalid);
count += scnprintf(buf + count, size - count,
@@ -946,6 +1002,55 @@ size_t uvc_video_stats_dump(struct uvc_streaming *stream, 
char *buf,
   stream->stats.stream.min_sof,
   stream->stats.stream.max_sof,
   scr_sof_freq / 1000, scr_sof_freq % 1000);
+   count += scnprintf(buf + count, size - count,
+  "bytes %lld : duration %d\n", bytes, duration);
+
+   if (duration != 0) {
+   /* Duration is in milliseconds, * 100 to gain 2 dp precision */
+   fps = stream->stats.stream.nb_frames * 1000 * 100 / duration;
+   /* CPU usage as a % with 6 decimal places */
+   cpu = div_u64(stream->stats.urbstat.decode.duration, duration) *
+ 100;
+   }
+
+   count += scnprintf(buf + count, size - count, "FPS: %u.%02u\n",
+  fps / 100, 

[PATCH 5/6] media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-24 Thread Ricardo Ribalda
On architectures where the is no coherent caching such as ARM use the
dma_alloc_noncontiguos API and handle manually the cache flushing using
dma_sync_single().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Signed-off-by: Ricardo Ribalda 
---
 drivers/media/usb/uvc/uvc_video.c | 74 ++-
 drivers/media/usb/uvc/uvcvideo.h  |  1 +
 2 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index a6a441d92b94..9e90b261428a 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -1490,6 +1490,11 @@ static void uvc_video_encode_bulk(struct uvc_urb 
*uvc_urb,
urb->transfer_buffer_length = stream->urb_size - len;
 }
 
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return stream->dev->udev->bus->controller->parent;
+}
+
 static void uvc_video_complete(struct urb *urb)
 {
struct uvc_urb *uvc_urb = urb->context;
@@ -1539,6 +1544,11 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   if (uvc_urb->pages)
+   dma_sync_single_for_cpu(stream_to_dmadev(stream),
+   urb->transfer_dma,
+   urb->transfer_buffer_length,
+   DMA_FROM_DEVICE);
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
@@ -1566,8 +1576,15 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
continue;
 
 #ifndef CONFIG_DMA_NONCOHERENT
-   usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
+   if (uvc_urb->pages) {
+   vunmap(uvc_urb->buffer);
+   dma_free_noncontiguous(stream_to_dmadev(stream),
+  stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   } else {
+   usb_free_coherent(stream->dev->udev, stream->urb_size,
+ uvc_urb->buffer, uvc_urb->dma);
+   }
 #else
kfree(uvc_urb->buffer);
 #endif
@@ -1577,6 +1594,47 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
stream->urb_size = 0;
 }
 
+#ifndef CONFIG_DMA_NONCOHERENT
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
+
+   if (!dma_can_alloc_noncontiguous(dma_dev)) {
+   uvc_urb->buffer = usb_alloc_coherent(stream->dev->udev,
+stream->urb_size,
+gfp_flags | __GFP_NOWARN,
+_urb->dma);
+   return uvc_urb->buffer != NULL;
+   }
+
+   uvc_urb->pages = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
+_urb->dma,
+gfp_flags | __GFP_NOWARN, 0);
+   if (!uvc_urb->pages)
+   return false;
+
+   uvc_urb->buffer = vmap(uvc_urb->pages,
+  PAGE_ALIGN(stream->urb_size) >> PAGE_SHIFT,
+  VM_DMA_COHERENT, PAGE_KERNEL);
+   if (!uvc_urb->buffer) {
+   dma_free_noncontiguous(dma_dev, stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   return false;
+   }
+
+   return true;
+}
+#else
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream,
+struct uvc_urb *uvc_urb, gfp_t gfp_flags)
+{
+   uvc_urb->buffer = kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN);
+
+   return uvc_urb->buffer != NULL;
+}
+#endif
+
 /*
  * Allocate transfer buffers. This function can be called with buffers
  * already allocated when resuming from suspend, in which case it will
@@ -1607,19 +1665,11 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming 
*stream,
 
/* Retry allocations until one succeed. */
for (; npackets > 1; npackets /= 2) {
+   stream->urb_size = psize * npackets;
for (i = 0; i < UVC_URBS; ++i) {
struct uvc_urb *uvc_urb = >uvc_urb[i];
 
-   stream->urb_

[PATCH 3/6] dma-iommu: remove __iommu_dma_mmap

2020-11-24 Thread Ricardo Ribalda
From: Christoph Hellwig 

The function has a single caller, so open code it there and take
advantage of the precalculated page count variable.

Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/dma-iommu.c | 17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 73249732afd3..a2fb92de7e3d 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -655,21 +655,6 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
return NULL;
 }
 
-/**
- * __iommu_dma_mmap - Map a buffer into provided user VMA
- * @pages: Array representing buffer from __iommu_dma_alloc()
- * @size: Size of buffer in bytes
- * @vma: VMA describing requested userspace mapping
- *
- * Maps the pages of the buffer in @pages into @vma. The caller is responsible
- * for verifying the correct size and protection of @vma beforehand.
- */
-static int __iommu_dma_mmap(struct page **pages, size_t size,
-   struct vm_area_struct *vma)
-{
-   return vm_map_pages(vma, pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
-}
-
 static void iommu_dma_sync_single_for_cpu(struct device *dev,
dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
 {
@@ -1074,7 +1059,7 @@ static int iommu_dma_mmap(struct device *dev, struct 
vm_area_struct *vma,
struct page **pages = dma_common_find_pages(cpu_addr);
 
if (pages)
-   return __iommu_dma_mmap(pages, size, vma);
+   return vm_map_pages(vma, pages, nr_pages);
pfn = vmalloc_to_pfn(cpu_addr);
} else {
pfn = page_to_pfn(virt_to_page(cpu_addr));
-- 
2.29.2.454.gaff20da3a2-goog

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 4/6] WIP: add a dma_alloc_contiguous API

2020-11-24 Thread Ricardo Ribalda
From: Christoph Hellwig 

Add a new API that returns a virtually non-contigous array of pages
and dma address.  This API is only implemented for dma-iommu and will
not be implemented for non-iommu DMA API instances that have to allocate
contiguous memory.  It is up to the caller to check if the API is
available.

The intent is that media drivers can use this API if either:

 - no kernel mapping or only temporary kernel mappings are required.
   That is as a better replacement for DMA_ATTR_NO_KERNEL_MAPPING
 - a kernel mapping is required for cached and DMA mapped pages, but
   the driver also needs the pages to e.g. map them to userspace.
   In that sense it is a replacement for some aspects of the recently
   removed and never fully implemented DMA_ATTR_NON_CONSISTENT

Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/dma-iommu.c   | 73 +
 include/linux/dma-map-ops.h |  4 ++
 include/linux/dma-mapping.h |  5 +++
 kernel/dma/mapping.c| 35 ++
 4 files changed, 93 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index a2fb92de7e3d..2e72fe1b9c3b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -564,23 +564,12 @@ static struct page **__iommu_dma_alloc_pages(struct 
device *dev,
return pages;
 }
 
-/**
- * iommu_dma_alloc_remap - Allocate and map a buffer contiguous in IOVA space
- * @dev: Device to allocate memory for. Must be a real device
- *  attached to an iommu_dma_domain
- * @size: Size of buffer in bytes
- * @dma_handle: Out argument for allocated DMA handle
- * @gfp: Allocation flags
- * @prot: pgprot_t to use for the remapped mapping
- * @attrs: DMA attributes for this allocation
- *
- * If @size is less than PAGE_SIZE, then a full CPU page will be allocated,
+/*
+ * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
  * but an IOMMU which supports smaller pages might not map the whole thing.
- *
- * Return: Mapped virtual address, or NULL on failure.
  */
-static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
-   dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
+static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
+   size_t size, dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
unsigned long attrs)
 {
struct iommu_domain *domain = iommu_get_dma_domain(dev);
@@ -592,7 +581,6 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
struct page **pages;
struct sg_table sgt;
dma_addr_t iova;
-   void *vaddr;
 
*dma_handle = DMA_MAPPING_ERROR;
 
@@ -635,17 +623,10 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
< size)
goto out_free_sg;
 
-   vaddr = dma_common_pages_remap(pages, size, prot,
-   __builtin_return_address(0));
-   if (!vaddr)
-   goto out_unmap;
-
*dma_handle = iova;
sg_free_table();
-   return vaddr;
+   return pages;
 
-out_unmap:
-   __iommu_dma_unmap(dev, iova, size);
 out_free_sg:
sg_free_table();
 out_free_iova:
@@ -655,6 +636,46 @@ static void *iommu_dma_alloc_remap(struct device *dev, 
size_t size,
return NULL;
 }
 
+static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
+   dma_addr_t *dma_handle, gfp_t gfp, pgprot_t prot,
+   unsigned long attrs)
+{
+   struct page **pages;
+   void *vaddr;
+
+   pages = __iommu_dma_alloc_noncontiguous(dev, size, dma_handle, gfp,
+   prot, attrs);
+   if (!pages)
+   return NULL;
+   vaddr = dma_common_pages_remap(pages, size, prot,
+   __builtin_return_address(0));
+   if (!vaddr)
+   goto out_unmap;
+   return vaddr;
+
+out_unmap:
+   __iommu_dma_unmap(dev, *dma_handle, size);
+   __iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+   return NULL;
+}
+
+#ifdef CONFIG_DMA_REMAP
+static struct page **iommu_dma_alloc_noncontiguous(struct device *dev,
+   size_t size, dma_addr_t *dma_handle, gfp_t gfp,
+   unsigned long attrs)
+{
+   return __iommu_dma_alloc_noncontiguous(dev, size, dma_handle, gfp,
+  PAGE_KERNEL, attrs);
+}
+
+static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
+   struct page **pages, dma_addr_t dma_handle)
+{
+   __iommu_dma_unmap(dev, dma_handle, size);
+   __iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+}
+#endif
+
 static void iommu_dma_sync_single_for_cpu(struct device *dev,
dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
 {
@@ -1109,6 +1130,10 @@ static const struct dma_map_ops iommu_dma_ops = {
.free   = 

[PATCH 2/6] dma-direct: use __GFP_ZERO in dma_direct_alloc_pages

2020-11-24 Thread Ricardo Ribalda
From: Christoph Hellwig 

Prepare for supporting the DMA_ATTR_NO_KERNEL_MAPPING flag in
dma_alloc_pages.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/direct.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 06c111544f61..76c741e610fc 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -280,13 +280,12 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
 {
struct page *page;
-   void *ret;
 
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
 
-   page = __dma_direct_alloc_pages(dev, size, gfp);
+   page = __dma_direct_alloc_pages(dev, size, gfp | __GFP_ZERO);
if (!page)
return NULL;
if (PageHighMem(page)) {
@@ -300,13 +299,11 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
goto out_free_pages;
}
 
-   ret = page_address(page);
if (force_dma_unencrypted(dev)) {
-   if (set_memory_decrypted((unsigned long)ret,
+   if (set_memory_decrypted((unsigned long) page_address(page),
1 << get_order(size)))
goto out_free_pages;
}
-   memset(ret, 0, size);
*dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
return page;
 out_free_pages:
-- 
2.29.2.454.gaff20da3a2-goog

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/6] dma-mapping: remove the {alloc,free}_noncoherent methods

2020-11-24 Thread Ricardo Ribalda
From: Christoph Hellwig 

It turns out allowing non-contigous allocations here was a rather bad
idea, as we'll now need to define ways to get the pages for mmaping
or dma_buf sharing.  Revert this change and stick to the original
concept.  A different API for the use case of non-contigous allocations
will be added back later.

Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/dma-iommu.c   | 30 --
 include/linux/dma-map-ops.h |  5 -
 kernel/dma/mapping.c| 33 ++---
 3 files changed, 6 insertions(+), 62 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 0cbcd3fc3e7e..73249732afd3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1054,34 +1054,6 @@ static void *iommu_dma_alloc(struct device *dev, size_t 
size,
return cpu_addr;
 }
 
-#ifdef CONFIG_DMA_REMAP
-static void *iommu_dma_alloc_noncoherent(struct device *dev, size_t size,
-   dma_addr_t *handle, enum dma_data_direction dir, gfp_t gfp)
-{
-   if (!gfpflags_allow_blocking(gfp)) {
-   struct page *page;
-
-   page = dma_common_alloc_pages(dev, size, handle, dir, gfp);
-   if (!page)
-   return NULL;
-   return page_address(page);
-   }
-
-   return iommu_dma_alloc_remap(dev, size, handle, gfp | __GFP_ZERO,
-PAGE_KERNEL, 0);
-}
-
-static void iommu_dma_free_noncoherent(struct device *dev, size_t size,
-   void *cpu_addr, dma_addr_t handle, enum dma_data_direction dir)
-{
-   __iommu_dma_unmap(dev, handle, size);
-   __iommu_dma_free(dev, size, cpu_addr);
-}
-#else
-#define iommu_dma_alloc_noncoherentNULL
-#define iommu_dma_free_noncoherent NULL
-#endif /* CONFIG_DMA_REMAP */
-
 static int iommu_dma_mmap(struct device *dev, struct vm_area_struct *vma,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs)
@@ -1152,8 +1124,6 @@ static const struct dma_map_ops iommu_dma_ops = {
.free   = iommu_dma_free,
.alloc_pages= dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
-   .alloc_noncoherent  = iommu_dma_alloc_noncoherent,
-   .free_noncoherent   = iommu_dma_free_noncoherent,
.mmap   = iommu_dma_mmap,
.get_sgtable= iommu_dma_get_sgtable,
.map_page   = iommu_dma_map_page,
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index a5f89fc4d6df..3d1f91464bcf 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -22,11 +22,6 @@ struct dma_map_ops {
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
dma_addr_t dma_handle, enum dma_data_direction dir);
-   void *(*alloc_noncoherent)(struct device *dev, size_t size,
-   dma_addr_t *dma_handle, enum dma_data_direction dir,
-   gfp_t gfp);
-   void (*free_noncoherent)(struct device *dev, size_t size, void *vaddr,
-   dma_addr_t dma_handle, enum dma_data_direction dir);
int (*mmap)(struct device *, struct vm_area_struct *,
void *, dma_addr_t, size_t, unsigned long attrs);
 
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..d3032513c54b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -514,40 +514,19 @@ EXPORT_SYMBOL_GPL(dma_free_pages);
 void *dma_alloc_noncoherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
 {
-   const struct dma_map_ops *ops = get_dma_ops(dev);
-   void *vaddr;
-
-   if (!ops || !ops->alloc_noncoherent) {
-   struct page *page;
-
-   page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
-   if (!page)
-   return NULL;
-   return page_address(page);
-   }
+   struct page *page;
 
-   size = PAGE_ALIGN(size);
-   vaddr = ops->alloc_noncoherent(dev, size, dma_handle, dir, gfp);
-   if (vaddr)
-   debug_dma_map_page(dev, virt_to_page(vaddr), 0, size, dir,
-  *dma_handle);
-   return vaddr;
+   page = dma_alloc_pages(dev, size, dma_handle, dir, gfp);
+   if (!page)
+   return NULL;
+   return page_address(page);
 }
 EXPORT_SYMBOL_GPL(dma_alloc_noncoherent);
 
 void dma_free_noncoherent(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, enum dma_data_direction dir)
 {
-   const struct dma_map_ops *ops = get_dma_ops(dev);
-
-   if (!ops || !ops->free_noncoherent) {
-   dma_free_pages(dev, size, virt_to_page(vaddr), dma_handle, dir);
- 

Re: [PATCH] WIP! media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-24 Thread Ricardo Ribalda
HI Christoph

On Tue, Nov 24, 2020 at 12:35 PM Christoph Hellwig  wrote:
>
> On Wed, Nov 18, 2020 at 03:25:46PM +0100, Ricardo Ribalda wrote:
> > On architectures where the is no coherent caching such as ARM use the
> > dma_alloc_noncontiguos API and handle manually the cache flushing using
> > dma_sync_single().
> >
> > With this patch on the affected architectures we can measure up to 20x
> > performance improvement in uvc_video_copy_data_work().
>
> This has a bunch of crazy long lines, but otherwise looks fine to me.

That is easy to solve :)

https://github.com/ribalda/linux/commit/17ab65a08302e845ad7ae7775ce54b387a58a887

>
> >
> > Signed-off-by: Ricardo Ribalda 
> > ---
> >
> > This patch depends on dma_alloc_contiguous API1315351diffmboxseries
>
> How do we want to proceed?  Do the media maintainers want to pick up
> that patch?  Should I pick up the media patch in the dma-mapping tree?

I was hoping that you could answer that question :).

Do you have other use-cases than linux-media in mind?

I think Sergey wants to experiment also with vb2, to figure out how
much it affects it.
His change will be much more complicated than mine thought, there are
more cornercases there.

>
> Can you respost a combined series to get started?

Sure. Shall I also include the profiling patch?


Best regards
-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH] WIP! media: uvcvideo: Use dma_alloc_noncontiguos API

2020-11-18 Thread Ricardo Ribalda
On architectures where the is no coherent caching such as ARM use the
dma_alloc_noncontiguos API and handle manually the cache flushing using
dma_sync_single().

With this patch on the affected architectures we can measure up to 20x
performance improvement in uvc_video_copy_data_work().

Signed-off-by: Ricardo Ribalda 
---

This patch depends on dma_alloc_contiguous API1315351diffmboxseries

https://lore.kernel.org/patchwork/patch/1315351/#1535182

 drivers/media/usb/uvc/uvc_video.c | 69 +--
 drivers/media/usb/uvc/uvcvideo.h  |  1 +
 2 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/drivers/media/usb/uvc/uvc_video.c 
b/drivers/media/usb/uvc/uvc_video.c
index ff624bb857d3..ef1b029b8576 100644
--- a/drivers/media/usb/uvc/uvc_video.c
+++ b/drivers/media/usb/uvc/uvc_video.c
@@ -1641,6 +1641,11 @@ static void uvc_video_encode_bulk(struct uvc_urb 
*uvc_urb,
urb->transfer_buffer_length = stream->urb_size - len;
 }
 
+static inline struct device *stream_to_dmadev(struct uvc_streaming *stream)
+{
+   return stream->dev->udev->bus->controller->parent;
+}
+
 static void uvc_video_complete(struct urb *urb)
 {
struct uvc_urb *uvc_urb = urb->context;
@@ -1693,6 +1698,11 @@ static void uvc_video_complete(struct urb *urb)
 * Process the URB headers, and optionally queue expensive memcpy tasks
 * to be deferred to a work queue.
 */
+   if (uvc_urb->pages)
+   dma_sync_single_for_cpu(stream_to_dmadev(stream),
+   urb->transfer_dma,
+   urb->transfer_buffer_length,
+   DMA_FROM_DEVICE);
stream->decode(uvc_urb, buf, buf_meta);
 
/* If no async work is needed, resubmit the URB immediately. */
@@ -1723,8 +1733,15 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
continue;
 
 #ifndef CONFIG_DMA_NONCOHERENT
-   usb_free_coherent(stream->dev->udev, stream->urb_size,
- uvc_urb->buffer, uvc_urb->dma);
+   if (uvc_urb->pages) {
+   vunmap(uvc_urb->buffer);
+   dma_free_noncontiguous(stream_to_dmadev(stream),
+  stream->urb_size,
+  uvc_urb->pages, uvc_urb->dma);
+   } else {
+   usb_free_coherent(stream->dev->udev, stream->urb_size,
+ uvc_urb->buffer, uvc_urb->dma);
+   }
 #else
kfree(uvc_urb->buffer);
 #endif
@@ -1734,6 +1751,42 @@ static void uvc_free_urb_buffers(struct uvc_streaming 
*stream)
stream->urb_size = 0;
 }
 
+#ifndef CONFIG_DMA_NONCOHERENT
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream, struct uvc_urb 
*uvc_urb,
+gfp_t gfp_flags)
+{
+   struct device *dma_dev = dma_dev = stream_to_dmadev(stream);
+
+   if (!dma_can_alloc_noncontiguous(dma_dev)) {
+   uvc_urb->buffer = usb_alloc_coherent(stream->dev->udev, 
stream->urb_size,
+gfp_flags | __GFP_NOWARN, 
_urb->dma);
+   return uvc_urb->buffer != NULL;
+   }
+
+   uvc_urb->pages = dma_alloc_noncontiguous(dma_dev, stream->urb_size,
+_urb->dma, gfp_flags | 
__GFP_NOWARN, 0);
+   if (!uvc_urb->pages)
+   return false;
+
+   uvc_urb->buffer = vmap(uvc_urb->pages, PAGE_ALIGN(stream->urb_size) >> 
PAGE_SHIFT,
+  VM_DMA_COHERENT, PAGE_KERNEL);
+   if (!uvc_urb->buffer) {
+   dma_free_noncontiguous(dma_dev, stream->urb_size, 
uvc_urb->pages, uvc_urb->dma);
+   return false;
+   }
+
+   return true;
+}
+#else
+static bool uvc_alloc_urb_buffer(struct uvc_streaming *stream, struct uvc_urb 
*uvc_urb,
+gfp_t gfp_flags)
+{
+   uvc_urb->buffer = kmalloc(stream->urb_size, gfp_flags | __GFP_NOWARN);
+
+   return uvc_urb->buffer != NULL;
+}
+#endif
+
 /*
  * Allocate transfer buffers. This function can be called with buffers
  * already allocated when resuming from suspend, in which case it will
@@ -1764,19 +1817,11 @@ static int uvc_alloc_urb_buffers(struct uvc_streaming 
*stream,
 
/* Retry allocations until one succeed. */
for (; npackets > 1; npackets /= 2) {
+   stream->urb_size = psize * npackets;
for (i = 0; i < UVC_URBS; ++i) {
struct uvc_urb *uvc_urb = >uvc_urb[i];
 
-   stream->urb_size = psize * npackets;
-#ifndef CONFIG_DMA_NONCOHERENT
-   uvc_ur

Re: [PATCH 8/8] WIP: add a dma_alloc_contiguous API

2020-11-17 Thread Ricardo Ribalda
Hi Christoph

I have been testing with real hardware on arm64 your patchset. And uvc
performs 20 times better using Kieran's test

https://github.com/ribalda/linux/tree/uvc-noncontiguous

These are the result of running   yavta --capture=1000


dma_alloc_noncontiguous

frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 78466000 : duration 33303
FPS: 29.99
URB: 418105/5000 uS/qty: 83.621 avg 98.783 std 17.396 min 1264.688 max (uS)
header: 100040/5000 uS/qty: 20.008 avg 19.458 std 2.969 min 454.167 max (uS)
latency: 347653/5000 uS/qty: 69.530 avg 98.937 std 9.114 min 1256.875 max (uS)
decode: 70452/5000 uS/qty: 14.090 avg 11.547 std 6.146 min 271.510 max (uS)
raw decode speed: 8.967 Gbits/s
raw URB handling speed: 1.501 Gbits/s
throughput: 18.848 Mbits/s
URB decode CPU usage 0.211500 %


usb_alloc_coherent

frames:  999
packets: 999
empty:   0 (0 %)
errors:  0
invalid: 0
pts: 0 early, 0 initial, 999 ok
scr: 0 count ok, 0 diff ok
sof: 2048 <= sof <= 0, freq 0.000 kHz
bytes 70501712 : duration 33319
FPS: 29.98
URB: 1854128/5000 uS/qty: 370.825 avg 417.133 std 14.539 min 2875.760 max (uS)
header: 98765/5000 uS/qty: 19.753 avg 30.714 std 1.042 min 573.463 max (uS)
latency: 453316/5000 uS/qty: 90.663 avg 114.987 std 4.065 min 860.795 max (uS)
decode: 1400811/5000 uS/qty: 280.162 avg 330.786 std 6.305 min 2758.202 max (uS)
raw decode speed: 402.866 Mbits/s
raw URB handling speed: 304.214 Mbits/s
throughput: 16.927 Mbits/s
URB decode CPU usage 4.204200 %


Best regards

On Tue, Nov 10, 2020 at 10:57 AM Christoph Hellwig  wrote:
>
> On Tue, Nov 10, 2020 at 06:50:32PM +0900, Tomasz Figa wrote:
> > In what terms it doesn't actually work? Last time I checked some
> > platforms actually defined CONFIG_DMA_NONCOHERENT, so those would
> > instead use the kmalloc() + dma_map() path. I don't have any
> > background on why that was added and whether it needs to be preserved,
> > though. Kieran, Laurent, do you have any insight?
>
> CONFIG_DMA_NONCOHERENT is set on sh and mips for platforms that may
> support non-coherent DMA at compile time (but at least for mips that
> doesn't actually means this gets used).  Using that ifdef to decide
> on using usb_alloc_coherent vs letting the usb layer map the data
> seems at best odd, and if we are unlucky papering over a bug somewhere.



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 8/8] WIP: add a dma_alloc_contiguous API

2020-11-10 Thread Ricardo Ribalda
Hi Christoph

On Tue, Nov 10, 2020 at 10:25 AM Christoph Hellwig  wrote:
>
> On Mon, Nov 09, 2020 at 03:53:55PM +0100, Ricardo Ribalda wrote:
> > Hi Christoph
> >
> > I have started now to give a try to your patchset. Sorry for the delay.
> >
> > For uvc I have prepared this patch:
> > https://github.com/ribalda/linux/commit/9094fe223fe38f8c8ff21366d893b43cbbdf0113
> >
> > I have tested successfully in a x86_64 noteboot..., yes I know there
> > is no change for that platform :).
> > I am trying to get hold of an arm device that can run the latest
> > kernel from upstream.
> >
> > On the meanwhile if you could take a look to the patch to verify that
> > this the way that you expect the drivers to use your api I would
> > appreciate it
>
> This looks pretty reaosnable.
>

Great

Also FYI, I managed to boot an ARM device with that tree. But I could
not test the uvc driver (it was a remote device with no usb device
attached)

Hopefully I will be able to test it for real this week.

Any suggestions for how to measure performance difference?

Thanks!

> Note that ifdef  CONFIG_DMA_NONCOHERENT in the old code doesn't actually
> work, as that option is an internal thing just for mips and sh..



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 8/8] WIP: add a dma_alloc_contiguous API

2020-11-09 Thread Ricardo Ribalda
L_GPL(dma_free_noncoherent);
> >
> > +bool dma_can_alloc_noncontiguous(struct device *dev)
> > +{
> > +   const struct dma_map_ops *ops = get_dma_ops(dev);
> > +
> > +   return ops && ops->free_noncontiguous;
> > +}
> > +EXPORT_SYMBOL_GPL(dma_can_alloc_noncontiguous);
> > +
> > +struct page **dma_alloc_noncontiguous(struct device *dev, size_t size,
> > +   dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> > +{
> > +   const struct dma_map_ops *ops = get_dma_ops(dev);
> > +
> > +   if (WARN_ON_ONCE(!dma_can_alloc_noncontiguous(dev)))
> > +   return NULL;
> > +   if (attrs & ~DMA_ATTR_ALLOC_SINGLE_PAGES) {
> > +   dev_warn(dev, "invalid flags (0x%lx) for %s\n",
> > +attrs, __func__);
> > +   return NULL;
> > +   }
> > +   return ops->alloc_noncontiguous(dev, size, dma_handle, gfp, attrs);
> > +}
> > +EXPORT_SYMBOL_GPL(dma_alloc_noncontiguous);
> > +
> > +void dma_free_noncontiguous(struct device *dev, size_t size,
> > +   struct page **pages, dma_addr_t dma_handle)
> > +{
> > +   const struct dma_map_ops *ops = get_dma_ops(dev);
> > +
> > +   if (WARN_ON_ONCE(!dma_can_alloc_noncontiguous(dev)))
> > +   return;
> > +   ops->free_noncontiguous(dev, size, pages, dma_handle);
> > +}
> > +EXPORT_SYMBOL_GPL(dma_free_noncontiguous);
> > +
> >  int dma_supported(struct device *dev, u64 mask)
> >  {
> > const struct dma_map_ops *ops = get_dma_ops(dev);
> > --
> > 2.28.0
> >



-- 
Ricardo Ribalda
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu