Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Gwenole Beauchesne
Hi,

2015-06-16 10:35 GMT+02:00 Stefano Sabatini stefa...@gmail.com:
 On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
 On Mon, 15 Jun 2015 17:55:35 +0200
 Stefano Sabatini stefa...@gmail.com wrote:

  On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
  [...]
   From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
   From: Stefano Sabatini stefa...@gmail.com
   Date: Mon, 15 Jun 2015 11:02:50 +0200
   Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
   optimizations
  
   Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
   fen...@videolan.org.
  
   TODO: bump minor, update APIchanges
   ---
libavutil/mem.c  |  9 +
libavutil/mem.h  | 14 
libavutil/mem_internal.h | 26 +++
libavutil/x86/Makefile   |  1 +
libavutil/x86/mem.c  | 85 
   
5 files changed, 135 insertions(+)
create mode 100644 libavutil/mem_internal.h
create mode 100644 libavutil/x86/mem.c
  
   diff --git a/libavutil/mem.c b/libavutil/mem.c
   index da291fb..0e1eb01 100644
   --- a/libavutil/mem.c
   +++ b/libavutil/mem.c
   @@ -42,6 +42,7 @@
#include "dynarray.h"
#include "intreadwrite.h"
#include "mem.h"
   +#include "mem_internal.h"
  
#ifdef MALLOC_PREFIX
  
   @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
   size_t min_size)
ff_fast_malloc(ptr, size, min_size, 0);
}
  
   +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
   +{
   +#if ARCH_X86
   +ff_memcpynt_x86(dst, src, size, cpu_flags);
   +#else
   +memcpy(dst, src, size);
   +#endif
   +}
 
  Alternatively, what about something like:
 
  av_memcpynt_fn av_memcpynt_get_fn(void);
 
  modeled after av_pixelutils_get_sad_fn()? This would skip the need for
  a wrapper calling the right function.


 I don't see much value in this, unless determining the right function
 causes too much overhead.

 I see two advantages: 1. no branch and no extra function call each time
 the function is called, 2. the cpu_flags need not be passed around, so
 it's somewhat safer.

Interesting approach. You could probably also use something similar to
an sws context, which you build up based on surface size and other
characteristics (flags)?

Regards,
-- 
Gwenole Beauchesne
Intel Corporation SAS / 2 rue de Paris, 92196 Meudon Cedex, France
Registration Number (RCS): Nanterre B 302 456 199
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Gwenole Beauchesne
Hi,

2015-06-16 14:03 GMT+02:00 Michael Niedermayer michae...@gmx.at:
 On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
 On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
  On Mon, 15 Jun 2015 17:55:35 +0200
  Stefano Sabatini stefa...@gmail.com wrote:
 
   On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
   [...]
From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
From: Stefano Sabatini stefa...@gmail.com
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
optimizations
   
Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent 
Aimar
fen...@videolan.org.
   
TODO: bump minor, update APIchanges
---
 libavutil/mem.c  |  9 +
 libavutil/mem.h  | 14 
 libavutil/mem_internal.h | 26 +++
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c  | 85 

 5 files changed, 135 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c
   
diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..0e1eb01 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"
   
 #ifdef MALLOC_PREFIX
   
@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int 
*size, size_t min_size)
 ff_fast_malloc(ptr, size, min_size, 0);
 }
   
+void av_memcpynt(void *dst, const void *src, size_t size, int 
cpu_flags)
+{
+#if ARCH_X86
+ff_memcpynt_x86(dst, src, size, cpu_flags);
+#else
+memcpy(dst, src, size);
+#endif
+}
  
   Alternatively, what about something like:
  
   av_memcpynt_fn av_memcpynt_get_fn(void);
  
   modeled after av_pixelutils_get_sad_fn()? This would skip the need for
   a wrapper calling the right function.
 

  I don't see much value in this, unless determining the right function
  causes too much overhead.

 I see two advantages: 1. no branch and no extra function call each time
 the function is called, 2. the cpu_flags need not be passed around, so
 it's somewhat safer.

 I have no strong preference though, updated (untested patch) in
 attachment.
 --
 FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle

  mem.c  |9 +
  mem.h  |   13 +++
  mem_internal.h |   26 +++
  x86/Makefile   |1
  x86/mem.c  |   98 
 +
  5 files changed, 147 insertions(+)
 f536b25834e0927b8cab5c996042aae697b8d773  
 0003-lavu-mem-add-av_memcpynt_get_fn.patch
 From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
 From: Stefano Sabatini stefa...@gmail.com
 Date: Mon, 15 Jun 2015 11:02:50 +0200
 Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()

 Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
 fen...@videolan.org.

 TODO: remove use of inline assembly, bump minor, update APIchanges
 ---
  libavutil/mem.c  |  9 +
  libavutil/mem.h  | 13 +++
  libavutil/mem_internal.h | 26 +
  libavutil/x86/Makefile   |  1 +
  libavutil/x86/mem.c  | 98 
 
  5 files changed, 147 insertions(+)
  create mode 100644 libavutil/mem_internal.h
  create mode 100644 libavutil/x86/mem.c

 diff --git a/libavutil/mem.c b/libavutil/mem.c
 index da291fb..325bfc9 100644
 --- a/libavutil/mem.c
 +++ b/libavutil/mem.c
 @@ -42,6 +42,7 @@
  #include "dynarray.h"
  #include "intreadwrite.h"
  #include "mem.h"
 +#include "mem_internal.h"

  #ifdef MALLOC_PREFIX

 @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
 size_t min_size)
  ff_fast_malloc(ptr, size, min_size, 0);
  }

 +av_memcpynt_fn av_memcpynt_get_fn(void)
 +{
 +#if ARCH_X86
 +return ff_memcpynt_get_fn_x86();
 +#else
 +return memcpy;
 +#endif
 +}
 diff --git a/libavutil/mem.h b/libavutil/mem.h
 index 2a1e36d..d9f1b7a 100644
 --- a/libavutil/mem.h
 +++ b/libavutil/mem.h
 @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, 
 size_t min_size);
   */
  void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);

 +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
 +
 +/**
 + * Return a possibly optimized function to copy size bytes from src
 + * to dst, using non-temporal copy.
 + *
 + * The returned function works like memcpy, but uses non-temporal
 + * instructions when available. This can lead to better performance
 + * when transferring data from source to destination is expensive, for
 + * example when reading from GPU memory.
 + */
 +av_memcpynt_fn av_memcpynt_get_fn(void);
 +
  /**
   * @}
   */
 diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Hendrik Leppkes
On Tue, Jun 16, 2015 at 2:30 PM, Stefano Sabatini stefa...@gmail.com wrote:
 On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
 Hi,

 2015-06-16 14:03 GMT+02:00 Michael Niedermayer michae...@gmx.at:
 [...]
  +#if HAVE_SSE2
  +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2
  + * instruction load and storing data with the SSE>=2 instruction store.
  + */
  +#define COPY16(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
  +
  +#define COPY64(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        load " 16(%[src]), %%xmm2\n"    \
  +        load " 32(%[src]), %%xmm3\n"    \
  +        load " 48(%[src]), %%xmm4\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        store " %%xmm2,   16(%[dst])\n" \
  +        store " %%xmm3,   32(%[dst])\n" \
  +        store " %%xmm4,   48(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
  +#endif
  +
  +#define COPY_LINE(dstp, srcp, size, load)                    \
  +    const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f;    \
  +    unsigned x = unaligned;                                  \
  +                                                             \
  +    av_assert0(((intptr_t)dstp & 0x0f) == 0);                \
  +                                                             \
  +    __asm__ volatile ("mfence");                             \
  +    if (!unaligned) {                                        \
  +        for (; x+63 < size; x += 64)                         \
  +            COPY64(&dstp[x], &srcp[x], load, "movdqa");      \
  +    } else {                                                 \
  +        COPY16(dst, src, "movdqu", "movdqa");                \
  +        for (; x+63 < size; x += 64)                         \
  +            COPY64(&dstp[x], &srcp[x], load, "movdqu");      \
 
  To use SSE registers in inline asm operands or in the clobber list you
  need to build with -msse (which is probably the default on x86-64).
 
  Files built with -msse will result in undefined behavior if anything
  in them is executed on a pre-SSE CPU, as -msse allows gcc to put
  SSE instructions directly in the code wherever it likes.
 
  The way out of this is not to tell gcc that it is passing a string
  with SSE code to the assembler, that is, not to use SSE registers in
  operands and not to put them on the clobber list unless gcc actually
  is in SSE mode and can use and needs them there.
  See XMM_CLOBBERS*.

 Well, from past experience, lying to gcc is generally not a good thing
 either. There are multiple interesting ways it could fail from time to
 time. :)

 Other approaches:
 - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
 "ssse3", "sse4.1", etc. This is the easiest way;
 - Split into several separate files per target. Though, one would then
 argue that while we are at it, why not just start moving to yasm.


 The former approach looks more appealing to me, considering there may
 be an effort to migrate to yasm afterwards.

 I plan to port this patch to yasm. I'll ask for help on IRC since
 probably it will take too much time otherwise without any guidance.
 --

If you accept a few restrictions (like requiring aligned and padded
input/output) and maybe give it a more specific name so that people
won't try to replace generic memcpy with it, yasm'ing it would be
pretty simple.
If you want it to be generic like the C version, supporting unaligned
and whatnot, the asm is going to get a bit more verbose..

I could probably whip up a basic implementation of the restricted
version, and the yasm experts can make suggestions on improvements
then.
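
To make the "aligned and padded" restriction concrete, here is a rough C
sketch of what such a restricted copy would promise. It is only an
illustration, not the yasm version: padded_size(), copy_aligned_padded()
and CHUNK are hypothetical names, and the fixed-size memcpy stands in for
one aligned 16-byte load/store pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16 /* hypothetical: bytes moved per load/store pair */

/* Round a byte count up to the next multiple of CHUNK. */
static size_t padded_size(size_t size)
{
    return (size + CHUNK - 1) & ~(size_t)(CHUNK - 1);
}

/* Restricted copy: both pointers must be CHUNK-aligned and both buffers
 * must hold at least padded_size(size) bytes, because whole chunks are
 * always copied, even past the requested size. */
static void copy_aligned_padded(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(((uintptr_t)dst & (CHUNK - 1)) == 0);
    assert(((uintptr_t)src & (CHUNK - 1)) == 0);

    for (size_t x = 0; x < size; x += CHUNK)
        memcpy(dst + x, src + x, CHUNK); /* stand-in for one SIMD copy */
}
```

With these restrictions there is no unaligned head and no byte-wise tail
to handle, which is what keeps the asm short. The generic variant would
need both.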

- Hendrik


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Michael Niedermayer
On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
 On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
  On Mon, 15 Jun 2015 17:55:35 +0200
  Stefano Sabatini stefa...@gmail.com wrote:
  
   On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
   [...]
From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
From: Stefano Sabatini stefa...@gmail.com
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
optimizations

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent 
Aimar
fen...@videolan.org.

TODO: bump minor, update APIchanges
---
 libavutil/mem.c  |  9 +
 libavutil/mem.h  | 14 
 libavutil/mem_internal.h | 26 +++
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c  | 85 

 5 files changed, 135 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..0e1eb01 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"
 
 #ifdef MALLOC_PREFIX
 
@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
size_t min_size)
 ff_fast_malloc(ptr, size, min_size, 0);
 }
 
+void av_memcpynt(void *dst, const void *src, size_t size, int 
cpu_flags)
+{
+#if ARCH_X86
+ff_memcpynt_x86(dst, src, size, cpu_flags);
+#else
+memcpy(dst, src, size);
+#endif
+}
   
   Alternatively, what about something like:
   
   av_memcpynt_fn av_memcpynt_get_fn(void);
   
   modeled after av_pixelutils_get_sad_fn()? This would skip the need for
   a wrapper calling the right function.
  
 
  I don't see much value in this, unless determining the right function
  causes too much overhead.
 
 I see two advantages: 1. no branch and no extra function call each time
 the function is called, 2. the cpu_flags need not be passed around, so
 it's somewhat safer.
 
 I have no strong preference though, updated (untested patch) in
 attachment.
 -- 
 FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle

  mem.c  |9 +
  mem.h  |   13 +++
  mem_internal.h |   26 +++
  x86/Makefile   |1 
  x86/mem.c  |   98 
 +
  5 files changed, 147 insertions(+)
 f536b25834e0927b8cab5c996042aae697b8d773  
 0003-lavu-mem-add-av_memcpynt_get_fn.patch
 From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
 From: Stefano Sabatini stefa...@gmail.com
 Date: Mon, 15 Jun 2015 11:02:50 +0200
 Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()
 
 Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
 fen...@videolan.org.
 
 TODO: remove use of inline assembly, bump minor, update APIchanges
 ---
  libavutil/mem.c  |  9 +
  libavutil/mem.h  | 13 +++
  libavutil/mem_internal.h | 26 +
  libavutil/x86/Makefile   |  1 +
  libavutil/x86/mem.c  | 98 
 
  5 files changed, 147 insertions(+)
  create mode 100644 libavutil/mem_internal.h
  create mode 100644 libavutil/x86/mem.c
 
 diff --git a/libavutil/mem.c b/libavutil/mem.c
 index da291fb..325bfc9 100644
 --- a/libavutil/mem.c
 +++ b/libavutil/mem.c
 @@ -42,6 +42,7 @@
  #include "dynarray.h"
  #include "intreadwrite.h"
  #include "mem.h"
 +#include "mem_internal.h"
  
  #ifdef MALLOC_PREFIX
  
 @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
 size_t min_size)
  ff_fast_malloc(ptr, size, min_size, 0);
  }
  
 +av_memcpynt_fn av_memcpynt_get_fn(void)
 +{
 +#if ARCH_X86
 +return ff_memcpynt_get_fn_x86();
 +#else
 +return memcpy;
 +#endif
 +}
 diff --git a/libavutil/mem.h b/libavutil/mem.h
 index 2a1e36d..d9f1b7a 100644
 --- a/libavutil/mem.h
 +++ b/libavutil/mem.h
 @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, 
 size_t min_size);
   */
  void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
  
 +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
 +
 +/**
 + * Return a possibly optimized function to copy size bytes from src
 + * to dst, using non-temporal copy.
 + *
 + * The returned function works like memcpy, but uses non-temporal
 + * instructions when available. This can lead to better performance
 + * when transferring data from source to destination is expensive, for
 + * example when reading from GPU memory.
 + */
 +av_memcpynt_fn av_memcpynt_get_fn(void);
 +
  /**
   * @}
   */
 diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
 new file mode 100644
 index 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread wm4
On Tue, 16 Jun 2015 14:16:11 +0200
Gwenole Beauchesne gb.de...@gmail.com wrote:

 Hi,
 
 2015-06-16 14:03 GMT+02:00 Michael Niedermayer michae...@gmx.at:
  On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
  On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
   On Mon, 15 Jun 2015 17:55:35 +0200
   Stefano Sabatini stefa...@gmail.com wrote:
  
On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
[...]
 From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 
 2001
 From: Stefano Sabatini stefa...@gmail.com
 Date: Mon, 15 Jun 2015 11:02:50 +0200
 Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
 optimizations

 Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent 
 Aimar
 fen...@videolan.org.

 TODO: bump minor, update APIchanges
 ---
  libavutil/mem.c  |  9 +
  libavutil/mem.h  | 14 
  libavutil/mem_internal.h | 26 +++
  libavutil/x86/Makefile   |  1 +
  libavutil/x86/mem.c  | 85 
 
  5 files changed, 135 insertions(+)
  create mode 100644 libavutil/mem_internal.h
  create mode 100644 libavutil/x86/mem.c

 diff --git a/libavutil/mem.c b/libavutil/mem.c
 index da291fb..0e1eb01 100644
 --- a/libavutil/mem.c
 +++ b/libavutil/mem.c
 @@ -42,6 +42,7 @@
  #include "dynarray.h"
  #include "intreadwrite.h"
  #include "mem.h"
 +#include "mem_internal.h"

  #ifdef MALLOC_PREFIX

 @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int 
 *size, size_t min_size)
  ff_fast_malloc(ptr, size, min_size, 0);
  }

 +void av_memcpynt(void *dst, const void *src, size_t size, int 
 cpu_flags)
 +{
 +#if ARCH_X86
 +ff_memcpynt_x86(dst, src, size, cpu_flags);
 +#else
 +memcpy(dst, src, size);
 +#endif
 +}
   
Alternatively, what about something like:
   
av_memcpynt_fn av_memcpynt_get_fn(void);
   
modeled after av_pixelutils_get_sad_fn()? This would skip the need for
a wrapper calling the right function.
  
 
   I don't see much value in this, unless determining the right function
   causes too much overhead.
 
  I see two advantages: 1. no branch and no extra function call each time
  the function is called, 2. the cpu_flags need not be passed around, so
  it's somewhat safer.
 
  I have no strong preference though, updated (untested patch) in
  attachment.
  --
  FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle
 
   mem.c  |9 +
   mem.h  |   13 +++
   mem_internal.h |   26 +++
   x86/Makefile   |1
   x86/mem.c  |   98 
  +
   5 files changed, 147 insertions(+)
  f536b25834e0927b8cab5c996042aae697b8d773  
  0003-lavu-mem-add-av_memcpynt_get_fn.patch
  From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
  From: Stefano Sabatini stefa...@gmail.com
  Date: Mon, 15 Jun 2015 11:02:50 +0200
  Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()
 
  Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
  fen...@videolan.org.
 
  TODO: remove use of inline assembly, bump minor, update APIchanges
  ---
   libavutil/mem.c  |  9 +
   libavutil/mem.h  | 13 +++
   libavutil/mem_internal.h | 26 +
   libavutil/x86/Makefile   |  1 +
   libavutil/x86/mem.c  | 98 
  
   5 files changed, 147 insertions(+)
   create mode 100644 libavutil/mem_internal.h
   create mode 100644 libavutil/x86/mem.c
 
  diff --git a/libavutil/mem.c b/libavutil/mem.c
  index da291fb..325bfc9 100644
  --- a/libavutil/mem.c
  +++ b/libavutil/mem.c
  @@ -42,6 +42,7 @@
   #include "dynarray.h"
   #include "intreadwrite.h"
   #include "mem.h"
  +#include "mem_internal.h"
 
   #ifdef MALLOC_PREFIX
 
  @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
  size_t min_size)
   ff_fast_malloc(ptr, size, min_size, 0);
   }
 
  +av_memcpynt_fn av_memcpynt_get_fn(void)
  +{
  +#if ARCH_X86
  +return ff_memcpynt_get_fn_x86();
  +#else
  +return memcpy;
  +#endif
  +}
  diff --git a/libavutil/mem.h b/libavutil/mem.h
  index 2a1e36d..d9f1b7a 100644
  --- a/libavutil/mem.h
  +++ b/libavutil/mem.h
  @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, 
  size_t min_size);
*/
   void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
 
  +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
  +
  +/**
  + * Return a possibly optimized function to copy size bytes from src
  + * to dst, using non-temporal copy.
  + *
  + * The returned function works like memcpy, but uses non-temporal
  + * instructions when available. This can lead to better performance
  + * when 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Stefano Sabatini
On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
 Hi,
 
 2015-06-16 14:03 GMT+02:00 Michael Niedermayer michae...@gmx.at:
[...]
  +#if HAVE_SSE2
  +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2
  + * instruction load and storing data with the SSE>=2 instruction store.
  + */
  +#define COPY16(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
  +
  +#define COPY64(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        load " 16(%[src]), %%xmm2\n"    \
  +        load " 32(%[src]), %%xmm3\n"    \
  +        load " 48(%[src]), %%xmm4\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        store " %%xmm2,   16(%[dst])\n" \
  +        store " %%xmm3,   32(%[dst])\n" \
  +        store " %%xmm4,   48(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
  +#endif
  +
  +#define COPY_LINE(dstp, srcp, size, load)                    \
  +    const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f;    \
  +    unsigned x = unaligned;                                  \
  +                                                             \
  +    av_assert0(((intptr_t)dstp & 0x0f) == 0);                \
  +                                                             \
  +    __asm__ volatile ("mfence");                             \
  +    if (!unaligned) {                                        \
  +        for (; x+63 < size; x += 64)                         \
  +            COPY64(&dstp[x], &srcp[x], load, "movdqa");      \
  +    } else {                                                 \
  +        COPY16(dst, src, "movdqu", "movdqa");                \
  +        for (; x+63 < size; x += 64)                         \
  +            COPY64(&dstp[x], &srcp[x], load, "movdqu");      \
 
  To use SSE registers in inline asm operands or in the clobber list you
  need to build with -msse (which is probably the default on x86-64).
 
  Files built with -msse will result in undefined behavior if anything
  in them is executed on a pre-SSE CPU, as -msse allows gcc to put
  SSE instructions directly in the code wherever it likes.
 
  The way out of this is not to tell gcc that it is passing a string
  with SSE code to the assembler, that is, not to use SSE registers in
  operands and not to put them on the clobber list unless gcc actually
  is in SSE mode and can use and needs them there.
  See XMM_CLOBBERS*.
 
 Well, from past experience, lying to gcc is generally not a good thing
 either. There are multiple interesting ways it could fail from time to
 time. :)
 
 Other approaches:
 - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
 "ssse3", "sse4.1", etc. This is the easiest way;
 - Split into several separate files per target. Though, one would then
 argue that while we are at it, why not just start moving to yasm.
 
 

 The former approach looks more appealing to me, considering there may
 be an effort to migrate to yasm afterwards.

I plan to port this patch to yasm. I'll ask for help on IRC since
probably it will take too much time otherwise without any guidance.
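
For reference, the per-function target approach mentioned above could look
roughly like the sketch below. It is an assumption-laden illustration and
not part of the patch: copy16_sse2() and copy_dispatch() are hypothetical
names, it uses intrinsics rather than the inline asm under discussion, and
it assumes a compiler where <emmintrin.h> can be included without -msse2
(GCC >= 4.9 or clang). On other targets it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_SSE2_TARGET 1

/* Only this function is compiled with SSE2 enabled; the rest of the
 * file keeps the default instruction set. */
__attribute__((target("sse2")))
static void copy16_sse2(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t x = 0;
    for (; x + 16 <= size; x += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + x));
        _mm_storeu_si128((__m128i *)(dst + x), v);
    }
    memcpy(dst + x, src + x, size - x); /* remaining tail bytes */
}
#else
#define HAVE_SSE2_TARGET 0
#endif

/* Runtime dispatch: the SSE2 path is only entered if the CPU has SSE2. */
static void copy_dispatch(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(dst != NULL && src != NULL);
#if HAVE_SSE2_TARGET
    if (__builtin_cpu_supports("sse2")) {
        copy16_sse2(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

The point of the attribute is that the compiler may only emit SSE2
instructions inside copy16_sse2(), so nothing else in the file can end up
running them on a CPU that lacks SSE2.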
-- 
FFmpeg = Friendly and Fancy Mind-dumbing Pacific Easy Generator


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Stefano Sabatini
On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
 On Mon, 15 Jun 2015 17:55:35 +0200
 Stefano Sabatini stefa...@gmail.com wrote:
 
  On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
  [...]
   From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
   From: Stefano Sabatini stefa...@gmail.com
   Date: Mon, 15 Jun 2015 11:02:50 +0200
   Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
   optimizations
   
   Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
   fen...@videolan.org.
   
   TODO: bump minor, update APIchanges
   ---
libavutil/mem.c  |  9 +
libavutil/mem.h  | 14 
libavutil/mem_internal.h | 26 +++
libavutil/x86/Makefile   |  1 +
libavutil/x86/mem.c  | 85 
   
5 files changed, 135 insertions(+)
create mode 100644 libavutil/mem_internal.h
create mode 100644 libavutil/x86/mem.c
   
   diff --git a/libavutil/mem.c b/libavutil/mem.c
   index da291fb..0e1eb01 100644
   --- a/libavutil/mem.c
   +++ b/libavutil/mem.c
   @@ -42,6 +42,7 @@
#include "dynarray.h"
#include "intreadwrite.h"
#include "mem.h"
   +#include "mem_internal.h"

#ifdef MALLOC_PREFIX

   @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
   size_t min_size)
ff_fast_malloc(ptr, size, min_size, 0);
}

   +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
   +{
   +#if ARCH_X86
   +ff_memcpynt_x86(dst, src, size, cpu_flags);
   +#else
   +memcpy(dst, src, size);
   +#endif
   +}
  
  Alternatively, what about something like:
  
  av_memcpynt_fn av_memcpynt_get_fn(void);
  
  modeled after av_pixelutils_get_sad_fn()? This would skip the need for
  a wrapper calling the right function.
 

 I don't see much value in this, unless determining the right function
 causes too much overhead.

I see two advantages: 1. no branch and no extra function call each time
the function is called, 2. the cpu_flags need not be passed around, so
it's somewhat safer.

I have no strong preference though, updated (untested patch) in
attachment.
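
As a rough illustration of the two dispatch styles being compared, here is
a minimal self-contained sketch. It is not the code from the attached
patch: copy_fn, fast_copy(), copy_with_flags(), get_copy_fn() and
FLAG_FAST are hypothetical names, and the "optimized" routine is only a
stand-in that forwards to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
#define FLAG_FAST 1

typedef void *(*copy_fn)(void *dst, const void *src, size_t size);

/* Placeholder for an optimized routine; here it just calls memcpy. */
static void *fast_copy(void *dst, const void *src, size_t size)
{
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, size);
}

/* Style 1: a wrapper that takes the flags and branches on every call. */
static void *copy_with_flags(void *dst, const void *src, size_t size,
                             int flags)
{
    if (flags & FLAG_FAST)
        return fast_copy(dst, src, size);
    return memcpy(dst, src, size);
}

/* Style 2: choose the implementation once, then call it directly. */
static copy_fn get_copy_fn(int flags)
{
    return (flags & FLAG_FAST) ? fast_copy : memcpy;
}
```

With the second style the caller stores the returned pointer once and no
longer has to carry the flags to every call site.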
-- 
FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle
From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
From: Stefano Sabatini stefa...@gmail.com
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
fen...@videolan.org.

TODO: remove use of inline assembly, bump minor, update APIchanges
---
 libavutil/mem.c  |  9 +
 libavutil/mem.h  | 13 +++
 libavutil/mem_internal.h | 26 +
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c  | 98 
 5 files changed, 147 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..325bfc9 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"
 
 #ifdef MALLOC_PREFIX
 
@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
 ff_fast_malloc(ptr, size, min_size, 0);
 }
 
+av_memcpynt_fn av_memcpynt_get_fn(void)
+{
+#if ARCH_X86
+return ff_memcpynt_get_fn_x86();
+#else
+return memcpy;
+#endif
+}
diff --git a/libavutil/mem.h b/libavutil/mem.h
index 2a1e36d..d9f1b7a 100644
--- a/libavutil/mem.h
+++ b/libavutil/mem.h
@@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
  */
 void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
 
+typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
+
+/**
+ * Return a possibly optimized function to copy size bytes from src
+ * to dst, using non-temporal copy.
+ *
+ * The returned function works like memcpy, but uses non-temporal
+ * instructions when available. This can lead to better performance
+ * when transferring data from source to destination is expensive, for
+ * example when reading from GPU memory.
+ */
+av_memcpynt_fn av_memcpynt_get_fn(void);
+
 /**
  * @}
  */
diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
new file mode 100644
index 000..de61cba
--- /dev/null
+++ b/libavutil/mem_internal.h
@@ -0,0 +1,26 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread wm4
On Mon, 15 Jun 2015 17:55:35 +0200
Stefano Sabatini stefa...@gmail.com wrote:

 On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
 [...]
  From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
  From: Stefano Sabatini stefa...@gmail.com
  Date: Mon, 15 Jun 2015 11:02:50 +0200
  Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
  
  Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
  fen...@videolan.org.
  
  TODO: bump minor, update APIchanges
  ---
   libavutil/mem.c  |  9 +
   libavutil/mem.h  | 14 
   libavutil/mem_internal.h | 26 +++
   libavutil/x86/Makefile   |  1 +
   libavutil/x86/mem.c  | 85 
  
   5 files changed, 135 insertions(+)
   create mode 100644 libavutil/mem_internal.h
   create mode 100644 libavutil/x86/mem.c
  
  diff --git a/libavutil/mem.c b/libavutil/mem.c
  index da291fb..0e1eb01 100644
  --- a/libavutil/mem.c
  +++ b/libavutil/mem.c
  @@ -42,6 +42,7 @@
   #include "dynarray.h"
   #include "intreadwrite.h"
   #include "mem.h"
  +#include "mem_internal.h"
   
   #ifdef MALLOC_PREFIX
   
  @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
  size_t min_size)
   ff_fast_malloc(ptr, size, min_size, 0);
   }
   
  +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
  +{
  +#if ARCH_X86
  +ff_memcpynt_x86(dst, src, size, cpu_flags);
  +#else
  +memcpy(dst, src, size);
  +#endif
  +}
 
 Alternatively, what about something like:
 
 av_memcpynt_fn av_memcpynt_get_fn(void);
 
 modeled after av_pixelutils_get_sad_fn()? This would skip the need for
 a wrapper calling the right function.

I don't see much value in this, unless determining the right function
causes too much overhead.


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-15 Thread Stefano Sabatini
On date Saturday 2015-06-13 14:20:07 +0200, Hendrik Leppkes encoded:
 On Thu, Jun 11, 2015 at 8:54 PM, wm4 nfx...@googlemail.com wrote:
  On Thu, 11 Jun 2015 17:24:45 +0200
  Stefano Sabatini stefa...@gmail.com wrote:
 
  Next step would be the use of YASM, but I only want to test if the
  general approach is fine (and if the API is not too specific). Also if
  someone wants to step up and port it to YASM I'm all for it, since
  ASM/YASM is far from being my area of expertise.
 
  Personally, I'd probably just
  1. export the GPU memcpy function, and
  2. export a function to copy AVFrames using this function
 
 I concur. A basic optimized memcpy with specific constraints (ie.
 requires aligned input/output, always copies in 16-byte chunks, so
 in/out buffers need to be padded appropriately), to keep the required
 ASM code simple.
 These constraints are generally always fulfilled if you have a GPU
 frame on the input, since they will have appropriate strides (and if
 in question, we control allocation of the GPU surfaces as well), and
 we control the output memory buffer anyway.
 
 On top of that a convenience function that deals with pixel formats,
 strides, planes, and whatnot, and then uses this function.
 A generic C version of the basic copy function shouldn't be needed, we
 could just use memcpy for that.. or a tiny wrapper that calls memcpy,
 anyway.

This is my first attempt. The added function is named av_memcpynt(); it
uses inline assembly, which should be replaced by yasm once I or
someone else figures out how to do it.

An av_image_copynt_plane() function can be built on top of that (but
in this case it would be better to inline the av_memcpynt() function).

BTW I dropped the requirement of 16-byte alignment on the size
variable which is required by the VLC code but which looks unnecessary
to me.
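
To sketch what a plane copy built on top of a memcpy-like function could
look like: the code below is only an illustration under my own
assumptions, not part of the attached patch. copy_plane() and
line_copy_fn are hypothetical names, strides are taken to be
non-negative byte counts, and any function with memcpy's signature can be
passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: any function with memcpy's signature. */
typedef void *(*line_copy_fn)(void *dst, const void *src, size_t size);

/* Copy `height` lines of `bytewidth` bytes each. Strides are in bytes
 * and may be larger than bytewidth (padding at the end of each line). */
static void copy_plane(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride,
                       size_t bytewidth, int height, line_copy_fn copy)
{
    assert(dst_stride >= 0 && (size_t)dst_stride >= bytewidth);
    assert(src_stride >= 0 && (size_t)src_stride >= bytewidth);

    for (int y = 0; y < height; y++)
        copy(dst + (size_t)y * (size_t)dst_stride,
             src + (size_t)y * (size_t)src_stride, bytewidth);
}
```

Passing plain memcpy gives the generic behaviour. Passing an optimized
routine would additionally require that each line start satisfies that
routine's alignment rules.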
-- 
FFmpeg = Furious and Foolish Marvellous Pacific Egregious Ghost
From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
From: Stefano Sabatini stefa...@gmail.com
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
fen...@videolan.org.

TODO: bump minor, update APIchanges
---
 libavutil/mem.c  |  9 +
 libavutil/mem.h  | 14 
 libavutil/mem_internal.h | 26 +++
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c  | 85 
 5 files changed, 135 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..0e1eb01 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"
 
 #ifdef MALLOC_PREFIX
 
@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
 ff_fast_malloc(ptr, size, min_size, 0);
 }
 
+void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
+{
+#if ARCH_X86
+    ff_memcpynt_x86(dst, src, size, cpu_flags);
+#else
+    memcpy(dst, src, size);
+#endif
+}
diff --git a/libavutil/mem.h b/libavutil/mem.h
index 2a1e36d..bbad313 100644
--- a/libavutil/mem.h
+++ b/libavutil/mem.h
@@ -383,6 +383,20 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
 void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
 
 /**
+ * Copy size bytes from src to dst, using non-temporal copy
+ * functions when available.
+ *
+ * This function works like memcpy, but uses non-temporal instructions
+ * when available. This can lead to better performance when
+ * transferring data from source to destination is expensive, for
+ * example when reading from GPU memory.
+ *
+ * @param dst destination memory pointer, must be aligned to 16 bytes
+ * @param cpu_flags as returned by av_get_cpu_flags()
+ */
+void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags);
+
+/**
  * @}
  */
 
diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
new file mode 100644
index 000..371be31
--- /dev/null
+++ b/libavutil/mem_internal.h
@@ -0,0 +1,26 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-15 Thread Stefano Sabatini
On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
[...]
 From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
 From: Stefano Sabatini stefa...@gmail.com
 Date: Mon, 15 Jun 2015 11:02:50 +0200
 Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
 
 Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
 fen...@videolan.org.
 
 TODO: bump minor, update APIchanges
 ---
  libavutil/mem.c  |  9 +
  libavutil/mem.h  | 14 
  libavutil/mem_internal.h | 26 +++
  libavutil/x86/Makefile   |  1 +
  libavutil/x86/mem.c  | 85 
 
  5 files changed, 135 insertions(+)
  create mode 100644 libavutil/mem_internal.h
  create mode 100644 libavutil/x86/mem.c
 
 diff --git a/libavutil/mem.c b/libavutil/mem.c
 index da291fb..0e1eb01 100644
 --- a/libavutil/mem.c
 +++ b/libavutil/mem.c
 @@ -42,6 +42,7 @@
  #include "dynarray.h"
  #include "intreadwrite.h"
  #include "mem.h"
 +#include "mem_internal.h"
  
  #ifdef MALLOC_PREFIX
  
 @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
  ff_fast_malloc(ptr, size, min_size, 0);
  }
  
 +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
 +{
 +#if ARCH_X86
 +    ff_memcpynt_x86(dst, src, size, cpu_flags);
 +#else
 +    memcpy(dst, src, size);
 +#endif
 +}

Alternatively, what about something like:

av_memcpynt_fn av_memcpynt_get_fn(void);

modeled after av_pixelutils_get_sad_fn()? This would skip the need for
a wrapper calling the right function.
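
A minimal sketch of that getter idea, with hypothetical names (memcpynt_fn, memcpynt_c, memcpynt_get_fn) that are not part of the patch. It only contains the portable fallback; a real getter would check the CPU flags once and return an optimized routine when one is available.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function-pointer type for the copy routine. */
typedef void (*memcpynt_fn)(void *dst, const void *src, size_t size);

/* Portable fallback: plain memcpy behind the common signature. */
static void memcpynt_c(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Hypothetical getter. The caller picks the routine once and then calls
 * it directly, so there is no per-call dispatch inside a wrapper. */
memcpynt_fn memcpynt_get_fn(void)
{
    return memcpynt_c;
}
```

The caller would store the returned pointer at init time and use it for every subsequent copy.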
-- 
FFmpeg = Frightening and Fantastic Murdering Portentous Erratic Guru
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-13 Thread Hendrik Leppkes
On Thu, Jun 11, 2015 at 8:54 PM, wm4 nfx...@googlemail.com wrote:
 On Thu, 11 Jun 2015 17:24:45 +0200
 Stefano Sabatini stefa...@gmail.com wrote:

 Next step would be the use of YASM, but I only want to test if the
 general approach is fine (and if the API is not too specific). Also if
 someone wants to step up and port it to YASM I'm all for it, since
 ASM/YASM is far from being my area of expertise.

 Personally, I'd probably just
 1. export the GPU memcpy function, and
 2. export a function to copy AVFrames using this function

I concur. A basic optimized memcpy with specific constraints (ie.
requires aligned input/output, always copies in 16-byte chunks, so
in/out buffers need to be padded appropriately), to keep the required
ASM code simple.
These constraints are generally always fulfilled if you have a GPU
frame on the input, since they will have appropriate strides (and if
in question, we control allocation of the GPU surfaces as well), and
we control the output memory buffer anyway.

On top of that a convenience function that deals with pixel formats,
strides, planes, and whatnot, and then uses this function.
A generic C version of the basic copy function shouldn't be needed, we
could just use memcpy for that.. or a tiny wrapper that calls memcpy,
anyway.
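
A rough sketch of the layering described above, under stated assumptions: copy_chunks16() and copy_plane16() are hypothetical names, memcpy stands in for the SIMD copy, and both linesizes are assumed to be at least bytewidth rounded up to 16 so that the padded copy stays inside each row.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical basic copy: size must be a multiple of 16.
 * A real version would use SIMD; memcpy stands in for it here. */
static void copy_chunks16(uint8_t *dst, const uint8_t *src, size_t size)
{
    for (size_t i = 0; i < size; i += 16)
        memcpy(dst + i, src + i, 16);
}

/* Hypothetical convenience wrapper: copies one plane row by row.
 * Both linesizes must be >= bytewidth rounded up to 16, so that the
 * padded chunk copy never runs past the end of a row. */
void copy_plane16(uint8_t *dst, size_t dst_linesize,
                  const uint8_t *src, size_t src_linesize,
                  size_t bytewidth, size_t height)
{
    size_t padded = (bytewidth + 15) & ~(size_t)15;

    for (size_t y = 0; y < height; y++)
        copy_chunks16(dst + y * dst_linesize, src + y * src_linesize, padded);
}
```

The basic function stays trivial because all stride and padding handling lives in the wrapper.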

- Hendrik


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-11 Thread Stefano Sabatini
On date Friday 2015-05-29 09:47:58 -0700, Timothy Gu encoded:
 On Fri, May 29, 2015 at 03:49:22PM +0200, Stefano Sabatini wrote:
[...]
   OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o  \
  diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c
  new file mode 100644
  index 000..8b3ed0f
  --- /dev/null
  +++ b/libavutil/x86/imgutils.c
  @@ -0,0 +1,95 @@
  +/*
  + * This file is part of FFmpeg.
  + *
  + * FFmpeg is free software; you can redistribute it and/or
  + * modify it under the terms of the GNU Lesser General Public
  + * License as published by the Free Software Foundation; either
  + * version 2.1 of the License, or (at your option) any later version.
  + *
  + * FFmpeg is distributed in the hope that it will be useful,
  + * but WITHOUT ANY WARRANTY; without even the implied warranty of
  + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  + * Lesser General Public License for more details.
  + *
  + * You should have received a copy of the GNU Lesser General Public
  + * License along with FFmpeg; if not, write to the Free Software
  + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  + */
  +
  +#include <inttypes.h>
  +#include "config.h"
  +#include "libavutil/avassert.h"
  +#include "libavutil/imgutils.h"
  +#include "libavutil/imgutils_internal.h"
  +
  +#if HAVE_SSE2
  +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
  + * load and storing data with the SSE>=2 instruction store.
  + */
  +#define COPY16(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
  +
  +#define COPY64(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        load " 16(%[src]), %%xmm2\n"    \
  +        load " 32(%[src]), %%xmm3\n"    \
  +        load " 48(%[src]), %%xmm4\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        store " %%xmm2,   16(%[dst])\n" \
  +        store " %%xmm3,   32(%[dst])\n" \
  +        store " %%xmm4,   48(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
  +#endif
  +
  +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize,
  +  const uint8_t *src, size_t src_linesize,
  +  unsigned bytewidth, unsigned height,
  +  int cpu_flags)
  +{
  +#if !HAVE_SSSE3
 

 Are any SSSE3 instructions used?

No. I re-checked: MOVDQA/MOVDQU were introduced in SSE2, MOVNTDQA in SSE4.1.

  +    return av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, height);
  +#endif
  +
  +    av_assert0(((intptr_t)dst & 0x0f) == 0 && (dst_linesize & 0x0f) == 0);
  +
  +    __asm__ volatile ("mfence");
  +
  +    for (unsigned y = 0; y < height; y++) {
  +        const unsigned unaligned = (-(uintptr_t)src) & 0x0f;
  +        unsigned x = unaligned;
  +
 
  +#if HAVE_SSE42
  +        if (cpu_flags & AV_CPU_FLAG_SSE4) {
 
 movntdqa is an SSE4.1 instruction, so this should work better:
 
 if (INLINE_SSE4(cpu_flags))
 
 That checks both HAVE_SSE4_INLINE and cpu_flags for AV_CPU_FLAG_SSE4.
 
 (But then like others have said new inline asm code shouldn't be added in the
 first place)

Next step would be the use of YASM, but I only want to test if the
general approach is fine (and if the API is not too specific). Also if
someone wants to step up and port it to YASM I'm all for it, since
ASM/YASM is far from being my area of expertise.
-- 
FFmpeg = Fiendish Fabulous Most Pure Evangelical God
From ec96aee1930247248a5e438171c120ea3f5dbbea Mon Sep 17 00:00:00 2001
From: Stefano Sabatini stefa...@gmail.com
Date: Fri, 15 May 2015 18:58:17 +0200
Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function.

This function adds support for optimized GPU to CPU copy.

Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
fen...@videolan.org.

TODO: fix integration with the build system, update APIchanges and bump
minor once ready
---
 libavutil/imgutils.c  |  13 +
 libavutil/imgutils.h  |  18 ++
 libavutil/imgutils_internal.h |  29 ++
 libavutil/x86/Makefile|   1 +
 libavutil/x86/imgutils.c  | 126 ++
 5 files changed, 187 insertions(+)
 create mode 100644 libavutil/imgutils_internal.h
 create mode 100644 libavutil/x86/imgutils.c

diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
index ef0e671..59a0054 100644
--- a/libavutil/imgutils.c
+++ b/libavutil/imgutils.c
@@ -30,6 +30,7 @@
 #include "mathematics.h"
 #include "pixdesc.h"
 #include "rational.h"
+#include "imgutils_internal.h"
 
 void av_image_fill_max_pixsteps(int max_pixsteps[4], int max_pixstep_comps[4],
 const AVPixFmtDescriptor 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-11 Thread wm4
On Thu, 11 Jun 2015 17:24:45 +0200
Stefano Sabatini stefa...@gmail.com wrote:

 Next step would be the use of YASM, but I only want to test if the
 general approach is fine (and if the API is not too specific). Also if
 someone wants to step up and port it to YASM I'm all for it, since
 ASM/YASM is far from being my area of expertise.

Personally, I'd probably just
1. export the GPU memcpy function, and
2. export a function to copy AVFrames using this function


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-29 Thread Stefano Sabatini
On date Thursday 2015-05-28 18:02:34 -0300, James Almer encoded:
 On 28/05/15 2:39 PM, Stefano Sabatini wrote:
  From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001
  From: Stefano Sabatini stefa...@gmail.com
  Date: Fri, 15 May 2015 18:58:17 +0200
  Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() 
  function.
  
  This function adds support for optimized GPU to CPU copy.
  
  Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
  fen...@videolan.org.
  
  TODO: fix integration with the build system, bump micro
  
  Signed-off-by: Stefano Sabatini stefa...@gmail.com
  ---
   libavutil/imgutils.c  |  14 ++
   libavutil/imgutils.h  |  18 +++
   libavutil/imgutils_internal.h |  29 +++
   libavutil/x86/Makefile|   1 +
   libavutil/x86/imgutils.c  | 109 
  ++
   5 files changed, 171 insertions(+)
   create mode 100644 libavutil/imgutils_internal.h
   create mode 100644 libavutil/x86/imgutils.c
  
  diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
  index ef0e671..e538c75 100644
  --- a/libavutil/imgutils.c
  +++ b/libavutil/imgutils.c
  @@ -30,6 +30,7 @@
   #include "mathematics.h"
   #include "pixdesc.h"
   #include "rational.h"
  +#include "imgutils_internal.h"
   
   void av_image_fill_max_pixsteps(int max_pixsteps[4], int 
  max_pixstep_comps[4],
   const AVPixFmtDescriptor *pixdesc)
  @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
   
   return size;
   }
  +
  +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
  +  const uint8_t *src, size_t src_linesize,
  +  unsigned bytewidth, unsigned height,
  +  unsigned cpu_flags)
  +{
  +#ifndef HAVE_SSSE3
 
 All HAVE_ are always defined to either 0 or 1.

Fixed.
 
 Nonetheless, this kind of check does not belong outside of arch folders. You
 should check for ARCH_X86 to call functions in the x86/ folder. See lavc/lavfi
 for examples.

I see, but I think this use case is pretty different. We don't have a
context where to set a function pointer, and I don't want to add a new
context and API for such things (but I'm open to suggestions). A
probably slightly ugly alternative could be to define a function such
as:
get_ff_image_copy_plane_from_uswc_fn()

returning a pointer to the correct function.

[...]
  diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c
  new file mode 100644
  index 000..91c7a42
  --- /dev/null
  +++ b/libavutil/x86/imgutils.c
  @@ -0,0 +1,109 @@
  +/*
  + * This file is part of FFmpeg.
  + *
  + * FFmpeg is free software; you can redistribute it and/or
  + * modify it under the terms of the GNU Lesser General Public
  + * License as published by the Free Software Foundation; either
  + * version 2.1 of the License, or (at your option) any later version.
  + *
  + * FFmpeg is distributed in the hope that it will be useful,
  + * but WITHOUT ANY WARRANTY; without even the implied warranty of
  + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  + * Lesser General Public License for more details.
  + *
  + * You should have received a copy of the GNU Lesser General Public
  + * License along with FFmpeg; if not, write to the Free Software
  + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  + */
  +
  +#include <inttypes.h>
  +#include "config.h"
  +#include "libavutil/attributes.h"
  +#include "libavutil/avassert.h"
  +#include "libavutil/intreadwrite.h"
  +#include "libavutil/x86/asm.h"
  +#include "libavutil/x86/cpu.h"
  +#include "libavutil/cpu.h"
  +#include "libavutil/pixdesc.h"
  +
  +#include "libavutil/avassert.h"
  +#include "libavutil/x86/asm.h"
  +#include "libavutil/imgutils.h"
  +#include "libavutil/imgutils_internal.h"
  +
  +#ifdef HAVE_SSE2
  +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
  + * load and storing data with the SSE>=2 instruction store.
  + */
  +#define COPY16(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
  +
  +#define COPY64(dstp, srcp, load, store) \
  +    __asm__ volatile (                  \
  +        load "  0(%[src]), %%xmm1\n"    \
  +        load " 16(%[src]), %%xmm2\n"    \
  +        load " 32(%[src]), %%xmm3\n"    \
  +        load " 48(%[src]), %%xmm4\n"    \
  +        store " %%xmm1,    0(%[dst])\n" \
  +        store " %%xmm2,   16(%[dst])\n" \
  +        store " %%xmm3,   32(%[dst])\n" \
  +        store " %%xmm4,   48(%[dst])\n" \
  +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
  +#endif
 

 As already mentioned, this should be done in nasm/yasm syntax.
 Also, any reason you're not using more xmm registers to reduce the amount of 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-29 Thread Timothy Gu
On Fri, May 29, 2015 at 03:49:22PM +0200, Stefano Sabatini wrote:
 @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
  
  return size;
  }
 +
 +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
 +const uint8_t *src, size_t src_linesize,
 +unsigned bytewidth, unsigned height,
 +int cpu_flags)
 +{
 +#if !HAVE_SSSE3

 +av_unused(cpu_flags);

av_unused has a different definition than VLC_UNUSED. Just use a (void) cast.

 +    av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, height);
 +#else
 +    ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, bytewidth, height, cpu_flags);
 +#endif
 +}
 diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h
 index 23282a3..184e1e7 100644
 --- a/libavutil/imgutils.h
 +++ b/libavutil/imgutils.h
 @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t   *dst, int 
 dst_linesize,
   int bytewidth, int height);
  
  /**
 + * Copy image plane from src to dst, similar to av_image_copy_plane().
 + * src must be an USWC buffer.
 + * It performs optimized copy from Uncacheable Speculative Write
 + * Combining memory as used by some video surface.
 + * It is really efficient only when SSE4.1 is available.
 + *
 + * In case the target CPU does not support USWC caching this function
 + * will be equivalent to av_image_copy_plane().
 + *
 + * @param cpu_flags as returned by av_get_cpu_flags()
 + * @see av_image_copy_plane()
 + */
 +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
 +   const uint8_t *src, size_t src_linesize,
 +   unsigned bytewidth, unsigned height,
 +   int cpu_flags);
 +
 +/**
   * Copy image in src_data to dst_data.
   *
   * @param dst_linesizes linesizes for the image in dst_data
 diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h
 new file mode 100644
 index 000..9576afe
 --- /dev/null
 +++ b/libavutil/imgutils_internal.h
 @@ -0,0 +1,29 @@
 +/*
 + * This file is part of FFmpeg.
 + *
 + * FFmpeg is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU Lesser General Public
 + * License as published by the Free Software Foundation; either
 + * version 2.1 of the License, or (at your option) any later version.
 + *
 + * FFmpeg is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * Lesser General Public License for more details.
 + *
 + * You should have received a copy of the GNU Lesser General Public
 + * License along with FFmpeg; if not, write to the Free Software
 + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
 USA
 + */
 +
 +#ifndef AVUTIL_IMGUTILS_INTERNAL_H
 +#define AVUTIL_IMGUTILS_INTERNAL_H
 +
 +#include "imgutils.h"
 +
 +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize,
 +const uint8_t *src, size_t src_linesize,
 +unsigned bytewidth, unsigned height,
 +int cpu_flags);
 +
 +#endif /* AVUTIL_IMGUTILS_INTERNAL_H */
 diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile
 index eb70a62..a719c00 100644
 --- a/libavutil/x86/Makefile
 +++ b/libavutil/x86/Makefile
 @@ -1,5 +1,6 @@
  OBJS += x86/cpu.o   \
  x86/float_dsp_init.o\
 +x86/imgutils.o  \
  x86/lls_init.o  \
  
  OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o  \
 diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c
 new file mode 100644
 index 000..8b3ed0f
 --- /dev/null
 +++ b/libavutil/x86/imgutils.c
 @@ -0,0 +1,95 @@
 +/*
 + * This file is part of FFmpeg.
 + *
 + * FFmpeg is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU Lesser General Public
 + * License as published by the Free Software Foundation; either
 + * version 2.1 of the License, or (at your option) any later version.
 + *
 + * FFmpeg is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * Lesser General Public License for more details.
 + *
 + * You should have received a copy of the GNU Lesser General Public
 + * License along with FFmpeg; if not, write to the Free Software
 + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
 USA
 + */
 +
 +#include <inttypes.h>
 +#include "config.h"
 +#include 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-28 Thread James Almer
On 28/05/15 2:39 PM, Stefano Sabatini wrote:
 From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001
 From: Stefano Sabatini stefa...@gmail.com
 Date: Fri, 15 May 2015 18:58:17 +0200
 Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function.
 
 This function adds support for optimized GPU to CPU copy.
 
 Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
 fen...@videolan.org.
 
 TODO: fix integration with the build system, bump micro
 
 Signed-off-by: Stefano Sabatini stefa...@gmail.com
 ---
  libavutil/imgutils.c  |  14 ++
  libavutil/imgutils.h  |  18 +++
  libavutil/imgutils_internal.h |  29 +++
  libavutil/x86/Makefile|   1 +
  libavutil/x86/imgutils.c  | 109 
 ++
  5 files changed, 171 insertions(+)
  create mode 100644 libavutil/imgutils_internal.h
  create mode 100644 libavutil/x86/imgutils.c
 
 diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
 index ef0e671..e538c75 100644
 --- a/libavutil/imgutils.c
 +++ b/libavutil/imgutils.c
 @@ -30,6 +30,7 @@
  #include "mathematics.h"
  #include "pixdesc.h"
  #include "rational.h"
 +#include "imgutils_internal.h"
  
  void av_image_fill_max_pixsteps(int max_pixsteps[4], int 
 max_pixstep_comps[4],
  const AVPixFmtDescriptor *pixdesc)
 @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
  
  return size;
  }
 +
 +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
 +const uint8_t *src, size_t src_linesize,
 +unsigned bytewidth, unsigned height,
 +unsigned cpu_flags)
 +{
 +#ifndef HAVE_SSSE3

All HAVE_ are always defined to either 0 or 1.

Nonetheless, this kind of check does not belong outside of arch folders. You
should check for ARCH_X86 to call functions in the x86/ folder. See lavc/lavfi
for examples.
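
To make the 0-or-1 point concrete, here is a small standalone sketch. The #define stands in for what a generated config.h provides, and the two function names are hypothetical; they exist only to show which preprocessor branch gets compiled.

```c
/* Stand-in for a generated config.h: the macro is always defined,
 * with value 0 when the feature is absent. */
#define HAVE_SSSE3 0

/* Returns 1 if the fallback branch was compiled, 0 otherwise. */
int ifndef_takes_fallback(void)
{
#ifndef HAVE_SSSE3
    return 1;   /* never compiled: the macro is defined, even though it is 0 */
#else
    return 0;
#endif
}

/* Same question, asked with #if on the macro's value. */
int if_not_takes_fallback(void)
{
#if !HAVE_SSSE3
    return 1;   /* compiled: the macro evaluates to 0 */
#else
    return 0;
#endif
}
```

With #ifndef the fallback branch is never selected, which is why the check has to test the value rather than the definedness of the macro.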

 +av_unused(cpu_flags);
 +av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, 
 height);
 +#else
 +ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, 
 bytewidth, height, cpu_flags);
 +#endif
 +}
 diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h
 index 23282a3..82c3826 100644
 --- a/libavutil/imgutils.h
 +++ b/libavutil/imgutils.h
 @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t   *dst, int 
 dst_linesize,
   int bytewidth, int height);
  
  /**
 + * Copy image plane from src to dst, similar to av_image_copy_plane().
 + * src must be an USWC buffer.
 + * It performs optimized copy from Uncacheable Speculative Write
 + * Combining memory as used by some video surface.
 + * It is really efficient only when SSE4.1 is available.
 + *
 + * In case the target CPU does not support USWC caching this function
 + * will be equivalent to av_image_copy_plane().
 + *
 + * @param cpu_flags as returned by av_get_cpu_flags()
 + * @see av_image_copy_plane()
 + */
 +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
 +   const uint8_t *src, size_t src_linesize,
 +   unsigned bytewidth, unsigned height,
 +   unsigned cpu_flags);
 +
 +/**
   * Copy image in src_data to dst_data.
   *
   * @param dst_linesizes linesizes for the image in dst_data
 diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h
 new file mode 100644
 index 000..16ed977
 --- /dev/null
 +++ b/libavutil/imgutils_internal.h
 @@ -0,0 +1,29 @@
 +/*
 + * This file is part of FFmpeg.
 + *
 + * FFmpeg is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU Lesser General Public
 + * License as published by the Free Software Foundation; either
 + * version 2.1 of the License, or (at your option) any later version.
 + *
 + * FFmpeg is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * Lesser General Public License for more details.
 + *
 + * You should have received a copy of the GNU Lesser General Public
 + * License along with FFmpeg; if not, write to the Free Software
 + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
 USA
 + */
 +
 +#ifndef AVUTIL_IMGUTILS_INTERNAL_H
 +#define AVUTIL_IMGUTILS_INTERNAL_H
 +
 +#include "imgutils.h"
 +
 +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize,
 +const uint8_t *src, size_t src_linesize,
 +unsigned bytewidth, unsigned height,
 +unsigned cpu_flags);
 +
 +#endif /* AVUTIL_IMGUTILS_INTERNAL_H */
 diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile
 index eb70a62..a719c00 100644
 --- a/libavutil/x86/Makefile
 +++ 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-28 Thread Stefano Sabatini
On date Monday 2015-05-18 13:26:56 +0200, Stefano Sabatini encoded:
 On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes h.lepp...@gmail.com
 wrote:
 
  On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini stefa...@gmail.com
  wrote:
 
 [...]
 
  
   I have a first hackish patch, performed some tests and I got some
   significant performance gains, on my iCore5 with Intel Graphics HD4000 I
   have now the same performance as the software decoder using DXVA2 for
   decoding a H.264 1920x1080 video, but using only a single thread. The
  patch
   as is is a hack, since I had to modify the compilation flags to enable
   assembly compilation in the ffmpeg_dxva2.c file. I should probably create
   an optimized copy function in libavutil, comments are welcome.
 
  FWIW, I never saw any benefits from using a small cache over simply
  copying directly to the destination memory, that could potentially
  simplify this a bit.
 
 
 
  And yeah, its a huge hack, we don't want new inline assembly.
 
 
 The sanest approach is probably to add a function to libavutil. The
 optimized copy would then be accessible to third-party library users, with
 no assembly hacks involved.

New patch attached. It's still somewhat hackish; please advise whether you
consider this approach acceptable.
-- 
FFmpeg = Formidable and Friendly MultiPurpose Explosive Game
From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001
From: Stefano Sabatini stefa...@gmail.com
Date: Fri, 15 May 2015 18:58:17 +0200
Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function.

This function adds support for optimized GPU to CPU copy.

Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
fen...@videolan.org.

TODO: fix integration with the build system, bump micro

Signed-off-by: Stefano Sabatini stefa...@gmail.com
---
 libavutil/imgutils.c  |  14 ++
 libavutil/imgutils.h  |  18 +++
 libavutil/imgutils_internal.h |  29 +++
 libavutil/x86/Makefile|   1 +
 libavutil/x86/imgutils.c  | 109 ++
 5 files changed, 171 insertions(+)
 create mode 100644 libavutil/imgutils_internal.h
 create mode 100644 libavutil/x86/imgutils.c

diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
index ef0e671..e538c75 100644
--- a/libavutil/imgutils.c
+++ b/libavutil/imgutils.c
@@ -30,6 +30,7 @@
 #include "mathematics.h"
 #include "pixdesc.h"
 #include "rational.h"
+#include "imgutils_internal.h"
 
 void av_image_fill_max_pixsteps(int max_pixsteps[4], int max_pixstep_comps[4],
 const AVPixFmtDescriptor *pixdesc)
@@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
 
 return size;
 }
+
+void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
+   const uint8_t *src, size_t src_linesize,
+   unsigned bytewidth, unsigned height,
+   unsigned cpu_flags)
+{
+#ifndef HAVE_SSSE3
+av_unused(cpu_flags);
+av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, height);
+#else
+ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, bytewidth, height, cpu_flags);
+#endif
+}
diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h
index 23282a3..82c3826 100644
--- a/libavutil/imgutils.h
+++ b/libavutil/imgutils.h
@@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t   *dst, int dst_linesize,
  int bytewidth, int height);
 
 /**
+ * Copy image plane from src to dst, similar to av_image_copy_plane().
+ * src must be an USWC buffer.
+ * It performs optimized copy from Uncacheable Speculative Write
+ * Combining memory as used by some video surface.
+ * It is really efficient only when SSE4.1 is available.
+ *
+ * In case the target CPU does not support USWC caching this function
+ * will be equivalent to av_image_copy_plane().
+ *
+ * @param cpu_flags as returned by av_get_cpu_flags()
+ * @see av_image_copy_plane()
+ */
+void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
+   const uint8_t *src, size_t src_linesize,
+   unsigned bytewidth, unsigned height,
+   unsigned cpu_flags);
+
+/**
  * Copy image in src_data to dst_data.
  *
  * @param dst_linesizes linesizes for the image in dst_data
diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h
new file mode 100644
index 000..16ed977
--- /dev/null
+++ b/libavutil/imgutils_internal.h
@@ -0,0 +1,29 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS 

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-28 Thread Hendrik Leppkes
On Thu, May 28, 2015 at 7:39 PM, Stefano Sabatini stefa...@gmail.com wrote:
 On date Monday 2015-05-18 13:26:56 +0200, Stefano Sabatini encoded:
 On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes h.lepp...@gmail.com
 wrote:

  On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini stefa...@gmail.com
  wrote:
 
 [...]

  
   I have a first hackish patch, performed some tests and I got some
   significant performance gains, on my iCore5 with Intel Graphics HD4000 I
   have now the same performance as the software decoder using DXVA2 for
   decoding a H.264 1920x1080 video, but using only a single thread. The
  patch
   as is is a hack, since I had to modify the compilation flags to enable
   assembly compilation in the ffmpeg_dxva2.c file. I should probably create
   an optimized copy function in libavutil, comments are welcome.
 
  FWIW, I never saw any benefits from using a small cache over simply
  copying directly to the destination memory, that could potentially
  simplify this a bit.
 


  And yeah, its a huge hack, we don't want new inline assembly.
 

 The sanest approach is probably to add a function to libavutil. The
 optimized copy would then be accessible to third-party library users, with
 no assembly hacks involved.

 New patch attached, it's still somehow hackish, please advice if you
 consider this approach acceptable.


The general concept is fine, but it should not use inline asm, and
someone will want to argue about the name and placement etc... :)


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Reimar Döffinger


On 18.05.2015, at 12:37, Stefano Sabatini stefa...@gmail.com wrote:

 On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini stefa...@gmail.com
 wrote:
 
 On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
 On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 [...]
 One limitation is as the manual said, it needs to be copied from the
 GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
 copy function for this, it uses plain old memcpy.
 Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
 is optimized for copying from USWC memory (Uncacheable Speculative
 Write Combining) to system memory. Using this may help speed up the
 process significantly, and VLC probably uses it.
 
 Now the question is, how would be possible to optimize GPU to CPU copy
 to get an overall performance gain? At least VLC seems able to get
 better performances when using HW decoding, but I'm not sure it is
 copying decoded data back to the CPU (indeed it may perform direct
 rendering).
 
 Self-reply:
 commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
 Author: Laurent Aimar fen...@videolan.org
 Date:   Tue Nov 17 01:09:43 2009 +0100
 
Improved performance when copying video surface in dxva2.
 
 That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
 instructions are available.
 
 
 I have a first hackish patch, performed some tests and I got some
 significant performance gains, on my iCore5 with Intel Graphics HD4000 I
 have now the same performance as the software decoder using DXVA2 for
 decoding a H.264 1920x1080 video, but using only a single thread. The patch
 as is is a hack, since I had to modify the compilation flags to enable
 assembly compilation in the ffmpeg_dxva2.c file. I should probably create
 an optimized copy function in libavutil, comments are welcome.

What exactly is SSE4 needed for?
Both non-temporal movs and prefetches existed before it, so if that is critical 
for performance the fallback implementation is bad.
However possibly more important: why is a memcpy needed at all?


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Hendrik Leppkes
On Mon, May 18, 2015 at 9:41 PM, Reimar Döffinger
reimar.doeffin...@gmx.de wrote:


 On 18.05.2015, at 12:37, Stefano Sabatini stefa...@gmail.com wrote:

 On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini stefa...@gmail.com
 wrote:

 On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
 On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 [...]
 One limitation is as the manual said, it needs to be copied from the
 GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
 copy function for this, it uses plain old memcpy.
 Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
 is optimized for copying from USWC memory (Uncacheable Speculative
 Write Combining) to system memory. Using this may help speed up the
 process significantly, and VLC probably uses it.

 Now the question is, how would be possible to optimize GPU to CPU copy
 to get an overall performance gain? At least VLC seems able to get
 better performances when using HW decoding, but I'm not sure it is
 copying decoded data back to the CPU (indeed it may perform direct
 rendering).

 Self-reply:
 commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
 Author: Laurent Aimar fen...@videolan.org
 Date:   Tue Nov 17 01:09:43 2009 +0100

Improved performance when copying video surface in dxva2.

 That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
 instructions are available.


 I have a first hackish patch, performed some tests and I got some
 significant performance gains, on my iCore5 with Intel Graphics HD4000 I
 have now the same performance as the software decoder using DXVA2 for
 decoding a H.264 1920x1080 video, but using only a single thread. The patch
 as is is a hack, since I had to modify the compilation flags to enable
 assembly compilation in the ffmpeg_dxva2.c file. I should probably create
 an optimized copy function in libavutil, comments are welcome.

 What exactly is SSE4 needed for?

MOVNTDQA; it's specifically designed for just this task.

 Both non-temporal movs and prefetches existed before it, so if that is 
 critical for performance the fallback implementation is bad.

An SSE2 implementation may or may not be faster than plain memcpy; that
depends on memcpy. In my tests on Windows, an SSE2 implementation was
usually not worth it.

 However possibly more important: why is a memcpy needed at all?

For any further processing, you need the frame data. And trying to use
the frame data directly from the locked surfaces for e.g. an encoder is
very inefficient (possibly random access pattern), so it needs to be
copied into normal memory first.
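
To make the copy step concrete, here is a minimal sketch in C of copying
one plane out of a locked surface into tightly packed system memory. The
helper name copy_plane and its parameters are hypothetical and are not
taken from ffmpeg_dxva2.c. It assumes the locked surface exposes a row
pitch that may be larger than the visible width, so the copy proceeds
row by row with purely sequential reads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `height` rows of `width` bytes each from a
 * source plane with row stride `src_pitch` to a destination plane with
 * row stride `dst_pitch`. Both pitches must be >= width. */
static void copy_plane(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_pitch,
                       size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        /* One sequential pass per row; padding bytes past `width`
         * in the source row are skipped. */
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width);
    }
}
```

After such a copy the consumer (for instance an encoder) can access the
frame in any order from ordinary cached memory.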

- Hendrik


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Stefano Sabatini
On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini stefa...@gmail.com
wrote:

 On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
  On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 [...]
   One limitation is as the manual said, it needs to be copied from the
   GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
   copy function for this, it uses plain old memcpy.
   Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
   is optimized for copying from USWC memory (Uncacheable Speculative
   Write Combining) to system memory. Using this may help speed up the
   process significantly, and VLC probably uses it.
 
  Now the question is, how would be possible to optimize GPU to CPU copy
  to get an overall performance gain? At least VLC seems able to get
  better performances when using HW decoding, but I'm not sure it is
  copying decoded data back to the CPU (indeed it may perform direct
  rendering).

 Self-reply:
 commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
 Author: Laurent Aimar fen...@videolan.org
 Date:   Tue Nov 17 01:09:43 2009 +0100

 Improved performance when copying video surface in dxva2.

 That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
 instructions are available.


I have a first hackish patch. I performed some tests and got some
significant performance gains: on my Core i5 with Intel HD Graphics 4000,
DXVA2 now matches the performance of the software decoder when decoding
an H.264 1920x1080 video, while using only a single thread. The patch
as is is a hack, since I had to modify the compilation flags to enable
assembly compilation in the ffmpeg_dxva2.c file. I should probably create
an optimized copy function in libavutil; comments are welcome.

The IDirect3D9_CreateDevice(... GetShellWindow ...) -> ... GetDesktopWindow
change is required to make it compile under MinGW (with MinGW64 it is
probably not required; I still have to switch to MinGW64, but allowing MinGW
compilation is still worthwhile).


0001-ffmpeg_dxva.c-add-support-to-optimized-GPU-to-CPU-co.patch
Description: Binary data


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Hendrik Leppkes
On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini stefa...@gmail.com wrote:
 On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini stefa...@gmail.com
 wrote:

 On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
  On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 [...]
   One limitation is as the manual said, it needs to be copied from the
   GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
   copy function for this, it uses plain old memcpy.
   Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
   is optimized for copying from USWC memory (Uncacheable Speculative
   Write Combining) to system memory. Using this may help speed up the
   process significantly, and VLC probably uses it.
 
  Now the question is, how would be possible to optimize GPU to CPU copy
  to get an overall performance gain? At least VLC seems able to get
  better performances when using HW decoding, but I'm not sure it is
  copying decoded data back to the CPU (indeed it may perform direct
  rendering).

 Self-reply:
 commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
 Author: Laurent Aimar fen...@videolan.org
 Date:   Tue Nov 17 01:09:43 2009 +0100

 Improved performance when copying video surface in dxva2.

 That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
 instructions are available.


 I have a first hackish patch, performed some tests and I got some
 significant performance gains, on my iCore5 with Intel Graphics HD4000 I
 have now the same performance as the software decoder using DXVA2 for
 decoding a H.264 1920x1080 video, but using only a single thread. The patch
 as is is a hack, since I had to modify the compilation flags to enable
 assembly compilation in the ffmpeg_dxva2.c file. I should probably create
 an optimized copy function in libavutil, comments are welcome.

FWIW, I never saw any benefit from using a small cache over simply
copying directly to the destination memory; dropping it could potentially
simplify this a bit.
And yeah, it's a huge hack; we don't want new inline assembly.

- Hendrik


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Stefano Sabatini
On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes h.lepp...@gmail.com
wrote:

 On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini stefa...@gmail.com
 wrote:

[...]

 
  I have a first hackish patch, performed some tests and I got some
  significant performance gains, on my iCore5 with Intel Graphics HD4000 I
  have now the same performance as the software decoder using DXVA2 for
  decoding a H.264 1920x1080 video, but using only a single thread. The patch
  as is is a hack, since I had to modify the compilation flags to enable
  assembly compilation in the ffmpeg_dxva2.c file. I should probably create
  an optimized copy function in libavutil, comments are welcome.

 FWIW, I never saw any benefits from using a small cache over simply
 copying directly to the destination memory, that could potentially
 simplify this a bit.



 And yeah, its a huge hack, we don't want new inline assembly.


The sanest approach is probably to add a function to libavutil. The
optimized copy would then be accessible to third-party library users, with
no assembly hacks involved.
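
As a rough illustration of what such a library function could look like,
here is a minimal sketch in C that selects a copy routine once, based on
CPU feature flags, and hands back a function pointer. All names here
(copy_fn, get_copy_fn, HYP_CPU_FLAG_SSE4, copy_sse4_stub) are
hypothetical and are not FFmpeg API; the "optimized" routine is only a
stub standing in for code that would live in external assembly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU feature bit; FFmpeg's real flags live in libavutil/cpu.h. */
#define HYP_CPU_FLAG_SSE4 0x1

typedef void (*copy_fn)(void *dst, const void *src, size_t size);

/* Portable fallback: plain memcpy. */
static void copy_c(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Placeholder for an optimized routine that would be written in
 * external assembly (not inline asm); here it just forwards to memcpy. */
static void copy_sse4_stub(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Pick the implementation once, so the per-copy call needs no
 * feature check. */
static copy_fn get_copy_fn(int cpu_flags)
{
    return (cpu_flags & HYP_CPU_FLAG_SSE4) ? copy_sse4_stub : copy_c;
}
```

A caller would query the function once at initialization and reuse the
returned pointer for every frame.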


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread Stefano Sabatini
On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
 On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
[...]
  One limitation is as the manual said, it needs to be copied from the
  GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
  copy function for this, it uses plain old memcpy.
  Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
  is optimized for copying from USWC memory (Uncacheable Speculative
  Write Combining) to system memory. Using this may help speed up the
  process significantly, and VLC probably uses it.
 
 Now the question is, how would be possible to optimize GPU to CPU copy
 to get an overall performance gain? At least VLC seems able to get
 better performances when using HW decoding, but I'm not sure it is
 copying decoded data back to the CPU (indeed it may perform direct
 rendering).

Self-reply:
commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
Author: Laurent Aimar fen...@videolan.org
Date:   Tue Nov 17 01:09:43 2009 +0100

Improved performance when copying video surface in dxva2.

That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
instructions are available.
-- 
FFmpeg = Fundamental  Frightening Mean Peaceful EniGma


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread wm4
On Thu, 14 May 2015 14:52:29 +0200
Stefano Sabatini stefa...@gmail.com wrote:

 On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
  On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 [...]
   One limitation is as the manual said, it needs to be copied from the
   GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
   copy function for this, it uses plain old memcpy.
   Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
   is optimized for copying from USWC memory (Uncacheable Speculative
   Write Combining) to system memory. Using this may help speed up the
   process significantly, and VLC probably uses it.
  
  Now the question is, how would be possible to optimize GPU to CPU copy
  to get an overall performance gain? At least VLC seems able to get
  better performances when using HW decoding, but I'm not sure it is
  copying decoded data back to the CPU (indeed it may perform direct
  rendering).
 
 Self-reply:
 commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
 Author: Laurent Aimar fen...@videolan.org
 Date:   Tue Nov 17 01:09:43 2009 +0100
 
 Improved performance when copying video surface in dxva2.
 
 That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
 instructions are available.

Here's what lavfilters appears to use:

http://git.1f0.de/gitweb?p=lavfsplitter.git;a=blob;f=common/DSUtilLite/gpu_memcpy_sse4.h


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread Hendrik Leppkes
On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini stefa...@gmail.com wrote:
 On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
 On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 [...]
  One limitation is as the manual said, it needs to be copied from the
  GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
  copy function for this, it uses plain old memcpy.
  Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
  is optimized for copying from USWC memory (Uncacheable Speculative
  Write Combining) to system memory. Using this may help speed up the
  process significantly, and VLC probably uses it.

 Now the question is, how would be possible to optimize GPU to CPU copy
 to get an overall performance gain? At least VLC seems able to get
 better performances when using HW decoding, but I'm not sure it is
 copying decoded data back to the CPU (indeed it may perform direct
 rendering).

 Self-reply:
 commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
 Author: Laurent Aimar fen...@videolan.org
 Date:   Tue Nov 17 01:09:43 2009 +0100

 Improved performance when copying video surface in dxva2.

 That is, VLC is using optimized GPU-CPU copy when the relevant SSE2
 instructions are available.

Actually the real proper instructions are in SSE4.1; using SSE2 would
give only a small advantage over memcpy.

- Hendrik


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread Stefano Sabatini
On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
 On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini stefa...@gmail.com wrote:
[...]
  There are some cases when DXVA2 (or in general HW decoding) can be
  used effectively in ffmpeg? Can you tell if there is something which
  could be improved in the current ffmpeg_dxva2.c implementation? (My
  guess is that this code is somehow based on the VLC code).
 
 Its not based on the VLC code, its roughly based on code from my own
 project that uses ffmpeg for DXVA2, but really, the workflow is going
 to be pretty similar in any implementation either way, since the MS
 API dictates that, more or less.
 
 DXVA2 decoding can be faster then software decoding, depending on your 
 hardware.
 
 If you used a low-end Intel CPU, say a Pentium or i3 (Ivy or Haswell),
 or use a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2 decoding on
 the GPU can potentially give you ~400 fps for 1080p, while the CPU
 will likely not manage that.
 On a high-end CPU, the software decoder can potentially exceed that, however.
 
 One limitation is as the manual said, it needs to be copied from the
 GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
 copy function for this, it uses plain old memcpy.
 Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
 is optimized for copying from USWC memory (Uncacheable Speculative
 Write Combining) to system memory. Using this may help speed up the
 process significantly, and VLC probably uses it.

Now the question is, how would it be possible to optimize the GPU to CPU
copy to get an overall performance gain? At least VLC seems able to get
better performance when using HW decoding, but I'm not sure it is
copying decoded data back to the CPU (indeed it may perform direct
rendering).
 
 The original primary goal of this code was however to be able to test
 and debug the hwaccels much easier, and not directly to provide a
 playback/transcoding feature, so such optimizations were not performed
 for brevity.
[...]

Thanks.
-- 
FFmpeg = Fanciful  Faithless Merciless Powerful EntanGlement


[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-12 Thread Stefano Sabatini
Hi guys,

I'm playing with DXVA2 hardware decoding on Windows, and these are my
findings.

DXVA2 decoding was enabled in avconv/ffmpeg through the commit:

commit 35177ba77ff60a8b8839783f57e44bcc4214507a
Author: Hendrik Leppkes h.lepp...@gmail.com
Date:   Tue Apr 22 15:22:53 2014 +0200

avconv: add support for DXVA2 decoding

Signed-off-by: Anton Khirnov an...@khirnov.net

DXVA2 decoding is enabled when a dxva2api.h header is found in the
path. From my understanding the header is provided by VLC:
http://download.videolan.org/pub/contrib/dxva2api.h

(I suppose the header was created in order to make compilation work
with MinGW). When compiling with MinGW from mingw.org I had to change
the GetShellWindow call in the line:

hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
                             D3DCREATE_SOFTWARE_VERTEXPROCESSING | D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
                             &d3dpp, &ctx->d3d9device);

to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix
suggested here:
http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

Then I performed some tests with the command:
ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -

The -threads 1 option seems required or ffmpeg will fail with decoding
errors.

In the ffmpeg(1) manual I can read this big warning:
 Note that most acceleration methods are intended for playback and
 will not be faster than software decoding on modern
 CPUs. Additionally, ffmpeg will usually need to copy the decoded
 frames from the GPU memory into the system memory, resulting in
 further performance loss. This option is thus mainly useful for
 testing.

I tested with several HW combinations, and I consistently find that pure
software decoding is several times faster than DXVA2
decoding. In some cases I got invalid output (same with VLC) which may
be related to a problem in the graphics card or driver (a VIA VX900).

On the other hand when testing with VLC I noticed better performance
(in general, significantly reduced CPU usage, usually by a factor
of about 3), so I have to conclude that at least VLC is able to make
good use of DXVA2 hardware acceleration.

I'm aware that the need to copy GPU data back to the CPU memory as
required by ffmpeg defeats the advantage (if any) of hardware
decoding, especially given that multithreaded decoding cannot be
used with DXVA2.

My questions are:

Are there cases where DXVA2 (or in general HW decoding) can be
used effectively in ffmpeg? Can you tell if there is something which
could be improved in the current ffmpeg_dxva2.c implementation? (My
guess is that this code is somehow based on the VLC code).

Would it make sense to integrate DXVA2 decoding in ffplay.c, assuming
it would be worth the effort, at least for testing/didactic purposes?

Related resources:
https://trac.ffmpeg.org/ticket/604
https://ffmpeg.org/pipermail/ffmpeg-user/2012-May/006600.html
http://forum.doom9.org/showthread.php?t=170793

TIA for any comments.
-- 
FFmpeg = Fostering and Fantastic Maxi Picky Erudite God


Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-12 Thread Hendrik Leppkes
On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini stefa...@gmail.com wrote:
 Hi guys,

 I'm playing with DXVA2 hardware decoding on Windows, and these are my
 findings.

 DXVA2 decoding was enabled in avconv/ffmpeg through the commit:

 commit 35177ba77ff60a8b8839783f57e44bcc4214507a
 Author: Hendrik Leppkes h.lepp...@gmail.com
 Date:   Tue Apr 22 15:22:53 2014 +0200

 avconv: add support for DXVA2 decoding

 Signed-off-by: Anton Khirnov an...@khirnov.net

 DXVA2 decoding is enabled when a dxva2api.h header is found in the
 path. From my understanding the header is provided by VLC:
 http://download.videolan.org/pub/contrib/dxva2api.h

 (I suppose the header was created in order to make compilation work
 with MinGW). When compiling with MinGW from mingw.org I had to change
 the GetShellWindow call in the line:

 hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
                              D3DCREATE_SOFTWARE_VERTEXPROCESSING | D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
                              &d3dpp, &ctx->d3d9device);

 to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix
 suggested here:
 http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

You should use mingw-w64, it provides both a dxva2api.h and can
compile the code without any modifications.
Using the original mingw32 is not recommended, and barely supported.


 Then I performed some tests with the command:
 ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -

 The -threads 1 option seems required or ffmpeg will fail with decoding
 errors.

Indeed, multi-threading with hwaccel is not something that should be
used, as it will break, although the API allows it for BS reasons.
There wouldn't be a performance improvement either way.


 In the ffmpeg(1) manual I can read this big warning:
  Note that most acceleration methods are intended for playback and
  will not be faster than software decoding on modern
  CPUs. Additionally, ffmpeg will usually need to copy the decoded
  frames from the GPU memory into the system memory, resulting in
  further performance loss. This option is thus mainly useful for
  testing.

 I tested with several HW combinations, and I always find that pure
 software decoding is always several time faster than DXVA2
 decoding. In some cases I got invalid output (same with VLC) which may
 be related to a problem in the graphics card or driver (a VIA VX900).

I don't think I've ever tested on such a chip. I didn't even know VIA
still made PC hardware.
Therefore, I have no idea how fast/slow or compatible it is.


 On the other hand when testing with VLC I noticed better performances
 (in general, a significantly reduced usage of the CPU, usually of an
 order of 3), so I have to conclude that at least VLC is able to make
 good use of DXVA2 hardware acceleration.

 I'm aware that the need to copy GPU data back to the CPU memory as
 required by ffmpeg defeats the advantage (if any) of hardware
 decoding, especially given that multithreading decoding cannot be
 adopted with DXVA2.

 My questions are:

 There are some cases when DXVA2 (or in general HW decoding) can be
 used effectively in ffmpeg? Can you tell if there is something which
 could be improved in the current ffmpeg_dxva2.c implementation? (My
 guess is that this code is somehow based on the VLC code).

It's not based on the VLC code; it's roughly based on code from my own
project that uses ffmpeg for DXVA2, but really, the workflow is going
to be pretty similar in any implementation either way, since the MS
API dictates that, more or less.

DXVA2 decoding can be faster than software decoding, depending on your hardware.

If you used a low-end Intel CPU, say a Pentium or i3 (Ivy or Haswell),
or use a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2 decoding on
the GPU can potentially give you ~400 fps for 1080p, while the CPU
will likely not manage that.
On a high-end CPU, the software decoder can potentially exceed that, however.
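
To put the ~400 fps figure in perspective, here is a small worked
calculation in C of how much data would have to be copied back per
second. It assumes the decoder outputs NV12 (12 bits per pixel) and
ignores any surface padding or height alignment; both are assumptions
for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

/* Bytes in one NV12 frame: a full-resolution luma plane plus a
 * half-height plane of interleaved chroma (12 bits per pixel). */
static uint64_t nv12_frame_bytes(uint32_t width, uint32_t height)
{
    return (uint64_t)width * height * 3 / 2;
}

/* Bytes per second that must cross from GPU to system memory. */
static uint64_t copy_bandwidth(uint32_t width, uint32_t height, uint32_t fps)
{
    return nv12_frame_bytes(width, height) * fps;
}
```

Under those assumptions a 1920x1080 frame is 3,110,400 bytes, so 400 fps
means roughly 1.24 GB/s of readback, which is why the copy routine
matters.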

One limitation is, as the manual said, that the decoded frames need to be
copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an
optimized copy function for this; it uses plain old memcpy.
Intel introduced a new instruction for this in SSE4.1, MOVNTDQA, which
is optimized for copying from USWC memory (Uncacheable Speculative
Write Combining) to system memory. Using this may help speed up the
process significantly, and VLC probably uses it.
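
For illustration, here is a minimal sketch in C of a copy loop built on
the _mm_stream_load_si128 intrinsic, which compiles to MOVNTDQA. The
function names copy_from_uswc and copy_streaming_sse41 are hypothetical;
the sketch assumes GCC or Clang (target attribute and
__builtin_cpu_supports) and 16-byte aligned buffers, and it falls back
to memcpy otherwise. It is not the code used by VLC or any other
project. On ordinary cached memory the streaming load behaves like a
normal load, so this only shows the shape of the loop, not a speedup.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Copy `size` bytes in 64-byte blocks using streaming loads.
 * Assumes src and dst are 16-byte aligned; the tail (< 64 bytes)
 * is handled by memcpy. */
__attribute__((target("sse4.1")))
static void copy_streaming_sse41(void *dst, const void *src, size_t size)
{
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;  /* some headers declare a non-const parameter */
    size_t blocks = size / 64;

    for (size_t i = 0; i < blocks; i++) {
        /* Four MOVNTDQA loads read one 64-byte line. */
        __m128i x0 = _mm_stream_load_si128(s + 0);
        __m128i x1 = _mm_stream_load_si128(s + 1);
        __m128i x2 = _mm_stream_load_si128(s + 2);
        __m128i x3 = _mm_stream_load_si128(s + 3);
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
        s += 4;
        d += 4;
    }
    memcpy(d, s, size % 64);
}
#endif

/* Use the streaming path when it is available and the pointers are
 * suitably aligned; otherwise use plain memcpy. */
static void copy_from_uswc(void *dst, const void *src, size_t size)
{
#if defined(__x86_64__) || defined(__i386__)
    int aligned = (((uintptr_t)dst | (uintptr_t)src) & 15) == 0;
    if (aligned && __builtin_cpu_supports("sse4.1")) {
        copy_streaming_sse41(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

Whether this beats memcpy depends on the memory type of the source and
on the platform's memcpy, so it would need to be measured on real
locked surfaces.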

The original primary goal of this code was however to be able to test
and debug the hwaccels much more easily, and not directly to provide a
playback/transcoding feature, so such optimizations were left out
for brevity.

- Hendrik