Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Hendrik Leppkes

On Tue, Jun 16, 2015 at 2:30 PM, Stefano Sabatini  wrote:
> On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
>> Hi,
>>
>> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer :
> [...]
>> >> +#if HAVE_SSE2
>> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 
>> >> instruction
>> >> + * load and storing data with the SSE>=2 instruction store.
>> >> + */
>> >> +#define COPY16(dstp, srcp, load, store) \
>> >> +__asm__ volatile (  \
>> >> +load "  0(%[src]), %%xmm1\n"\
>> >> +store " %%xmm1,0(%[dst])\n" \
>> >> +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
>> >> +
>> >> +#define COPY64(dstp, srcp, load, store) \
>> >> +__asm__ volatile (  \
>> >> +load "  0(%[src]), %%xmm1\n"\
>> >> +load " 16(%[src]), %%xmm2\n"\
>> >> +load " 32(%[src]), %%xmm3\n"\
>> >> +load " 48(%[src]), %%xmm4\n"\
>> >> +store " %%xmm1,0(%[dst])\n" \
>> >> +store " %%xmm2,   16(%[dst])\n" \
>> >> +store " %%xmm3,   32(%[dst])\n" \
>> >> +store " %%xmm4,   48(%[dst])\n" \
>> >> +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", 
>> >> "xmm3", "xmm4")
>> >> +#endif
>> >> +
>> >> +#define COPY_LINE(dstp, srcp, size, load)   \
>> >> +const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f;   \
>> >> +unsigned x = unaligned; \
>> >> +\
>> >> +av_assert0(((intptr_t)dstp & 0x0f) == 0);   \
>> >> +\
>> >> +__asm__ volatile ("mfence");\
>> >> +if (!unaligned) {   \
>> >> +for (; x+63 < size; x += 64)\
>> >> +COPY64(&dstp[x], &srcp[x], load, "movdqa"); \
>> >> +} else {\
>> >> +COPY16(dst, src, "movdqu", "movdqa");   \
>> >> +for (; x+63 < size; x += 64)\
>> >> +COPY64(&dstp[x], &srcp[x], load, "movdqu"); \
>> >
>> > To use SSE registers in inline asm operands or the clobber list, you
>> > need to build with -msse (which is probably on by default on x86-64).
>> >
>> > Files built with -msse will result in undefined behavior if anything
>> > in them is executed on a pre-SSE CPU, as the flag allows gcc to put
>> > SSE instructions directly in the code wherever it likes.
>> >
>> > The way out of this "design" is not to tell gcc that it is passing
>> > a string with SSE code to the assembler; that is, not to use SSE
>> > registers in operands and not to put them on the clobber list unless
>> > gcc actually is in SSE mode and can use and needs them there.
>> > See XMM_CLOBBERS*.
>>
>> Well, from past experience, lying to gcc is generally not a good thing
>> either. There are multiple interesting ways it could fail from time to
>> time. :)
>>
>> Other approaches:
>> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
>> "ssse3", "sse4.1", etc. This is the easiest way ;
>> - Split into several separate files per target. Though, one would then
>> argue that while we are at it why not just start moving to yasm.
>>
>
>> The former approach looks more appealing to me, considering there may
>> be an effort to migrate to yasm afterwards.
>
> I plan to port this patch to yasm. I'll ask for help on IRC since
> probably it will take too much time otherwise without any guidance.
> --

If you accept a few restrictions (like requiring aligned and padded
input/output) and maybe give it a more specific name so that people
won't try to replace generic memcpy with it, yasm'ing it would be
pretty simple.
If you want it to be generic like the C version, supporting unaligned
input and whatnot, the asm is going to get a bit more verbose.

I could probably whip up a basic implementation of the restricted
version, and the yasm experts can make suggestions on improvements
then.
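
For illustration, the restricted variant could look roughly like the
sketch below, written in C with SSE2 intrinsics rather than yasm. The
name copy_aligned16 and the exact restrictions (16-byte-aligned
pointers, size a multiple of 16) are assumptions made for this example
and are not taken from the patch. The sketch also uses plain aligned
loads and stores, not non-temporal ones, and falls back to memcpy when
SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not part of the patch under discussion.
 * Restrictions: dst and src must be 16-byte aligned and size must be
 * a multiple of 16, i.e. the buffers are padded by the caller. */
static void copy_aligned16(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert((size & 15) == 0);

#if defined(__SSE2__)
    /* One aligned 16-byte load and store per iteration. */
    for (size_t x = 0; x < size; x += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + x));
        _mm_store_si128((__m128i *)(dst + x), v);
    }
#else
    /* Portable fallback when SSE2 is not available at build time. */
    memcpy(dst, src, size);
#endif
}
```

Because the restrictions are checked up front, the loop needs no head
or tail handling, which is what would keep an assembly version short.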

- Hendrik
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread wm4

On Tue, 16 Jun 2015 14:16:11 +0200
Gwenole Beauchesne  wrote:

> Hi,
> 
> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer :
> > On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
> >> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
> >> > On Mon, 15 Jun 2015 17:55:35 +0200
> >> > Stefano Sabatini  wrote:
> >> >
> >> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
> >> > > [...]
> >> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 
> >> > > > 2001
> >> > > > From: Stefano Sabatini 
> >> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200
> >> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
> >> > > > optimizations
> >> > > >
> >> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent 
> >> > > > Aimar
> >> > > > .
> >> > > >
> >> > > > TODO: bump minor, update APIchanges
> >> > > > ---
> >> > > >  libavutil/mem.c  |  9 +
> >> > > >  libavutil/mem.h  | 14 
> >> > > >  libavutil/mem_internal.h | 26 +++
> >> > > >  libavutil/x86/Makefile   |  1 +
> >> > > >  libavutil/x86/mem.c  | 85 
> >> > > > 
> >> > > >  5 files changed, 135 insertions(+)
> >> > > >  create mode 100644 libavutil/mem_internal.h
> >> > > >  create mode 100644 libavutil/x86/mem.c
> >> > > >
> >> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c
> >> > > > index da291fb..0e1eb01 100644
> >> > > > --- a/libavutil/mem.c
> >> > > > +++ b/libavutil/mem.c
> >> > > > @@ -42,6 +42,7 @@
> >> > > >  #include "dynarray.h"
> >> > > >  #include "intreadwrite.h"
> >> > > >  #include "mem.h"
> >> > > > +#include "mem_internal.h"
> >> > > >
> >> > > >  #ifdef MALLOC_PREFIX
> >> > > >
> >> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int 
> >> > > > *size, size_t min_size)
> >> > > >  ff_fast_malloc(ptr, size, min_size, 0);
> >> > > >  }
> >> > > >
> >> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int 
> >> > > > cpu_flags)
> >> > > > +{
> >> > > > +#if ARCH_X86
> >> > > > +ff_memcpynt_x86(dst, src, size, cpu_flags);
> >> > > > +#else
> >> > > > +memcpy(dst, src, size);
> >> > > > +#endif
> >> > > > +}
> >> > >
> >> > > Alternatively, what about something like:
> >> > >
> >> > > av_memcpynt_fn av_memcpynt_get_fn(void);
> >> > >
> >> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for
> >> > > a wrapper calling the right function.
> >> >
> >>
> >> > I don't see much value in this, unless determining the right function
> >> > causes too much overhead.
> >>
> >> I see two advantages, 1. no branch and function call when the function
> >> is called, 2. the cpu_flags must not be passed around, so it's somehow
> >> safer.
> >>
> >> I have no strong preference though, updated (untested patch) in
> >> attachment.
> >> --
> >> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle
> >
> >>  mem.c  |9 +
> >>  mem.h  |   13 +++
> >>  mem_internal.h |   26 +++
> >>  x86/Makefile   |1
> >>  x86/mem.c  |   98 
> >> +
> >>  5 files changed, 147 insertions(+)
> >> f536b25834e0927b8cab5c996042aae697b8d773  
> >> 0003-lavu-mem-add-av_memcpynt_get_fn.patch
> >> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
> >> From: Stefano Sabatini 
> >> Date: Mon, 15 Jun 2015 11:02:50 +0200
> >> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()
> >>
> >> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> >> .
> >>
> >> TODO: remove use of inline assembly, bump minor, update APIchanges
> >> ---
> >>  libavutil/mem.c  |  9 +
> >>  libavutil/mem.h  | 13 +++
> >>  libavutil/mem_internal.h | 26 +
> >>  libavutil/x86/Makefile   |  1 +
> >>  libavutil/x86/mem.c  | 98 
> >> 
> >>  5 files changed, 147 insertions(+)
> >>  create mode 100644 libavutil/mem_internal.h
> >>  create mode 100644 libavutil/x86/mem.c
> >>
> >> diff --git a/libavutil/mem.c b/libavutil/mem.c
> >> index da291fb..325bfc9 100644
> >> --- a/libavutil/mem.c
> >> +++ b/libavutil/mem.c
> >> @@ -42,6 +42,7 @@
> >>  #include "dynarray.h"
> >>  #include "intreadwrite.h"
> >>  #include "mem.h"
> >> +#include "mem_internal.h"
> >>
> >>  #ifdef MALLOC_PREFIX
> >>
> >> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
> >> size_t min_size)
> >>  ff_fast_malloc(ptr, size, min_size, 0);
> >>  }
> >>
> >> +av_memcpynt_fn av_memcpynt_get_fn(void)
> >> +{
> >> +#if ARCH_X86
> >> +return ff_memcpynt_get_fn_x86();
> >> +#else
> >> +return memcpy;
> >> +#endif
> >> +}
> >> diff --git a/libavutil/mem.h b/libavutil/mem.h
> >> index 2a1e36d..d9f1b7a 100644
> >> --- a/libavutil/mem.h
> >> +++ b/libavutil/mem.h
> >> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Stefano Sabatini

On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
> Hi,
> 
> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer :
[...]
> >> +#if HAVE_SSE2
> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 
> >> instruction
> >> + * load and storing data with the SSE>=2 instruction store.
> >> + */
> >> +#define COPY16(dstp, srcp, load, store) \
> >> +__asm__ volatile (  \
> >> +load "  0(%[src]), %%xmm1\n"\
> >> +store " %%xmm1,0(%[dst])\n" \
> >> +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
> >> +
> >> +#define COPY64(dstp, srcp, load, store) \
> >> +__asm__ volatile (  \
> >> +load "  0(%[src]), %%xmm1\n"\
> >> +load " 16(%[src]), %%xmm2\n"\
> >> +load " 32(%[src]), %%xmm3\n"\
> >> +load " 48(%[src]), %%xmm4\n"\
> >> +store " %%xmm1,0(%[dst])\n" \
> >> +store " %%xmm2,   16(%[dst])\n" \
> >> +store " %%xmm3,   32(%[dst])\n" \
> >> +store " %%xmm4,   48(%[dst])\n" \
> >> +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", 
> >> "xmm3", "xmm4")
> >> +#endif
> >> +
> >> +#define COPY_LINE(dstp, srcp, size, load)   \
> >> +const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f;   \
> >> +unsigned x = unaligned; \
> >> +\
> >> +av_assert0(((intptr_t)dstp & 0x0f) == 0);   \
> >> +\
> >> +__asm__ volatile ("mfence");\
> >> +if (!unaligned) {   \
> >> +for (; x+63 < size; x += 64)\
> >> +COPY64(&dstp[x], &srcp[x], load, "movdqa"); \
> >> +} else {\
> >> +COPY16(dst, src, "movdqu", "movdqa");   \
> >> +for (; x+63 < size; x += 64)\
> >> +COPY64(&dstp[x], &srcp[x], load, "movdqu"); \
> >
> > To use SSE registers in inline asm operands or the clobber list, you
> > need to build with -msse (which is probably on by default on x86-64).
> >
> > Files built with -msse will result in undefined behavior if anything
> > in them is executed on a pre-SSE CPU, as the flag allows gcc to put
> > SSE instructions directly in the code wherever it likes.
> >
> > The way out of this "design" is not to tell gcc that it is passing
> > a string with SSE code to the assembler; that is, not to use SSE
> > registers in operands and not to put them on the clobber list unless
> > gcc actually is in SSE mode and can use and needs them there.
> > See XMM_CLOBBERS*.
> 
> Well, from past experience, lying to gcc is generally not a good thing
> either. There are multiple interesting ways it could fail from time to
> time. :)
> 
> Other approaches:
> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
> "ssse3", "sse4.1", etc. This is the easiest way ;
> - Split into several separate files per target. Though, one would then
> argue that while we are at it why not just start moving to yasm.
> 

> The former approach looks more appealing to me, considering there may
> be an effort to migrate to yasm afterwards.

I plan to port this patch to yasm. I'll ask for help on IRC, since
without any guidance it would probably take too much time.
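
For reference, the __attribute__((target(T))) approach quoted above
could be sketched as follows. This is only an illustration under some
assumptions: the names copy16 and copy16_sse2 are made up for the
example, the code uses SSE2 intrinsics instead of the inline asm from
the patch, and it relies on GCC-style extensions (the target attribute
and __builtin_cpu_supports), with a memcpy fallback elsewhere. Older
GCC releases may refuse the intrinsics header on 32-bit builds that do
not enable SSE2 globally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_TARGET_SSE2 1

/* Only this function is compiled with SSE2 enabled; the rest of the
 * file keeps the default instruction set. */
__attribute__((target("sse2")))
static void copy16_sse2(uint8_t *dst, const uint8_t *src, size_t size)
{
    for (size_t x = 0; x < size; x += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + x));
        _mm_storeu_si128((__m128i *)(dst + x), v);
    }
}
#else
#define HAVE_TARGET_SSE2 0
#endif

/* Hypothetical entry point: size must be a multiple of 16. */
static void copy16(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(size % 16 == 0);
#if HAVE_TARGET_SSE2
    /* Runtime check, so the SSE2 code never runs on a pre-SSE2 CPU. */
    if (__builtin_cpu_supports("sse2")) {
        copy16_sse2(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

The compiler is told the truth about the registers it may use inside
copy16_sse2, while the caller stays safe to execute on older CPUs.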
-- 
FFmpeg = Friendly and Fancy Mind-dumbing Pacific Easy Generator

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Gwenole Beauchesne

Hi,

2015-06-16 14:03 GMT+02:00 Michael Niedermayer :
> On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
>> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
>> > On Mon, 15 Jun 2015 17:55:35 +0200
>> > Stefano Sabatini  wrote:
>> >
>> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
>> > > [...]
>> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
>> > > > From: Stefano Sabatini 
>> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200
>> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
>> > > > optimizations
>> > > >
>> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent 
>> > > > Aimar
>> > > > .
>> > > >
>> > > > TODO: bump minor, update APIchanges
>> > > > ---
>> > > >  libavutil/mem.c  |  9 +
>> > > >  libavutil/mem.h  | 14 
>> > > >  libavutil/mem_internal.h | 26 +++
>> > > >  libavutil/x86/Makefile   |  1 +
>> > > >  libavutil/x86/mem.c  | 85 
>> > > > 
>> > > >  5 files changed, 135 insertions(+)
>> > > >  create mode 100644 libavutil/mem_internal.h
>> > > >  create mode 100644 libavutil/x86/mem.c
>> > > >
>> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c
>> > > > index da291fb..0e1eb01 100644
>> > > > --- a/libavutil/mem.c
>> > > > +++ b/libavutil/mem.c
>> > > > @@ -42,6 +42,7 @@
>> > > >  #include "dynarray.h"
>> > > >  #include "intreadwrite.h"
>> > > >  #include "mem.h"
>> > > > +#include "mem_internal.h"
>> > > >
>> > > >  #ifdef MALLOC_PREFIX
>> > > >
>> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int 
>> > > > *size, size_t min_size)
>> > > >  ff_fast_malloc(ptr, size, min_size, 0);
>> > > >  }
>> > > >
>> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int 
>> > > > cpu_flags)
>> > > > +{
>> > > > +#if ARCH_X86
>> > > > +ff_memcpynt_x86(dst, src, size, cpu_flags);
>> > > > +#else
>> > > > +memcpy(dst, src, size);
>> > > > +#endif
>> > > > +}
>> > >
>> > > Alternatively, what about something like:
>> > >
>> > > av_memcpynt_fn av_memcpynt_get_fn(void);
>> > >
>> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for
>> > > a wrapper calling the right function.
>> >
>>
>> > I don't see much value in this, unless determining the right function
>> > causes too much overhead.
>>
>> I see two advantages, 1. no branch and function call when the function
>> is called, 2. the cpu_flags must not be passed around, so it's somehow
>> safer.
>>
>> I have no strong preference though, updated (untested patch) in
>> attachment.
>> --
>> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle
>
>>  mem.c  |9 +
>>  mem.h  |   13 +++
>>  mem_internal.h |   26 +++
>>  x86/Makefile   |1
>>  x86/mem.c  |   98 
>> +
>>  5 files changed, 147 insertions(+)
>> f536b25834e0927b8cab5c996042aae697b8d773  
>> 0003-lavu-mem-add-av_memcpynt_get_fn.patch
>> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
>> From: Stefano Sabatini 
>> Date: Mon, 15 Jun 2015 11:02:50 +0200
>> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()
>>
>> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
>> .
>>
>> TODO: remove use of inline assembly, bump minor, update APIchanges
>> ---
>>  libavutil/mem.c  |  9 +
>>  libavutil/mem.h  | 13 +++
>>  libavutil/mem_internal.h | 26 +
>>  libavutil/x86/Makefile   |  1 +
>>  libavutil/x86/mem.c  | 98 
>> 
>>  5 files changed, 147 insertions(+)
>>  create mode 100644 libavutil/mem_internal.h
>>  create mode 100644 libavutil/x86/mem.c
>>
>> diff --git a/libavutil/mem.c b/libavutil/mem.c
>> index da291fb..325bfc9 100644
>> --- a/libavutil/mem.c
>> +++ b/libavutil/mem.c
>> @@ -42,6 +42,7 @@
>>  #include "dynarray.h"
>>  #include "intreadwrite.h"
>>  #include "mem.h"
>> +#include "mem_internal.h"
>>
>>  #ifdef MALLOC_PREFIX
>>
>> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
>> size_t min_size)
>>  ff_fast_malloc(ptr, size, min_size, 0);
>>  }
>>
>> +av_memcpynt_fn av_memcpynt_get_fn(void)
>> +{
>> +#if ARCH_X86
>> +return ff_memcpynt_get_fn_x86();
>> +#else
>> +return memcpy;
>> +#endif
>> +}
>> diff --git a/libavutil/mem.h b/libavutil/mem.h
>> index 2a1e36d..d9f1b7a 100644
>> --- a/libavutil/mem.h
>> +++ b/libavutil/mem.h
>> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, 
>> size_t min_size);
>>   */
>>  void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
>>
>> +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
>> +
>> +/**
>> + * Return a possibly optimized function to copy size bytes from src
>> + * to dst, using non-temporal copy.
>> + *
>> + * The returned function w

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Gwenole Beauchesne

Hi,

2015-06-16 10:35 GMT+02:00 Stefano Sabatini :
> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
>> On Mon, 15 Jun 2015 17:55:35 +0200
>> Stefano Sabatini  wrote:
>>
>> > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
>> > [...]
>> > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
>> > > From: Stefano Sabatini 
>> > > Date: Mon, 15 Jun 2015 11:02:50 +0200
>> > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
>> > > optimizations
>> > >
>> > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
>> > > .
>> > >
>> > > TODO: bump minor, update APIchanges
>> > > ---
>> > >  libavutil/mem.c  |  9 +
>> > >  libavutil/mem.h  | 14 
>> > >  libavutil/mem_internal.h | 26 +++
>> > >  libavutil/x86/Makefile   |  1 +
>> > >  libavutil/x86/mem.c  | 85 
>> > > 
>> > >  5 files changed, 135 insertions(+)
>> > >  create mode 100644 libavutil/mem_internal.h
>> > >  create mode 100644 libavutil/x86/mem.c
>> > >
>> > > diff --git a/libavutil/mem.c b/libavutil/mem.c
>> > > index da291fb..0e1eb01 100644
>> > > --- a/libavutil/mem.c
>> > > +++ b/libavutil/mem.c
>> > > @@ -42,6 +42,7 @@
>> > >  #include "dynarray.h"
>> > >  #include "intreadwrite.h"
>> > >  #include "mem.h"
>> > > +#include "mem_internal.h"
>> > >
>> > >  #ifdef MALLOC_PREFIX
>> > >
>> > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
>> > > size_t min_size)
>> > >  ff_fast_malloc(ptr, size, min_size, 0);
>> > >  }
>> > >
>> > > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
>> > > +{
>> > > +#if ARCH_X86
>> > > +ff_memcpynt_x86(dst, src, size, cpu_flags);
>> > > +#else
>> > > +memcpy(dst, src, size);
>> > > +#endif
>> > > +}
>> >
>> > Alternatively, what about something like:
>> >
>> > av_memcpynt_fn av_memcpynt_get_fn(void);
>> >
>> > modeled after av_pixelutils_get_sad_fn()? This would skip the need for
>> > a wrapper calling the right function.
>>
>
>> I don't see much value in this, unless determining the right function
>> causes too much overhead.
>
> I see two advantages, 1. no branch and function call when the function
> is called, 2. the cpu_flags must not be passed around, so it's somehow
> safer.

Interesting approach. You could probably also use something similar to
an sws context, which you build up based on the surface size and other
characteristics (flags)?

Regards,
-- 
Gwenole Beauchesne
Intel Corporation SAS / 2 rue de Paris, 92196 Meudon Cedex, France
Registration Number (RCS): Nanterre B 302 456 199

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Michael Niedermayer

On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote:
> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
> > On Mon, 15 Jun 2015 17:55:35 +0200
> > Stefano Sabatini  wrote:
> > 
> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
> > > [...]
> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> > > > From: Stefano Sabatini 
> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200
> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
> > > > optimizations
> > > > 
> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent 
> > > > Aimar
> > > > .
> > > > 
> > > > TODO: bump minor, update APIchanges
> > > > ---
> > > >  libavutil/mem.c  |  9 +
> > > >  libavutil/mem.h  | 14 
> > > >  libavutil/mem_internal.h | 26 +++
> > > >  libavutil/x86/Makefile   |  1 +
> > > >  libavutil/x86/mem.c  | 85 
> > > > 
> > > >  5 files changed, 135 insertions(+)
> > > >  create mode 100644 libavutil/mem_internal.h
> > > >  create mode 100644 libavutil/x86/mem.c
> > > > 
> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c
> > > > index da291fb..0e1eb01 100644
> > > > --- a/libavutil/mem.c
> > > > +++ b/libavutil/mem.c
> > > > @@ -42,6 +42,7 @@
> > > >  #include "dynarray.h"
> > > >  #include "intreadwrite.h"
> > > >  #include "mem.h"
> > > > +#include "mem_internal.h"
> > > >  
> > > >  #ifdef MALLOC_PREFIX
> > > >  
> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
> > > > size_t min_size)
> > > >  ff_fast_malloc(ptr, size, min_size, 0);
> > > >  }
> > > >  
> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int 
> > > > cpu_flags)
> > > > +{
> > > > +#if ARCH_X86
> > > > +ff_memcpynt_x86(dst, src, size, cpu_flags);
> > > > +#else
> > > > +memcpy(dst, src, size);
> > > > +#endif
> > > > +}
> > > 
> > > Alternatively, what about something like:
> > > 
> > > av_memcpynt_fn av_memcpynt_get_fn(void);
> > > 
> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for
> > > a wrapper calling the right function.
> > 
> 
> > I don't see much value in this, unless determining the right function
> > causes too much overhead.
> 
> I see two advantages, 1. no branch and function call when the function
> is called, 2. the cpu_flags must not be passed around, so it's somehow
> safer.
> 
> I have no strong preference though, updated (untested patch) in
> attachment.
> -- 
> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle

>  mem.c  |9 +
>  mem.h  |   13 +++
>  mem_internal.h |   26 +++
>  x86/Makefile   |1 
>  x86/mem.c  |   98 
> +
>  5 files changed, 147 insertions(+)
> f536b25834e0927b8cab5c996042aae697b8d773  
> 0003-lavu-mem-add-av_memcpynt_get_fn.patch
> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
> From: Stefano Sabatini 
> Date: Mon, 15 Jun 2015 11:02:50 +0200
> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()
> 
> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> .
> 
> TODO: remove use of inline assembly, bump minor, update APIchanges
> ---
>  libavutil/mem.c  |  9 +
>  libavutil/mem.h  | 13 +++
>  libavutil/mem_internal.h | 26 +
>  libavutil/x86/Makefile   |  1 +
>  libavutil/x86/mem.c  | 98 
> 
>  5 files changed, 147 insertions(+)
>  create mode 100644 libavutil/mem_internal.h
>  create mode 100644 libavutil/x86/mem.c
> 
> diff --git a/libavutil/mem.c b/libavutil/mem.c
> index da291fb..325bfc9 100644
> --- a/libavutil/mem.c
> +++ b/libavutil/mem.c
> @@ -42,6 +42,7 @@
>  #include "dynarray.h"
>  #include "intreadwrite.h"
>  #include "mem.h"
> +#include "mem_internal.h"
>  
>  #ifdef MALLOC_PREFIX
>  
> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
> size_t min_size)
>  ff_fast_malloc(ptr, size, min_size, 0);
>  }
>  
> +av_memcpynt_fn av_memcpynt_get_fn(void)
> +{
> +#if ARCH_X86
> +return ff_memcpynt_get_fn_x86();
> +#else
> +return memcpy;
> +#endif
> +}
> diff --git a/libavutil/mem.h b/libavutil/mem.h
> index 2a1e36d..d9f1b7a 100644
> --- a/libavutil/mem.h
> +++ b/libavutil/mem.h
> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, 
> size_t min_size);
>   */
>  void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
>  
> +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
> +
> +/**
> + * Return a possibly optimized function to copy size bytes from src
> + * to dst, using non-temporal copy.
> + *
> + * The returned function works like memcpy, but uses non-temporal
> + * instructions when available. This can lead to better performance
> + * when transferring data from source to destination is e

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread Stefano Sabatini

On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded:
> On Mon, 15 Jun 2015 17:55:35 +0200
> Stefano Sabatini  wrote:
> 
> > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
> > [...]
> > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> > > From: Stefano Sabatini 
> > > Date: Mon, 15 Jun 2015 11:02:50 +0200
> > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 
> > > optimizations
> > > 
> > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> > > .
> > > 
> > > TODO: bump minor, update APIchanges
> > > ---
> > >  libavutil/mem.c  |  9 +
> > >  libavutil/mem.h  | 14 
> > >  libavutil/mem_internal.h | 26 +++
> > >  libavutil/x86/Makefile   |  1 +
> > >  libavutil/x86/mem.c  | 85 
> > > 
> > >  5 files changed, 135 insertions(+)
> > >  create mode 100644 libavutil/mem_internal.h
> > >  create mode 100644 libavutil/x86/mem.c
> > > 
> > > diff --git a/libavutil/mem.c b/libavutil/mem.c
> > > index da291fb..0e1eb01 100644
> > > --- a/libavutil/mem.c
> > > +++ b/libavutil/mem.c
> > > @@ -42,6 +42,7 @@
> > >  #include "dynarray.h"
> > >  #include "intreadwrite.h"
> > >  #include "mem.h"
> > > +#include "mem_internal.h"
> > >  
> > >  #ifdef MALLOC_PREFIX
> > >  
> > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
> > > size_t min_size)
> > >  ff_fast_malloc(ptr, size, min_size, 0);
> > >  }
> > >  
> > > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
> > > +{
> > > +#if ARCH_X86
> > > +ff_memcpynt_x86(dst, src, size, cpu_flags);
> > > +#else
> > > +memcpy(dst, src, size);
> > > +#endif
> > > +}
> > 
> > Alternatively, what about something like:
> > 
> > av_memcpynt_fn av_memcpynt_get_fn(void);
> > 
> > modeled after av_pixelutils_get_sad_fn()? This would skip the need for
> > a wrapper calling the right function.
> 

> I don't see much value in this, unless determining the right function
> causes too much overhead.

I see two advantages: 1. no extra branch and wrapper function call
each time the function is called, 2. the cpu_flags need not be passed
around, so it's somewhat safer.

I have no strong preference though, updated (untested patch) in
attachment.
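
To make the comparison concrete, the two call patterns discussed above
can be put side by side in a small sketch. All names here
(memcpynt_fast, cpu_has_fast_path, memcpynt_wrapper, memcpynt_get_fn)
are hypothetical stand-ins, and the "fast" function simply calls memcpy;
the real CPU detection and copy routines are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memcpynt_fn)(void *dst, const void *src, size_t size);

/* Hypothetical stand-in for a CPU-specific implementation. */
static void *memcpynt_fast(void *dst, const void *src, size_t size)
{
    return memcpy(dst, src, size);
}

/* Hypothetical stand-in for CPU feature detection. */
static int cpu_has_fast_path(void)
{
    return 1;
}

/* Pattern 1: a wrapper that takes cpu_flags and branches on every call. */
static void *memcpynt_wrapper(void *dst, const void *src, size_t size,
                              int cpu_flags)
{
    assert(dst && src);
    if (cpu_flags & 1)
        return memcpynt_fast(dst, src, size);
    return memcpy(dst, src, size);
}

/* Pattern 2: a getter that decides once; callers keep the pointer and
 * never see cpu_flags. */
static memcpynt_fn memcpynt_get_fn(void)
{
    if (cpu_has_fast_path())
        return memcpynt_fast;
    return memcpy;
}
```

With the getter, a caller would typically fetch the pointer once during
setup and then call it per line or per plane.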
-- 
FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle
From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
From: Stefano Sabatini 
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
.

TODO: remove use of inline assembly, bump minor, update APIchanges
---
 libavutil/mem.c  |  9 +
 libavutil/mem.h  | 13 +++
 libavutil/mem_internal.h | 26 +
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c  | 98 
 5 files changed, 147 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..325bfc9 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"
 
 #ifdef MALLOC_PREFIX
 
@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
 ff_fast_malloc(ptr, size, min_size, 0);
 }
 
+av_memcpynt_fn av_memcpynt_get_fn(void)
+{
+#if ARCH_X86
+return ff_memcpynt_get_fn_x86();
+#else
+return memcpy;
+#endif
+}
diff --git a/libavutil/mem.h b/libavutil/mem.h
index 2a1e36d..d9f1b7a 100644
--- a/libavutil/mem.h
+++ b/libavutil/mem.h
@@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
  */
 void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
 
+typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
+
+/**
+ * Return a possibly optimized function to copy size bytes from src
+ * to dst, using non-temporal copy.
+ *
+ * The returned function works like memcpy, but uses non-temporal
+ * instructions when available. This can lead to better performance
+ * when transferring data from source to destination is expensive, for
+ * example when reading from GPU memory.
+ */
+av_memcpynt_fn av_memcpynt_get_fn(void);
+
 /**
  * @}
  */
diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
new file mode 100644
index 000..de61cba
--- /dev/null
+++ b/libavutil/mem_internal.h
@@ -0,0 +1,26 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-16 Thread wm4

On Mon, 15 Jun 2015 17:55:35 +0200
Stefano Sabatini  wrote:

> On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
> [...]
> > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> > From: Stefano Sabatini 
> > Date: Mon, 15 Jun 2015 11:02:50 +0200
> > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
> > 
> > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> > .
> > 
> > TODO: bump minor, update APIchanges
> > ---
> >  libavutil/mem.c  |  9 +
> >  libavutil/mem.h  | 14 
> >  libavutil/mem_internal.h | 26 +++
> >  libavutil/x86/Makefile   |  1 +
> >  libavutil/x86/mem.c  | 85 
> > 
> >  5 files changed, 135 insertions(+)
> >  create mode 100644 libavutil/mem_internal.h
> >  create mode 100644 libavutil/x86/mem.c
> > 
> > diff --git a/libavutil/mem.c b/libavutil/mem.c
> > index da291fb..0e1eb01 100644
> > --- a/libavutil/mem.c
> > +++ b/libavutil/mem.c
> > @@ -42,6 +42,7 @@
> >  #include "dynarray.h"
> >  #include "intreadwrite.h"
> >  #include "mem.h"
> > +#include "mem_internal.h"
> >  
> >  #ifdef MALLOC_PREFIX
> >  
> > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
> > size_t min_size)
> >  ff_fast_malloc(ptr, size, min_size, 0);
> >  }
> >  
> > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
> > +{
> > +#if ARCH_X86
> > +ff_memcpynt_x86(dst, src, size, cpu_flags);
> > +#else
> > +memcpy(dst, src, size);
> > +#endif
> > +}
> 
> Alternatively, what about something like:
> 
> av_memcpynt_fn av_memcpynt_get_fn(void);
> 
> modeled after av_pixelutils_get_sad_fn()? This would skip the need for
> a wrapper calling the right function.

I don't see much value in this, unless determining the right function
causes too much overhead.

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-15 Thread Stefano Sabatini

On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
[...]
> From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> From: Stefano Sabatini 
> Date: Mon, 15 Jun 2015 11:02:50 +0200
> Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
> 
> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> .
> 
> TODO: bump minor, update APIchanges
> ---
>  libavutil/mem.c  |  9 +
>  libavutil/mem.h  | 14 
>  libavutil/mem_internal.h | 26 +++
>  libavutil/x86/Makefile   |  1 +
>  libavutil/x86/mem.c  | 85 
> 
>  5 files changed, 135 insertions(+)
>  create mode 100644 libavutil/mem_internal.h
>  create mode 100644 libavutil/x86/mem.c
> 
> diff --git a/libavutil/mem.c b/libavutil/mem.c
> index da291fb..0e1eb01 100644
> --- a/libavutil/mem.c
> +++ b/libavutil/mem.c
> @@ -42,6 +42,7 @@
>  #include "dynarray.h"
>  #include "intreadwrite.h"
>  #include "mem.h"
> +#include "mem_internal.h"
>  
>  #ifdef MALLOC_PREFIX
>  
> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, 
> size_t min_size)
>  ff_fast_malloc(ptr, size, min_size, 0);
>  }
>  
> +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
> +{
> +#if ARCH_X86
> +ff_memcpynt_x86(dst, src, size, cpu_flags);
> +#else
> +memcpy(dst, src, size);
> +#endif
> +}

Alternatively, what about something like:

av_memcpynt_fn av_memcpynt_get_fn(void);

modeled after av_pixelutils_get_sad_fn()? This would skip the need for
a wrapper calling the right function.
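
A minimal sketch of what such a getter could look like. The av_memcpynt_fn type and the memcpynt_c() fallback are hypothetical names used only for illustration, they are not part of the posted patch; a real version would presumably check the CPU flags and return an x86 routine when one is available:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function pointer type, named after the proposal above. */
typedef void (*av_memcpynt_fn)(void *dst, const void *src, size_t size);

/* Plain C fallback (assumption): a thin wrapper around memcpy(). */
void memcpynt_c(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Return the implementation to use. A real version would inspect the
 * CPU flags here and pick an optimized routine; this sketch only knows
 * the C fallback. */
av_memcpynt_fn av_memcpynt_get_fn(void)
{
    return memcpynt_c;
}
```

The caller would fetch the pointer once and reuse it for every copy, so the selection cost is paid a single time.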
-- 
FFmpeg = Frightening and Fantastic Murdering Portentous Erratic Guru

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-15 Thread Stefano Sabatini

On date Saturday 2015-06-13 14:20:07 +0200, Hendrik Leppkes encoded:
> On Thu, Jun 11, 2015 at 8:54 PM, wm4  wrote:
> > On Thu, 11 Jun 2015 17:24:45 +0200
> > Stefano Sabatini  wrote:
> >
> >> Next step would be the use of YASM, but I only want to test if the
> >> general approach is fine (and if the API is not too specific). Also if
> >> someone wants to step up and port it to YASM I'm all for it, since
> >> ASM/YASM is far from being my area of expertise.
> >
> > Personally, I'd probably just
> > 1. export the GPU memcpy function, and
> > 2. export a function to copy AVFrames using this function
> 
> I concur. A basic optimized memcpy with specific constraints (ie.
> requires aligned input/output, always copies in 16-byte chunks, so
> in/out buffers need to be padded appropriately), to keep the required
> ASM code simple.
> These constraints are generally always fulfilled if you have a GPU
> frame on the input, since they will have appropriate strides (and if
> in question, we control allocation of the GPU surfaces as well), and
> we control the output memory buffer anyway.
> 
> On top of that a convenience function that deals with pixel formats,
> strides, planes, and whatnot, and then uses this function.
> A generic C version of the basic copy function shouldn't be needed, we
> could just use memcpy for that.. or a tiny wrapper that calls memcpy,
> anyway.

This is my first attempt. The added function is named av_memcpynt();
it uses inline assembly, which should be replaced by yasm once I or
someone else figures out how to do it.

An av_image_copynt_plane() function can be built on top of that (but
in this case it would be better to inline the av_memcpynt() function).

BTW I dropped the requirement of 16-byte alignment on the size
variable which is required by the VLC code but which looks unnecessary
to me.
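
To illustrate why the size does not have to be a multiple of 16, here is a rough sketch in plain C. The function name copy_any_size() is made up for this example, and memcpy() stands in for the SIMD load/store pairs of the patch; only the way the range is split is the point:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: split the copy into an unaligned head, whole 16-byte
 * blocks and a short tail, so that size can be arbitrary. */
void copy_any_size(uint8_t *dst, const uint8_t *src, size_t size)
{
    /* Bytes until src reaches a 16-byte boundary (0 if already aligned). */
    size_t head = (size_t)(-(uintptr_t)src & 0x0f);
    size_t x;

    if (head > size)
        head = size;
    memcpy(dst, src, head);              /* unaligned head */

    for (x = head; x + 15 < size; x += 16)
        memcpy(dst + x, src + x, 16);    /* blocks starting at aligned src */

    memcpy(dst + x, src + x, size - x);  /* tail of 0..15 bytes */
}
```

The tail is at most 15 bytes, so handling it separately costs little compared with the block loop.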
-- 
FFmpeg = Furious and Foolish Marvellous Pacific Egregious Ghost
From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
From: Stefano Sabatini 
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
.

TODO: bump minor, update APIchanges
---
 libavutil/mem.c  |  9 +
 libavutil/mem.h  | 14 
 libavutil/mem_internal.h | 26 +++
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c  | 85 
 5 files changed, 135 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..0e1eb01 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"
 
 #ifdef MALLOC_PREFIX
 
@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
 ff_fast_malloc(ptr, size, min_size, 0);
 }
 
+void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
+{
+#if ARCH_X86
+ff_memcpynt_x86(dst, src, size, cpu_flags);
+#else
+memcpy(dst, src, size);
+#endif
+}
diff --git a/libavutil/mem.h b/libavutil/mem.h
index 2a1e36d..bbad313 100644
--- a/libavutil/mem.h
+++ b/libavutil/mem.h
@@ -383,6 +383,20 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
 void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);
 
 /**
+ * Copy size bytes from src to dst, using non-temporal copy
+ * functions when available.
+ *
+ * This function works like memcpy(), but uses non-temporal instructions
+ * when available. This can lead to better performance when
+ * transferring data from source to destination is expensive, for
+ * example when reading from GPU memory.
+ *
+ * @param dst destination memory pointer, must be aligned to 16 bytes
+ * @param cpu_flags as returned by av_get_cpu_flags()
+ */
+void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags);
+
+/**
  * @}
  */
 
diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
new file mode 100644
index 000..371be31
--- /dev/null
+++ b/libavutil/mem_internal.h
@@ -0,0 +1,26 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-13 Thread Hendrik Leppkes

On Thu, Jun 11, 2015 at 8:54 PM, wm4  wrote:
> On Thu, 11 Jun 2015 17:24:45 +0200
> Stefano Sabatini  wrote:
>
>> Next step would be the use of YASM, but I only want to test if the
>> general approach is fine (and if the API is not too specific). Also if
>> someone wants to step up and port it to YASM I'm all for it, since
>> ASM/YASM is far from being my area of expertise.
>
> Personally, I'd probably just
> 1. export the GPU memcpy function, and
> 2. export a function to copy AVFrames using this function

I concur. A basic optimized memcpy with specific constraints (ie.
requires aligned input/output, always copies in 16-byte chunks, so
in/out buffers need to be padded appropriately), to keep the required
ASM code simple.
These constraints are generally always fulfilled if you have a GPU
frame on the input, since they will have appropriate strides (and if
in question, we control allocation of the GPU surfaces as well), and
we control the output memory buffer anyway.

On top of that a convenience function that deals with pixel formats,
strides, planes, and whatnot, and then uses this function.
A generic C version of the basic copy function shouldn't be needed, we
could just use memcpy for that.. or a tiny wrapper that calls memcpy,
anyway.
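
A rough sketch of that layering, assuming a hypothetical line_copy_fn primitive and a copy_plane() helper (neither name comes from any posted patch). The generic primitive is just a memcpy() wrapper, and the convenience function only deals with strides:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature for the basic copy primitive. */
typedef void (*line_copy_fn)(void *dst, const void *src, size_t size);

/* Generic fallback: a tiny wrapper around memcpy(). */
void line_copy_c(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Copy one plane line by line; each linesize may be larger than
 * bytewidth, as is typical for padded GPU surfaces. */
void copy_plane(uint8_t *dst, ptrdiff_t dst_linesize,
                const uint8_t *src, ptrdiff_t src_linesize,
                size_t bytewidth, int height, line_copy_fn copy)
{
    for (int y = 0; y < height; y++) {
        copy(dst, src, bytewidth);
        dst += dst_linesize;
        src += src_linesize;
    }
}
```

An optimized primitive could be passed in place of line_copy_c() without touching the plane logic.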

- Hendrik

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-11 Thread wm4

On Thu, 11 Jun 2015 17:24:45 +0200
Stefano Sabatini  wrote:

> Next step would be the use of YASM, but I only want to test if the
> general approach is fine (and if the API is not too specific). Also if
> someone wants to step up and port it to YASM I'm all for it, since
> ASM/YASM is far from being my area of expertise.

Personally, I'd probably just
1. export the GPU memcpy function, and
2. export a function to copy AVFrames using this function

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-06-11 Thread Stefano Sabatini

On date Friday 2015-05-29 09:47:58 -0700, Timothy Gu encoded:
> On Fri, May 29, 2015 at 03:49:22PM +0200, Stefano Sabatini wrote:
[...]
> >  OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o  \
> > diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c
> > new file mode 100644
> > index 000..8b3ed0f
> > --- /dev/null
> > +++ b/libavutil/x86/imgutils.c
> > @@ -0,0 +1,95 @@
> > +/*
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> > 02110-1301 USA
> > + */
> > +
> > +#include 
> > +#include "config.h"
> > +#include "libavutil/avassert.h"
> > +#include "libavutil/imgutils.h"
> > +#include "libavutil/imgutils_internal.h"
> > +
> > +#if HAVE_SSE2
> > +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 
> > instruction
> > + * load and storing data with the SSE>=2 instruction store.
> > + */
> > +#define COPY16(dstp, srcp, load, store) \
> > +__asm__ volatile (  \
> > +load "  0(%[src]), %%xmm1\n"\
> > +store " %%xmm1,0(%[dst])\n" \
> > +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
> > +
> > +#define COPY64(dstp, srcp, load, store) \
> > +__asm__ volatile (  \
> > +load "  0(%[src]), %%xmm1\n"\
> > +load " 16(%[src]), %%xmm2\n"\
> > +load " 32(%[src]), %%xmm3\n"\
> > +load " 48(%[src]), %%xmm4\n"\
> > +store " %%xmm1,0(%[dst])\n" \
> > +store " %%xmm2,   16(%[dst])\n" \
> > +store " %%xmm3,   32(%[dst])\n" \
> > +store " %%xmm4,   48(%[dst])\n" \
> > +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", 
> > "xmm3", "xmm4")
> > +#endif
> > +
> > +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize,
> > +  const uint8_t *src, size_t src_linesize,
> > +  unsigned bytewidth, unsigned height,
> > +  int cpu_flags)
> > +{
> > +#if !HAVE_SSSE3
> 

> Are any SSSE3 instructions used?

No. I re-checked: MOVDQA/MOVDQU were introduced in SSE2, MOVNTDQA in SSE4.1.

> > +return av_image_copy_plane(dst, dst_linesize, src, src_linesize, 
> > bytewidth, height);
> > +#endif
> > +
> > +av_assert0(((intptr_t)dst & 0x0f) == 0 && (dst_linesize & 0x0f) == 0);
> > +
> > +__asm__ volatile ("mfence");
> > +
> > +for (unsigned y = 0; y < height; y++) {
> > +const unsigned unaligned = (-(uintptr_t)src) & 0x0f;
> > +unsigned x = unaligned;
> > +
> 
> > +#if HAVE_SSE42
> > +if (cpu_flags & AV_CPU_FLAG_SSE4) {
> 
> movntdqa is an SSE4.1 instruction, so this should work better:
> 
> if (INLINE_SSE4(cpu_flags))
> 
> That checks both HAVE_SSE4_INLINE and cpu_flags for AV_CPU_FLAG_SSE4.
> 
> (But then like others have said new inline asm code shouldn't be added in the
> first place)

Next step would be the use of YASM, but I only want to test if the
general approach is fine (and if the API is not too specific). Also if
someone wants to step up and port it to YASM I'm all for it, since
ASM/YASM is far from being my area of expertise.
-- 
FFmpeg = Fiendish Fabulous Most Pure Evangelical God
From ec96aee1930247248a5e438171c120ea3f5dbbea Mon Sep 17 00:00:00 2001
From: Stefano Sabatini 
Date: Fri, 15 May 2015 18:58:17 +0200
Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function.

This function supports optimized GPU-to-CPU copies.

Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
.

TODO: fix integration with the build system, update APIchanges and bump
minor once ready
---
 libavutil/imgutils.c  |  13 +
 libavutil/imgutils.h  |  18 ++
 libavutil/imgutils_internal.h |  29 ++
 libavutil/x86/Makefile|   1 +
 libavutil/x86/imgutils.c  | 126 ++
 5 files changed, 187 insertions(+)
 create mode 100644 libavutil/imgutils_internal.h
 create mode 100644 libavutil/x86/imgutils.c

diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
index ef0e671..59a0054 100644
--- a/libavutil/imgutils.c
+++ b/libavutil/imgutils.c
@@ -30,6 +30,7 @@
 #include "mathematics.h"
 #include "pixdesc.h"
 #include "r

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-29 Thread Timothy Gu

On Fri, May 29, 2015 at 03:49:22PM +0200, Stefano Sabatini wrote:
> @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
>  
>  return size;
>  }
> +
> +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
> +const uint8_t *src, size_t src_linesize,
> +unsigned bytewidth, unsigned height,
> +int cpu_flags)
> +{
> +#if !HAVE_SSSE3

> +av_unused(cpu_flags);

av_unused has a different definition than VLC_UNUSED. Just use a (void) cast.
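
For illustration, a small sketch of the (void) cast. As far as I know av_unused is an attribute macro meant for declarations rather than a function-like macro, which is why calling it as a statement does not do what VLC_UNUSED does. The copy_fallback() name is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Fallback path that ignores cpu_flags. The (void) cast marks the
 * parameter as intentionally unused without relying on any macro. */
void copy_fallback(void *dst, const void *src, size_t size, int cpu_flags)
{
    (void)cpu_flags;
    memcpy(dst, src, size);
}
```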

> +av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, 
> height);
> +#else
> +ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, 
> bytewidth, height, cpu_flags);
> +#endif
> +}
> diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h
> index 23282a3..184e1e7 100644
> --- a/libavutil/imgutils.h
> +++ b/libavutil/imgutils.h
> @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t   *dst, int 
> dst_linesize,
>   int bytewidth, int height);
>  
>  /**
> + * Copy image plane from src to dst, similar to av_image_copy_plane().
> + * src must be an USWC buffer.
> + * It performs optimized copy from "Uncacheable Speculative Write
> + * Combining" memory as used by some video surface.
> + * It is really efficient only when SSE4.1 is available.
> + *
> + * In case the target CPU does not support USWC caching this function
> + * will be equivalent to av_image_copy_plane().
> + *
> + * @param cpu_flags as returned by av_get_cpu_flags()
> + * @see av_image_copy_plane()
> + */
> +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
> +   const uint8_t *src, size_t src_linesize,
> +   unsigned bytewidth, unsigned height,
> +   int cpu_flags);
> +
> +/**
>   * Copy image in src_data to dst_data.
>   *
>   * @param dst_linesizes linesizes for the image in dst_data
> diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h
> new file mode 100644
> index 000..9576afe
> --- /dev/null
> +++ b/libavutil/imgutils_internal.h
> @@ -0,0 +1,29 @@
> +/*
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> + */
> +
> +#ifndef AVUTIL_IMGUTILS_INTERNAL_H
> +#define AVUTIL_IMGUTILS_INTERNAL_H
> +
> +#include "imgutils.h"
> +
> +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize,
> +const uint8_t *src, size_t src_linesize,
> +unsigned bytewidth, unsigned height,
> +int cpu_flags);
> +
> +#endif /* AVUTIL_IMGUTILS_INTERNAL_H */
> diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile
> index eb70a62..a719c00 100644
> --- a/libavutil/x86/Makefile
> +++ b/libavutil/x86/Makefile
> @@ -1,5 +1,6 @@
>  OBJS += x86/cpu.o   \
>  x86/float_dsp_init.o\
> +x86/imgutils.o  \
>  x86/lls_init.o  \
>  
>  OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o  \
> diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c
> new file mode 100644
> index 000..8b3ed0f
> --- /dev/null
> +++ b/libavutil/x86/imgutils.c
> @@ -0,0 +1,95 @@
> +/*
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-29 Thread Stefano Sabatini

On date Thursday 2015-05-28 18:02:34 -0300, James Almer encoded:
> On 28/05/15 2:39 PM, Stefano Sabatini wrote:
> > From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001
> > From: Stefano Sabatini 
> > Date: Fri, 15 May 2015 18:58:17 +0200
> > Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() 
> > function.
> > 
> > This function supports optimized GPU-to-CPU copies.
> > 
> > Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> > .
> > 
> > TODO: fix integration with the build system, bump micro
> > 
> > Signed-off-by: Stefano Sabatini 
> > ---
> >  libavutil/imgutils.c  |  14 ++
> >  libavutil/imgutils.h  |  18 +++
> >  libavutil/imgutils_internal.h |  29 +++
> >  libavutil/x86/Makefile|   1 +
> >  libavutil/x86/imgutils.c  | 109 
> > ++
> >  5 files changed, 171 insertions(+)
> >  create mode 100644 libavutil/imgutils_internal.h
> >  create mode 100644 libavutil/x86/imgutils.c
> > 
> > diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
> > index ef0e671..e538c75 100644
> > --- a/libavutil/imgutils.c
> > +++ b/libavutil/imgutils.c
> > @@ -30,6 +30,7 @@
> >  #include "mathematics.h"
> >  #include "pixdesc.h"
> >  #include "rational.h"
> > +#include "imgutils_internal.h"
> >  
> >  void av_image_fill_max_pixsteps(int max_pixsteps[4], int 
> > max_pixstep_comps[4],
> >  const AVPixFmtDescriptor *pixdesc)
> > @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
> >  
> >  return size;
> >  }
> > +
> > +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
> > +  const uint8_t *src, size_t src_linesize,
> > +  unsigned bytewidth, unsigned height,
> > +  unsigned cpu_flags)
> > +{
> > +#ifndef HAVE_SSSE3
> 
> All HAVE_ are always defined to either 0 or 1.

Fixed.
 
> Nonetheless, this kind of check does not belong outside of arch folders. You 
> should
> check for ARCH_X86 to call functions in the x86/ folder. See lavc/lavfi for 
> examples.

I see, but I think this use case is pretty different. We don't have a
context where to set a function pointer, and I don't want to add a new
context and API for such things (but I'm open to suggestions). A
probably slightly ugly alternative could be to define a function such
as:
get_ff_image_copy_plane_from_uswc_fn()

returning a pointer to the correct function.

[...]
> > diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c
> > new file mode 100644
> > index 000..91c7a42
> > --- /dev/null
> > +++ b/libavutil/x86/imgutils.c
> > @@ -0,0 +1,109 @@
> > +/*
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> > 02110-1301 USA
> > + */
> > +
> > +#include 
> > +#include "config.h"
> > +#include "libavutil/attributes.h"
> > +#include "libavutil/avassert.h"
> > +#include "libavutil/intreadwrite.h"
> > +#include "libavutil/x86/asm.h"
> > +#include "libavutil/x86/cpu.h"
> > +#include "libavutil/cpu.h"
> > +#include "libavutil/pixdesc.h"
> > +
> > +#include "libavutil/avassert.h"
> > +#include "libavutil/x86/asm.h"
> > +#include "libavutil/imgutils.h"
> > +#include "libavutil/imgutils_internal.h"
> > +
> > +#ifdef HAVE_SSE2
> > +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 
> > instruction
> > + * load and storing data with the SSE>=2 instruction store.
> > + */
> > +#define COPY16(dstp, srcp, load, store) \
> > +__asm__ volatile (  \
> > +load "  0(%[src]), %%xmm1\n"\
> > +store " %%xmm1,0(%[dst])\n" \
> > +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
> > +
> > +#define COPY64(dstp, srcp, load, store) \
> > +__asm__ volatile (  \
> > +load "  0(%[src]), %%xmm1\n"\
> > +load " 16(%[src]), %%xmm2\n"\
> > +load " 32(%[src]), %%xmm3\n"\
> > +load " 48(%[src]), %%xmm4\n"\
> > +store " %%xmm1,0(%[dst])\n" \
> > +store " %%xmm2,   16(%[dst])\n" \
> > +store " %%xmm3,   32(%[dst])\n" \
> > +store " %%xmm4,   48(%[dst])\n" \
> > +

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-28 Thread James Almer

On 28/05/15 2:39 PM, Stefano Sabatini wrote:
> From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001
> From: Stefano Sabatini 
> Date: Fri, 15 May 2015 18:58:17 +0200
> Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function.
> 
> This function supports optimized GPU-to-CPU copies.
> 
> Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
> .
> 
> TODO: fix integration with the build system, bump micro
> 
> Signed-off-by: Stefano Sabatini 
> ---
>  libavutil/imgutils.c  |  14 ++
>  libavutil/imgutils.h  |  18 +++
>  libavutil/imgutils_internal.h |  29 +++
>  libavutil/x86/Makefile|   1 +
>  libavutil/x86/imgutils.c  | 109 
> ++
>  5 files changed, 171 insertions(+)
>  create mode 100644 libavutil/imgutils_internal.h
>  create mode 100644 libavutil/x86/imgutils.c
> 
> diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
> index ef0e671..e538c75 100644
> --- a/libavutil/imgutils.c
> +++ b/libavutil/imgutils.c
> @@ -30,6 +30,7 @@
>  #include "mathematics.h"
>  #include "pixdesc.h"
>  #include "rational.h"
> +#include "imgutils_internal.h"
>  
>  void av_image_fill_max_pixsteps(int max_pixsteps[4], int 
> max_pixstep_comps[4],
>  const AVPixFmtDescriptor *pixdesc)
> @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
>  
>  return size;
>  }
> +
> +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
> +const uint8_t *src, size_t src_linesize,
> +unsigned bytewidth, unsigned height,
> +unsigned cpu_flags)
> +{
> +#ifndef HAVE_SSSE3

All HAVE_ are always defined to either 0 or 1.

Nonetheless, this kind of check does not belong outside of arch folders. You 
should
check for ARCH_X86 to call functions in the x86/ folder. See lavc/lavfi for 
examples.
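
To make the first point concrete, a short sketch with a stand-in macro (HAVE_EXAMPLE_FEATURE is made up here, it is not a real config.h entry). Because the macro is defined to 0, #ifdef and #ifndef test the wrong thing, while #if tests the value:

```c
/* Stand-in for a config.h entry; HAVE_* macros are defined as 0 or 1. */
#define HAVE_EXAMPLE_FEATURE 0

/* Misleading for 0/1 macros: #ifdef is true even though the value is 0. */
int feature_enabled_ifdef(void)
{
#ifdef HAVE_EXAMPLE_FEATURE
    return 1;
#else
    return 0;
#endif
}

/* Intended behaviour: #if tests the value of the macro. */
int feature_enabled_if(void)
{
#if HAVE_EXAMPLE_FEATURE
    return 1;
#else
    return 0;
#endif
}
```

With the macro set to 0, the first function reports the feature as enabled and the second reports it as disabled.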

> +av_unused(cpu_flags);
> +av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, 
> height);
> +#else
> +ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, 
> bytewidth, height, cpu_flags);
> +#endif
> +}
> diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h
> index 23282a3..82c3826 100644
> --- a/libavutil/imgutils.h
> +++ b/libavutil/imgutils.h
> @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t   *dst, int 
> dst_linesize,
>   int bytewidth, int height);
>  
>  /**
> + * Copy image plane from src to dst, similar to av_image_copy_plane().
> + * src must be an USWC buffer.
> + * It performs optimized copy from "Uncacheable Speculative Write
> + * Combining" memory as used by some video surface.
> + * It is really efficient only when SSE4.1 is available.
> + *
> + * In case the target CPU does not support USWC caching this function
> + * will be equivalent to av_image_copy_plane().
> + *
> + * @param cpu_flags as returned by av_get_cpu_flags()
> + * @see av_image_copy_plane()
> + */
> +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
> +   const uint8_t *src, size_t src_linesize,
> +   unsigned bytewidth, unsigned height,
> +   unsigned cpu_flags);
> +
> +/**
>   * Copy image in src_data to dst_data.
>   *
>   * @param dst_linesizes linesizes for the image in dst_data
> diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h
> new file mode 100644
> index 000..16ed977
> --- /dev/null
> +++ b/libavutil/imgutils_internal.h
> @@ -0,0 +1,29 @@
> +/*
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> + */
> +
> +#ifndef AVUTIL_IMGUTILS_INTERNAL_H
> +#define AVUTIL_IMGUTILS_INTERNAL_H
> +
> +#include "imgutils.h"
> +
> +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize,
> +const uint8_t *src, size_t src_linesize,
> +unsigned bytewidth, unsigned height,
> +unsigned cpu_flags);
> +
> +#endif /* AVUTIL_IMGUTILS_INTERNAL_H */
> diff --git a/libavutil/x86/Makefile b/libavutil/x

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-28 Thread Hendrik Leppkes

On Thu, May 28, 2015 at 7:39 PM, Stefano Sabatini  wrote:
> On date Monday 2015-05-18 13:26:56 +0200, Stefano Sabatini encoded:
>> On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes 
>> wrote:
>>
>> > On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini 
>> > wrote:
>> >
>> [...]
>>
>> > >
>> > > I have a first hackish patch. I ran some tests and saw significant
>> > > performance gains: on my Core i5 with Intel HD Graphics 4000, DXVA2
>> > > decoding of an H.264 1920x1080 video now matches the software decoder
>> > > while using only a single thread. The patch as it stands is a hack,
>> > > since I had to modify the compilation flags to enable assembly
>> > > compilation in the ffmpeg_dxva2.c file. I should probably create an
>> > > optimized copy function in libavutil; comments are welcome.
>> >
>> > FWIW, I never saw any benefits from using a small cache over simply
>> > copying directly to the destination memory, that could potentially
>> > simplify this a bit.
>> >
>>
>>
>> > And yeah, it's a huge hack, we don't want new inline assembly.
>> >
>>
>> The sanest approach is probably to add a function to libavutil. The
>> optimized copy would then be accessible to third-party library users, with
>> no assembly hacks involved.
>
> New patch attached, it's still somewhat hackish, please advise if you
> consider this approach acceptable.
>

The general concept is fine, but it should not use inline asm, and
someone will want to argue about the name and placement etc... :)

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-28 Thread Stefano Sabatini

On date Monday 2015-05-18 13:26:56 +0200, Stefano Sabatini encoded:
> On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes 
> wrote:
> 
> > On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini 
> > wrote:
> >
> [...]
> 
> > >
> > > I have a first hackish patch. I ran some tests and saw significant
> > > performance gains: on my Core i5 with Intel HD Graphics 4000, DXVA2
> > > decoding of an H.264 1920x1080 video now matches the software decoder
> > > while using only a single thread. The patch as it stands is a hack,
> > > since I had to modify the compilation flags to enable assembly
> > > compilation in the ffmpeg_dxva2.c file. I should probably create an
> > > optimized copy function in libavutil; comments are welcome.
> >
> > FWIW, I never saw any benefits from using a small cache over simply
> > copying directly to the destination memory, that could potentially
> > simplify this a bit.
> >
> 
> 
> > And yeah, it's a huge hack, we don't want new inline assembly.
> >
> 
> The sanest approach is probably to add a function to libavutil. The
> optimized copy would then be accessible to third-party library users, with
> no assembly hacks involved.

New patch attached, it's still somewhat hackish, please advise if you
consider this approach acceptable.
-- 
FFmpeg = Formidable and Friendly MultiPurpose Explosive Game
From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001
From: Stefano Sabatini 
Date: Fri, 15 May 2015 18:58:17 +0200
Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function.

This function supports optimized GPU-to-CPU copies.

Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar
.

TODO: fix integration with the build system, bump micro

Signed-off-by: Stefano Sabatini 
---
 libavutil/imgutils.c  |  14 ++
 libavutil/imgutils.h  |  18 +++
 libavutil/imgutils_internal.h |  29 +++
 libavutil/x86/Makefile|   1 +
 libavutil/x86/imgutils.c  | 109 ++
 5 files changed, 171 insertions(+)
 create mode 100644 libavutil/imgutils_internal.h
 create mode 100644 libavutil/x86/imgutils.c

diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
index ef0e671..e538c75 100644
--- a/libavutil/imgutils.c
+++ b/libavutil/imgutils.c
@@ -30,6 +30,7 @@
 #include "mathematics.h"
 #include "pixdesc.h"
 #include "rational.h"
+#include "imgutils_internal.h"
 
 void av_image_fill_max_pixsteps(int max_pixsteps[4], int max_pixstep_comps[4],
 const AVPixFmtDescriptor *pixdesc)
@@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size,
 
 return size;
 }
+
+void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
+   const uint8_t *src, size_t src_linesize,
+   unsigned bytewidth, unsigned height,
+   unsigned cpu_flags)
+{
+#ifndef HAVE_SSSE3
+av_unused(cpu_flags);
+av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, height);
+#else
+ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, bytewidth, height, cpu_flags);
+#endif
+}
diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h
index 23282a3..82c3826 100644
--- a/libavutil/imgutils.h
+++ b/libavutil/imgutils.h
@@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t   *dst, int dst_linesize,
  int bytewidth, int height);
 
 /**
+ * Copy image plane from src to dst, similar to av_image_copy_plane().
+ * src must be an USWC buffer.
+ * It performs optimized copy from "Uncacheable Speculative Write
+ * Combining" memory as used by some video surface.
+ * It is really efficient only when SSE4.1 is available.
+ *
+ * In case the target CPU does not support USWC caching this function
+ * will be equivalent to av_image_copy_plane().
+ *
+ * @param cpu_flags as returned by av_get_cpu_flags()
+ * @see av_image_copy_plane()
+ */
+void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize,
+   const uint8_t *src, size_t src_linesize,
+   unsigned bytewidth, unsigned height,
+   unsigned cpu_flags);
+
+/**
  * Copy image in src_data to dst_data.
  *
  * @param dst_linesizes linesizes for the image in dst_data
diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h
new file mode 100644
index 000..16ed977
--- /dev/null
+++ b/libavutil/imgutils_internal.h
@@ -0,0 +1,29 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Hendrik Leppkes

On Mon, May 18, 2015 at 9:41 PM, Reimar Döffinger
 wrote:
>
>
> On 18.05.2015, at 12:37, Stefano Sabatini  wrote:
>
>> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini 
>> wrote:
>>
>>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>>> [...]
>>>>> One limitation is as the manual said, it needs to be copied from the
>>>>> GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
>>>>> copy function for this, it uses plain old memcpy.
>>>>> Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
>>>>> is optimized for copying from USWC memory (Uncacheable Speculative
>>>>> Write Combining) to system memory. Using this may help speed up the
>>>>> process significantly, and VLC probably uses it.
>>>>
>>>> Now the question is, how would be possible to optimize GPU to CPU copy
>>>> to get an overall performance gain? At least VLC seems able to get
>>>> better performances when using HW decoding, but I'm not sure it is
>>>> copying decoded data back to the CPU (indeed it may perform direct
>>>> rendering).
>>>
>>> Self-reply:
>>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>>> Author: Laurent Aimar 
>>> Date:   Tue Nov 17 01:09:43 2009 +0100
>>>
>>>Improved performance when copying video surface in dxva2.
>>>
>>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>>> instructions are available.
>>>
>>
>> I have a first hackish patch, performed some tests and I got some
>> significant performance gains, on my iCore5 with Intel Graphics HD4000 I
>> have now the same performance as the software decoder using DXVA2 for
>> decoding a H.264 1920x1080 video, but using only a single thread. The patch
>> as is is a hack, since I had to modify the compilation flags to enable
>> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
>> an optimized copy function in libavutil, comments are welcome.
>
> What exactly is SSE4 needed for?

MOVNTDQA; it's specifically designed for just this task.

> Both non-temporal movs and prefetches existed before it, so if that is 
> critical for performance the fallback implementation is bad.

An SSE2 implementation may or may not be faster than plain memcpy; that
depends on the memcpy implementation. In my tests on Windows, an SSE2
implementation was usually not worth it.

> However possibly more important: why is a memcpy needed at all?

For any further processing, you need the frame data. And trying to use
the frame data directly from the locked surfaces, e.g. for an encoder, is
very inefficient (possibly random access pattern), so it needs to be
copied into normal memory first.
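
To make the copy step concrete, here is a minimal sketch of a row-by-row
plane copy from a pitched source buffer (such as a locked surface) into
ordinary memory. The helper name copy_plane_rows and its parameters are
hypothetical and not part of FFmpeg; it uses plain memcpy, so it only
illustrates the sequential access pattern, not the optimized path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an FFmpeg function): copy `height` rows of
 * `bytewidth` bytes each. Source and destination may have different
 * line sizes (pitches); only the visible bytes of each row are copied. */
static void copy_plane_rows(uint8_t *dst, size_t dst_linesize,
                            const uint8_t *src, size_t src_linesize,
                            size_t bytewidth, size_t height)
{
    /* A row must fit inside both line sizes. */
    assert(bytewidth <= dst_linesize && bytewidth <= src_linesize);

    for (size_t y = 0; y < height; y++) {
        /* Each source row is read once, front to back. */
        memcpy(dst, src, bytewidth);
        dst += dst_linesize;
        src += src_linesize;
    }
}
```

Once the data sits in normal cached memory, a consumer such as an encoder
can access it in any order without paying the uncached read cost each time.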

- Hendrik
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Reimar Döffinger



On 18.05.2015, at 12:37, Stefano Sabatini  wrote:

> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini 
> wrote:
> 
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> [...]
>>>> One limitation is as the manual said, it needs to be copied from the
>>>> GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
>>>> copy function for this, it uses plain old memcpy.
>>>> Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
>>>> is optimized for copying from USWC memory (Uncacheable Speculative
>>>> Write Combining) to system memory. Using this may help speed up the
>>>> process significantly, and VLC probably uses it.
>>> 
>>> Now the question is, how would be possible to optimize GPU to CPU copy
>>> to get an overall performance gain? At least VLC seems able to get
>>> better performances when using HW decoding, but I'm not sure it is
>>> copying decoded data back to the CPU (indeed it may perform direct
>>> rendering).
>> 
>> Self-reply:
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar 
>> Date:   Tue Nov 17 01:09:43 2009 +0100
>> 
>>Improved performance when copying video surface in dxva2.
>> 
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>> instructions are available.
>> 
> 
> I have a first hackish patch, performed some tests and I got some
> significant performance gains, on my iCore5 with Intel Graphics HD4000 I
> have now the same performance as the software decoder using DXVA2 for
> decoding a H.264 1920x1080 video, but using only a single thread. The patch
> as is is a hack, since I had to modify the compilation flags to enable
> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
> an optimized copy function in libavutil, comments are welcome.

What exactly is SSE4 needed for?
Both non-temporal movs and prefetches existed before it, so if that is
critical for performance, the fallback implementation is bad.
However, possibly more important: why is a memcpy needed at all?

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Stefano Sabatini

On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes 
wrote:

> On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini 
> wrote:
>
[...]

> >
> > I have a first hackish patch, performed some tests and I got some
> > significant performance gains, on my iCore5 with Intel Graphics HD4000 I
> > have now the same performance as the software decoder using DXVA2 for
> > decoding a H.264 1920x1080 video, but using only a single thread. The
> patch
> > as is is a hack, since I had to modify the compilation flags to enable
> > assembly compilation in the ffmpeg_dxva2.c file. I should probably create
> > an optimized copy function in libavutil, comments are welcome.
>
> FWIW, I never saw any benefits from using a small cache over simply
> copying directly to the destination memory, that could potentially
> simplify this a bit.
>


> And yeah, its a huge hack, we don't want new inline assembly.
>

The sanest approach is probably to add a function to libavutil. The
optimized copy would then be accessible to third-party library users, with
no assembly hacks involved.

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Hendrik Leppkes

On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini  wrote:
> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini 
> wrote:
>
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> > On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> [...]
>> > > One limitation is as the manual said, it needs to be copied from the
>> > > GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
>> > > copy function for this, it uses plain old memcpy.
>> > > Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
>> > > is optimized for copying from USWC memory (Uncacheable Speculative
>> > > Write Combining) to system memory. Using this may help speed up the
>> > > process significantly, and VLC probably uses it.
>> >
>> > Now the question is, how would be possible to optimize GPU to CPU copy
>> > to get an overall performance gain? At least VLC seems able to get
>> > better performances when using HW decoding, but I'm not sure it is
>> > copying decoded data back to the CPU (indeed it may perform direct
>> > rendering).
>>
>> Self-reply:
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar 
>> Date:   Tue Nov 17 01:09:43 2009 +0100
>>
>> Improved performance when copying video surface in dxva2.
>>
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>> instructions are available.
>>
>
> I have a first hackish patch, performed some tests and I got some
> significant performance gains, on my iCore5 with Intel Graphics HD4000 I
> have now the same performance as the software decoder using DXVA2 for
> decoding a H.264 1920x1080 video, but using only a single thread. The patch
> as is is a hack, since I had to modify the compilation flags to enable
> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
> an optimized copy function in libavutil, comments are welcome.

FWIW, I never saw any benefit from using a small cache over simply
copying directly to the destination memory; dropping it could
potentially simplify this a bit.
And yeah, it's a huge hack; we don't want new inline assembly.

- Hendrik

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-18 Thread Stefano Sabatini

On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini 
wrote:

> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
> > On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
> > > One limitation is as the manual said, it needs to be copied from the
> > > GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
> > > copy function for this, it uses plain old memcpy.
> > > Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
> > > is optimized for copying from USWC memory (Uncacheable Speculative
> > > Write Combining) to system memory. Using this may help speed up the
> > > process significantly, and VLC probably uses it.
> >
> > Now the question is, how would be possible to optimize GPU to CPU copy
> > to get an overall performance gain? At least VLC seems able to get
> > better performances when using HW decoding, but I'm not sure it is
> > copying decoded data back to the CPU (indeed it may perform direct
> > rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar 
> Date:   Tue Nov 17 01:09:43 2009 +0100
>
> Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
> instructions are available.
>

I have a first hackish patch. I performed some tests and got
significant performance gains: on my Intel Core i5 with Intel HD Graphics
4000 I now get the same performance as the software decoder when using
DXVA2 to decode an H.264 1920x1080 video, but using only a single thread.
The patch as is is a hack, since I had to modify the compilation flags to
enable assembly compilation in the ffmpeg_dxva2.c file. I should probably
create an optimized copy function in libavutil; comments are welcome.

The IDirect3D9_CreateDevice(... GetShellWindow ...) -> GetDesktopWindow
change is required to make it compile under MinGW. With MinGW64 it is
probably not required; I still have to switch to MinGW64, but allowing
MinGW compilation is still worthwhile.

0001-ffmpeg_dxva.c-add-support-to-optimized-GPU-to-CPU-co.patch
Description: Binary data

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread Hendrik Leppkes

On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini  wrote:
> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
>> > One limitation is as the manual said, it needs to be copied from the
>> > GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
>> > copy function for this, it uses plain old memcpy.
>> > Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
>> > is optimized for copying from USWC memory (Uncacheable Speculative
>> > Write Combining) to system memory. Using this may help speed up the
>> > process significantly, and VLC probably uses it.
>>
>> Now the question is, how would be possible to optimize GPU to CPU copy
>> to get an overall performance gain? At least VLC seems able to get
>> better performances when using HW decoding, but I'm not sure it is
>> copying decoded data back to the CPU (indeed it may perform direct
>> rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar 
> Date:   Tue Nov 17 01:09:43 2009 +0100
>
> Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
> instructions are available.

Actually, the proper instructions are SSE4.1; using SSE2 would give
only a small advantage over memcpy.

- Hendrik

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread wm4

On Thu, 14 May 2015 14:52:29 +0200
Stefano Sabatini  wrote:

> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
> > On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
> > > One limitation is as the manual said, it needs to be copied from the
> > > GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
> > > copy function for this, it uses plain old memcpy.
> > > Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
> > > is optimized for copying from USWC memory (Uncacheable Speculative
> > > Write Combining) to system memory. Using this may help speed up the
> > > process significantly, and VLC probably uses it.
> > 
> > Now the question is, how would be possible to optimize GPU to CPU copy
> > to get an overall performance gain? At least VLC seems able to get
> > better performances when using HW decoding, but I'm not sure it is
> > copying decoded data back to the CPU (indeed it may perform direct
> > rendering).
> 
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar 
> Date:   Tue Nov 17 01:09:43 2009 +0100
> 
> Improved performance when copying video surface in dxva2.
> 
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
> instructions are available.

Here's what lavfilters appears to use:

http://git.1f0.de/gitweb?p=lavfsplitter.git;a=blob;f=common/DSUtilLite/gpu_memcpy_sse4.h

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread Stefano Sabatini

On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
[...]
> > One limitation is as the manual said, it needs to be copied from the
> > GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
> > copy function for this, it uses plain old memcpy.
> > Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
> > is optimized for copying from USWC memory (Uncacheable Speculative
> > Write Combining) to system memory. Using this may help speed up the
> > process significantly, and VLC probably uses it.
> 
> Now the question is, how would be possible to optimize GPU to CPU copy
> to get an overall performance gain? At least VLC seems able to get
> better performances when using HW decoding, but I'm not sure it is
> copying decoded data back to the CPU (indeed it may perform direct
> rendering).

Self-reply:
commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
Author: Laurent Aimar 
Date:   Tue Nov 17 01:09:43 2009 +0100

Improved performance when copying video surface in dxva2.

That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
instructions are available.
-- 
FFmpeg = Fundamental & Frightening Mean Peaceful EniGma

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-14 Thread Stefano Sabatini

On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini  wrote:
[...]
> > There are some cases when DXVA2 (or in general HW decoding) can be
> > used effectively in ffmpeg? Can you tell if there is something which
> > could be improved in the current ffmpeg_dxva2.c implementation? (My
> > guess is that this code is somehow based on the VLC code).
> 
> Its not based on the VLC code, its roughly based on code from my own
> project that uses ffmpeg for DXVA2, but really, the workflow is going
> to be pretty similar in any implementation either way, since the MS
> API dictates that, more or less.
> 
> DXVA2 decoding can be faster then software decoding, depending on your 
> hardware.
> 
> If you used a low-end Intel CPU, say a Pentium or i3 (Ivy or Haswell),
> or use a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2 decoding on
> the GPU can potentially give you ~400 fps for 1080p, while the CPU
> will likely not manage that.
> On a high-end CPU, the software decoder can potentially exceed that, however.
> 
> One limitation is as the manual said, it needs to be copied from the
> GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
> copy function for this, it uses plain old memcpy.
> Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
> is optimized for copying from USWC memory (Uncacheable Speculative
> Write Combining) to system memory. Using this may help speed up the
> process significantly, and VLC probably uses it.

Now the question is, how would it be possible to optimize the GPU to CPU
copy to get an overall performance gain? At least VLC seems able to get
better performance when using HW decoding, but I'm not sure it is
copying decoded data back to the CPU (indeed it may perform direct
rendering).
 
> The original primary goal of this code was however to be able to test
> and debug the hwaccels much easier, and not directly to provide a
> playback/transcoding feature, so such optimizations were not performed
> for brevity.
[...]

Thanks.
-- 
FFmpeg = Fanciful & Faithless Merciless Powerful EntanGlement

Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-12 Thread Hendrik Leppkes

On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini  wrote:
> Hi guys,
>
> I'm playing with DXVA2 hardware decoding on Windows, and these are my
> findings.
>
> DVXA2 decoding was enabled in avconv/ffmpeg through the commit:
>
> commit 35177ba77ff60a8b8839783f57e44bcc4214507a
> Author: Hendrik Leppkes 
> Date:   Tue Apr 22 15:22:53 2014 +0200
>
> avconv: add support for DXVA2 decoding
>
> Signed-off-by: Anton Khirnov 
>
> DXVA2 decoding is enabled when a dxva2api.h header is found in the
> path. From my understanding the header is provided by VLC:
> http://download.videolan.org/pub/contrib/dxva2api.h
>
> (I suppose the header was created in order to make compilation work
> with MinGW). When compiling with MinGW from mingw.org I had to change
> the GetShellWindow call in the line:
>
> hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, 
> GetShellWindow(),
>  D3DCREATE_SOFTWARE_VERTEXPROCESSING | 
> D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
>  &d3dpp, &ctx->d3d9device);
>
> to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix
> suggested here:
> http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

You should use mingw-w64, it provides both a dxva2api.h and can
compile the code without any modifications.
Using the "original" mingw32 is not recommended, and barely supported.

>
> Then I performed some tests with the command:
> ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -
>
> The -threads 1 option seems required or ffmpeg will fail with decoding
> errors.

Indeed, multi-threading with hwaccel is not something that should be
used, as it will break, although the API allows it for BS reasons.
There wouldn't be a performance improvement either way.

>
> In the ffmpeg(1) manual I can read this big warning:
>  Note that most acceleration methods are intended for playback and
>  will not be faster than software decoding on modern
>  CPUs. Additionally, ffmpeg will usually need to copy the decoded
>  frames from the GPU memory into the system memory, resulting in
>  further performance loss. This option is thus mainly useful for
>  testing.
>
> I tested with several HW combinations, and I always find that pure
> software decoding is always several time faster than DXVA2
> decoding. In some cases I got invalid output (same with VLC) which may
> be related to a problem in the graphics card or driver (a VIA VX900).

I don't think I've ever tested on such a chip. I didn't even know VIA
still made PC hardware.
Therefore, I have no idea how fast/slow or compatible it is.

>
> On the other hand when testing with VLC I noticed better performances
> (in general, a significantly reduced usage of the CPU, usually of an
> order of 3), so I have to conclude that at least VLC is able to make
> good use of DXVA2 hardware acceleration.
>
> I'm aware that the need to copy GPU data back to the CPU memory as
> required by ffmpeg defeats the advantage (if any) of hardware
> decoding, especially given that multithreading decoding cannot be
> adopted with DXVA2.
>
> My questions are:
>
> There are some cases when DXVA2 (or in general HW decoding) can be
> used effectively in ffmpeg? Can you tell if there is something which
> could be improved in the current ffmpeg_dxva2.c implementation? (My
> guess is that this code is somehow based on the VLC code).

It's not based on the VLC code, it's roughly based on code from my own
project that uses ffmpeg for DXVA2, but really, the workflow is going
to be pretty similar in any implementation either way, since the MS
API dictates that, more or less.

DXVA2 decoding can be faster than software decoding, depending on your hardware.

If you use a low-end Intel CPU, say a Pentium or i3 (Ivy Bridge or
Haswell), or a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2 decoding
on the GPU can potentially give you ~400 fps for 1080p, while the CPU
will likely not manage that.
On a high-end CPU, the software decoder can potentially exceed that, however.

One limitation is, as the manual says, that the decoded frames need to be
copied from GPU to system memory. ffmpeg_dxva2.c does not implement an
optimized copy function for this; it uses plain old memcpy.
Intel introduced a new instruction for this in SSE4.1, MOVNTDQA, which
is optimized for copying from USWC memory (Uncacheable Speculative
Write Combining) to system memory. Using this may help speed up the
process significantly, and VLC probably uses it.

The original primary goal of this code, however, was to make it much
easier to test and debug the hwaccels, not to directly provide a
playback/transcoding feature, so such optimizations were left out
for brevity.
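
For illustration, a minimal sketch of what a MOVNTDQA-based line copy could
look like with compiler intrinsics, assuming GCC or Clang on x86. The
function names copy_blocks_sse41 and copy_line_from_uswc are hypothetical
and this is not the code from any of the projects mentioned; it checks for
SSE4.1 at run time, requires 16-byte aligned pointers for the fast path,
and otherwise falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* Hypothetical kernel: copy whole 64-byte blocks using streaming loads
 * and aligned stores. Both pointers must be 16-byte aligned.
 * Returns the number of bytes copied. */
__attribute__((target("sse4.1")))
static size_t copy_blocks_sse41(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t x = 0;
    for (; x + 64 <= size; x += 64) {
        __m128i a = _mm_stream_load_si128((__m128i *)(src + x));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + x + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + x + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + x + 48));
        _mm_store_si128((__m128i *)(dst + x),      a);
        _mm_store_si128((__m128i *)(dst + x + 16), b);
        _mm_store_si128((__m128i *)(dst + x + 32), c);
        _mm_store_si128((__m128i *)(dst + x + 48), d);
    }
    return x;
}
#endif

/* Hypothetical wrapper: use the SSE4.1 kernel when the CPU supports it
 * and both pointers are aligned; copy whatever is left with memcpy. */
static void copy_line_from_uswc(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t done = 0;
    assert(dst && src);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse4.1") &&
        (((uintptr_t)dst | (uintptr_t)src) & 15) == 0)
        done = copy_blocks_sse41(dst, src, size);
#endif
    memcpy(dst + done, src + done, size - done);
}
```

On ordinary cached memory the streaming load behaves like a normal load, so
any speedup from this approach would only show up when src really is USWC
memory.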

- Hendrik

[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg

2015-05-12 Thread Stefano Sabatini

Hi guys,

I'm playing with DXVA2 hardware decoding on Windows, and these are my
findings.

DXVA2 decoding was enabled in avconv/ffmpeg through the commit:

commit 35177ba77ff60a8b8839783f57e44bcc4214507a
Author: Hendrik Leppkes 
Date:   Tue Apr 22 15:22:53 2014 +0200

avconv: add support for DXVA2 decoding

Signed-off-by: Anton Khirnov 

DXVA2 decoding is enabled when a dxva2api.h header is found in the
path. From my understanding the header is provided by VLC:
http://download.videolan.org/pub/contrib/dxva2api.h

(I suppose the header was created in order to make compilation work
with MinGW). When compiling with MinGW from mingw.org I had to change
the GetShellWindow call in the line:

hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
                             D3DCREATE_SOFTWARE_VERTEXPROCESSING | D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
                             &d3dpp, &ctx->d3d9device);

to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix
suggested here:
http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

Then I performed some tests with the command:
ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -

The -threads 1 option seems required or ffmpeg will fail with decoding
errors.

In the ffmpeg(1) manual I can read this big warning:
 Note that most acceleration methods are intended for playback and
 will not be faster than software decoding on modern
 CPUs. Additionally, ffmpeg will usually need to copy the decoded
 frames from the GPU memory into the system memory, resulting in
 further performance loss. This option is thus mainly useful for
 testing.

I tested with several HW combinations, and I consistently find that pure
software decoding is several times faster than DXVA2
decoding. In some cases I got invalid output (same with VLC) which may
be related to a problem in the graphics card or driver (a VIA VX900).

On the other hand, when testing with VLC I noticed better performance
(in general, significantly reduced CPU usage, usually by a factor of
about 3), so I have to conclude that at least VLC is able to make
good use of DXVA2 hardware acceleration.

I'm aware that the need to copy GPU data back to the CPU memory as
required by ffmpeg defeats the advantage (if any) of hardware
decoding, especially given that multithreaded decoding cannot be
used with DXVA2.

My questions are:

Are there cases where DXVA2 (or in general HW decoding) can be
used effectively in ffmpeg? Can you tell if there is something which
could be improved in the current ffmpeg_dxva2.c implementation? (My
guess is that this code is somehow based on the VLC code).

Would it make sense to integrate DXVA2 decoding in ffplay.c, assuming
it would be worth the effort, at least for testing/didactic purposes?

Related resources:
https://trac.ffmpeg.org/ticket/604
https://ffmpeg.org/pipermail/ffmpeg-user/2012-May/006600.html
http://forum.doom9.org/showthread.php?t=170793

TIA for any comments.
-- 
FFmpeg = Fostering and Fantastic Maxi Picky Erudite God