Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Tue, Jun 16, 2015 at 2:30 PM, Stefano Sabatini wrote:
> On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
>> Hi,
>>
>> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer :
> [...]
>> >> +#if HAVE_SSE2
>> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
>> >> + * load and storing data with the SSE>=2 instruction store.
>> >> + */
>> >> +#define COPY16(dstp, srcp, load, store) \
>> >> +    __asm__ volatile ( \
>> >> +        load " 0(%[src]), %%xmm1\n" \
>> >> +        store " %%xmm1,0(%[dst])\n" \
>> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
>> >> +
>> >> +#define COPY64(dstp, srcp, load, store) \
>> >> +    __asm__ volatile ( \
>> >> +        load " 0(%[src]), %%xmm1\n" \
>> >> +        load " 16(%[src]), %%xmm2\n" \
>> >> +        load " 32(%[src]), %%xmm3\n" \
>> >> +        load " 48(%[src]), %%xmm4\n" \
>> >> +        store " %%xmm1,0(%[dst])\n" \
>> >> +        store " %%xmm2, 16(%[dst])\n" \
>> >> +        store " %%xmm3, 32(%[dst])\n" \
>> >> +        store " %%xmm4, 48(%[dst])\n" \
>> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
>> >> +#endif
>> >> +
>> >> +#define COPY_LINE(dstp, srcp, size, load) \
>> >> +    const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f; \
>> >> +    unsigned x = unaligned; \
>> >> +    \
>> >> +    av_assert0(((intptr_t)dstp & 0x0f) == 0); \
>> >> +    \
>> >> +    __asm__ volatile ("mfence"); \
>> >> +    if (!unaligned) { \
>> >> +        for (; x+63 < size; x += 64) \
>> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqa"); \
>> >> +    } else { \
>> >> +        COPY16(dst, src, "movdqu", "movdqa"); \
>> >> +        for (; x+63 < size; x += 64) \
>> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqu"); \
>> >
>> > to use SSE registers in inline asm operands or clobber list you need
>> > to build with -msse (which probably is default on x86-64)
>> >
>> > files built with -msse will result in undefined behavior if anything
>> > in them is executed on a pre SSE cpu, as these allow gcc to put
>> > SSE instructions directly in the code where it likes
>> >
>> > The way out of this "design" is not to tell gcc that it passes
>> > a string with SSE code to the assembler
>> > that is not to use SSE registers in operands and not to put them
>> > on the clobber list unless gcc actually is in SSE mode and can use and
>> > need them there.
>> > see XMM_CLOBBERS*
>>
>> Well, from past experience, lying to gcc is generally not a good thing
>> either. There are multiple interesting ways it could fail from time to
>> time. :)
>>
>> Other approaches:
>> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
>>   "ssse3", "sse4.1", etc. This is the easiest way;
>> - Split into several separate files per target. Though, one would then
>>   argue that while we are at it why not just start moving to yasm.
>>
>> The former approach looks more appealing to me, considering there may
>> be an effort to migrate to yasm afterwards.
>
> I plan to port this patch to yasm. I'll ask for help on IRC since
> probably it will take too much time otherwise without any guidance.
> --

If you accept a few restrictions (like requiring aligned and padded
input/output) and maybe give it a more specific name so that people
won't try to replace generic memcpy with it, yasm'ing it would be
pretty simple.
If you want it to be generic like the C version, supporting unaligned
input and whatnot, the asm is going to get a bit more verbose.

I could probably whip up a basic implementation of the restricted
version, and the yasm experts can make suggestions on improvements
then.

- Hendrik
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
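For reference, the __attribute__((target(T))) approach listed in the
quoted text could look roughly like the sketch below. It is only an
illustration under a few assumptions: GCC >= 4.9 or a recent clang,
SSE2 intrinsics instead of the inline asm from the patch, and made-up
names (copy16_sse2(), copy_dispatch(), copy_selftest()) that are not
FFmpeg functions. On non-x86 targets it simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h> /* SSE2 intrinsics */
#define SKETCH_HAVE_X86 1

/* Only this function is compiled with SSE2 enabled; the rest of the
 * file keeps the baseline instruction set given on the command line. */
__attribute__((target("sse2")))
static void copy16_sse2(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t x;

    /* dst and src are both assumed to be 16-byte aligned here */
    for (x = 0; x + 16 <= size; x += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + x));
        _mm_store_si128((__m128i *)(dst + x), v);
    }
    memcpy(dst + x, src + x, size - x); /* tail of fewer than 16 bytes */
}
#else
#define SKETCH_HAVE_X86 0
#endif

/* Runtime check, so the SSE2 code is never reached on a pre-SSE2 CPU. */
static void copy_dispatch(uint8_t *dst, const uint8_t *src, size_t size)
{
#if SKETCH_HAVE_X86
    if (__builtin_cpu_supports("sse2")) {
        copy16_sse2(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}

/* Small usage example: returns 1 if the copy matches the source. */
static int copy_selftest(void)
{
    static _Alignas(16) uint8_t src[100];
    static _Alignas(16) uint8_t dst[100];
    size_t i;

    for (i = 0; i < sizeof(src); i++)
        src[i] = (uint8_t)(i * 7);
    copy_dispatch(dst, src, sizeof(src));
    return memcmp(dst, src, sizeof(src)) == 0;
}
```

The point of the attribute is that the compiler knows about the SSE
registers inside that one function only, so nothing has to be hidden
from it and no other code in the file is built for SSE2.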
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Tue, 16 Jun 2015 14:16:11 +0200 Gwenole Beauchesne wrote: > Hi, > > 2015-06-16 14:03 GMT+02:00 Michael Niedermayer : > > On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote: > >> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded: > >> > On Mon, 15 Jun 2015 17:55:35 +0200 > >> > Stefano Sabatini wrote: > >> > > >> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded: > >> > > [...] > >> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 > >> > > > 2001 > >> > > > From: Stefano Sabatini > >> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200 > >> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 > >> > > > optimizations > >> > > > > >> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent > >> > > > Aimar > >> > > > . > >> > > > > >> > > > TODO: bump minor, update APIchanges > >> > > > --- > >> > > > libavutil/mem.c | 9 + > >> > > > libavutil/mem.h | 14 > >> > > > libavutil/mem_internal.h | 26 +++ > >> > > > libavutil/x86/Makefile | 1 + > >> > > > libavutil/x86/mem.c | 85 > >> > > > > >> > > > 5 files changed, 135 insertions(+) > >> > > > create mode 100644 libavutil/mem_internal.h > >> > > > create mode 100644 libavutil/x86/mem.c > >> > > > > >> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c > >> > > > index da291fb..0e1eb01 100644 > >> > > > --- a/libavutil/mem.c > >> > > > +++ b/libavutil/mem.c > >> > > > @@ -42,6 +42,7 @@ > >> > > > #include "dynarray.h" > >> > > > #include "intreadwrite.h" > >> > > > #include "mem.h" > >> > > > +#include "mem_internal.h" > >> > > > > >> > > > #ifdef MALLOC_PREFIX > >> > > > > >> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int > >> > > > *size, size_t min_size) > >> > > > ff_fast_malloc(ptr, size, min_size, 0); > >> > > > } > >> > > > > >> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int > >> > > > cpu_flags) > >> > > > +{ > >> > > > +#if ARCH_X86 > >> > > > +ff_memcpynt_x86(dst, src, size, 
cpu_flags); > >> > > > +#else > >> > > > +memcpy(dst, src, size, cpu_flags); > >> > > > +#endif > >> > > > +} > >> > > > >> > > Alternatively, what about something like: > >> > > > >> > > av_memcpynt_fn av_memcpynt_get_fn(void); > >> > > > >> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for > >> > > a wrapper calling the right function. > >> > > >> > >> > I don't see much value in this, unless determining the right function > >> > causes too much overhead. > >> > >> I see two advantages, 1. no branch and function call when the function > >> is called, 2. the cpu_flags must not be passed around, so it's somehow > >> safer. > >> > >> I have no strong preference though, updated (untested patch) in > >> attachment. > >> -- > >> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle > > > >> mem.c |9 + > >> mem.h | 13 +++ > >> mem_internal.h | 26 +++ > >> x86/Makefile |1 > >> x86/mem.c | 98 > >> + > >> 5 files changed, 147 insertions(+) > >> f536b25834e0927b8cab5c996042aae697b8d773 > >> 0003-lavu-mem-add-av_memcpynt_get_fn.patch > >> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001 > >> From: Stefano Sabatini > >> Date: Mon, 15 Jun 2015 11:02:50 +0200 > >> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn() > >> > >> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar > >> . 
> >> > >> TODO: remove use of inline assembly, bump minor, update APIchanges > >> --- > >> libavutil/mem.c | 9 + > >> libavutil/mem.h | 13 +++ > >> libavutil/mem_internal.h | 26 + > >> libavutil/x86/Makefile | 1 + > >> libavutil/x86/mem.c | 98 > >> > >> 5 files changed, 147 insertions(+) > >> create mode 100644 libavutil/mem_internal.h > >> create mode 100644 libavutil/x86/mem.c > >> > >> diff --git a/libavutil/mem.c b/libavutil/mem.c > >> index da291fb..325bfc9 100644 > >> --- a/libavutil/mem.c > >> +++ b/libavutil/mem.c > >> @@ -42,6 +42,7 @@ > >> #include "dynarray.h" > >> #include "intreadwrite.h" > >> #include "mem.h" > >> +#include "mem_internal.h" > >> > >> #ifdef MALLOC_PREFIX > >> > >> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, > >> size_t min_size) > >> ff_fast_malloc(ptr, size, min_size, 0); > >> } > >> > >> +av_memcpynt_fn av_memcpynt_get_fn(void) > >> +{ > >> +#if ARCH_X86 > >> +return ff_memcpynt_get_fn_x86(); > >> +#else > >> +return memcpy; > >> +#endif > >> +} > >> diff --git a/libavutil/mem.h b/libavutil/mem.h > >> index 2a1e36d..d9f1b7a 100644 > >> --- a/libavutil/mem.h > >> +++ b/libavutil/mem.h > >> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
> Hi,
>
> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer :
[...]
> >> +#if HAVE_SSE2
> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
> >> + * load and storing data with the SSE>=2 instruction store.
> >> + */
> >> +#define COPY16(dstp, srcp, load, store) \
> >> +    __asm__ volatile ( \
> >> +        load " 0(%[src]), %%xmm1\n" \
> >> +        store " %%xmm1,0(%[dst])\n" \
> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
> >> +
> >> +#define COPY64(dstp, srcp, load, store) \
> >> +    __asm__ volatile ( \
> >> +        load " 0(%[src]), %%xmm1\n" \
> >> +        load " 16(%[src]), %%xmm2\n" \
> >> +        load " 32(%[src]), %%xmm3\n" \
> >> +        load " 48(%[src]), %%xmm4\n" \
> >> +        store " %%xmm1,0(%[dst])\n" \
> >> +        store " %%xmm2, 16(%[dst])\n" \
> >> +        store " %%xmm3, 32(%[dst])\n" \
> >> +        store " %%xmm4, 48(%[dst])\n" \
> >> +        : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
> >> +#endif
> >> +
> >> +#define COPY_LINE(dstp, srcp, size, load) \
> >> +    const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f; \
> >> +    unsigned x = unaligned; \
> >> +    \
> >> +    av_assert0(((intptr_t)dstp & 0x0f) == 0); \
> >> +    \
> >> +    __asm__ volatile ("mfence"); \
> >> +    if (!unaligned) { \
> >> +        for (; x+63 < size; x += 64) \
> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqa"); \
> >> +    } else { \
> >> +        COPY16(dst, src, "movdqu", "movdqa"); \
> >> +        for (; x+63 < size; x += 64) \
> >> +            COPY64(&dstp[x], &srcp[x], load, "movdqu"); \
> >
> > to use SSE registers in inline asm operands or clobber list you need
> > to build with -msse (which probably is default on x86-64)
> >
> > files built with -msse will result in undefined behavior if anything
> > in them is executed on a pre SSE cpu, as these allow gcc to put
> > SSE instructions directly in the code where it likes
> >
> > The way out of this "design" is not to tell gcc that it passes
> > a string with SSE code to the assembler
> > that is not to use SSE registers in operands and not to put them
> > on the clobber list unless gcc actually is in SSE mode and can use and
> > need them there.
> > see XMM_CLOBBERS*
>
> Well, from past experience, lying to gcc is generally not a good thing
> either. There are multiple interesting ways it could fail from time to
> time. :)
>
> Other approaches:
> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
>   "ssse3", "sse4.1", etc. This is the easiest way;
> - Split into several separate files per target. Though, one would then
>   argue that while we are at it why not just start moving to yasm.
>
> The former approach looks more appealing to me, considering there may
> be an effort to migrate to yasm afterwards.

I plan to port this patch to yasm. I'll ask for help on IRC, since
without any guidance it will probably take too much time.
-- 
FFmpeg = Friendly and Fancy Mind-dumbing Pacific Easy Generator
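The head-alignment arithmetic in the quoted COPY_LINE macro can be
checked in isolation. The sketch below is illustrative only; the
helper names head_unaligned() and end_of_64_loop() are made up for
this note and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes from addr up to the next 16-byte boundary, or 0 if
 * addr is already aligned. Same expression as "unaligned" in the
 * quoted macro: negating an unsigned value wraps modulo 2^N, so the
 * low four bits of -addr are (16 - addr % 16) % 16. */
static unsigned head_unaligned(uintptr_t addr)
{
    return (unsigned)((-addr) & 0x0f);
}

/* Offset at which the 64-byte loop of the quoted macro stops when it
 * starts at offset x; the bytes from there up to size still need a
 * separate tail copy. */
static size_t end_of_64_loop(size_t x, size_t size)
{
    while (x + 63 < size)
        x += 64;
    return x;
}
```

For example, a source address ending in 0x3 gives 13 head bytes, and
with size = 200 the 64-byte loop then covers offsets 13 to 140, which
leaves offsets 141 to 199 for the tail.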
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
Hi, 2015-06-16 14:03 GMT+02:00 Michael Niedermayer : > On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote: >> On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded: >> > On Mon, 15 Jun 2015 17:55:35 +0200 >> > Stefano Sabatini wrote: >> > >> > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded: >> > > [...] >> > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001 >> > > > From: Stefano Sabatini >> > > > Date: Mon, 15 Jun 2015 11:02:50 +0200 >> > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 >> > > > optimizations >> > > > >> > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent >> > > > Aimar >> > > > . >> > > > >> > > > TODO: bump minor, update APIchanges >> > > > --- >> > > > libavutil/mem.c | 9 + >> > > > libavutil/mem.h | 14 >> > > > libavutil/mem_internal.h | 26 +++ >> > > > libavutil/x86/Makefile | 1 + >> > > > libavutil/x86/mem.c | 85 >> > > > >> > > > 5 files changed, 135 insertions(+) >> > > > create mode 100644 libavutil/mem_internal.h >> > > > create mode 100644 libavutil/x86/mem.c >> > > > >> > > > diff --git a/libavutil/mem.c b/libavutil/mem.c >> > > > index da291fb..0e1eb01 100644 >> > > > --- a/libavutil/mem.c >> > > > +++ b/libavutil/mem.c >> > > > @@ -42,6 +42,7 @@ >> > > > #include "dynarray.h" >> > > > #include "intreadwrite.h" >> > > > #include "mem.h" >> > > > +#include "mem_internal.h" >> > > > >> > > > #ifdef MALLOC_PREFIX >> > > > >> > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int >> > > > *size, size_t min_size) >> > > > ff_fast_malloc(ptr, size, min_size, 0); >> > > > } >> > > > >> > > > +void av_memcpynt(void *dst, const void *src, size_t size, int >> > > > cpu_flags) >> > > > +{ >> > > > +#if ARCH_X86 >> > > > +ff_memcpynt_x86(dst, src, size, cpu_flags); >> > > > +#else >> > > > +memcpy(dst, src, size, cpu_flags); >> > > > +#endif >> > > > +} >> > > >> > > Alternatively, what about something like: >> > > >> > > 
av_memcpynt_fn av_memcpynt_get_fn(void); >> > > >> > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for >> > > a wrapper calling the right function. >> > >> >> > I don't see much value in this, unless determining the right function >> > causes too much overhead. >> >> I see two advantages, 1. no branch and function call when the function >> is called, 2. the cpu_flags must not be passed around, so it's somehow >> safer. >> >> I have no strong preference though, updated (untested patch) in >> attachment. >> -- >> FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle > >> mem.c |9 + >> mem.h | 13 +++ >> mem_internal.h | 26 +++ >> x86/Makefile |1 >> x86/mem.c | 98 >> + >> 5 files changed, 147 insertions(+) >> f536b25834e0927b8cab5c996042aae697b8d773 >> 0003-lavu-mem-add-av_memcpynt_get_fn.patch >> From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001 >> From: Stefano Sabatini >> Date: Mon, 15 Jun 2015 11:02:50 +0200 >> Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn() >> >> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar >> . 
>> >> TODO: remove use of inline assembly, bump minor, update APIchanges >> --- >> libavutil/mem.c | 9 + >> libavutil/mem.h | 13 +++ >> libavutil/mem_internal.h | 26 + >> libavutil/x86/Makefile | 1 + >> libavutil/x86/mem.c | 98 >> >> 5 files changed, 147 insertions(+) >> create mode 100644 libavutil/mem_internal.h >> create mode 100644 libavutil/x86/mem.c >> >> diff --git a/libavutil/mem.c b/libavutil/mem.c >> index da291fb..325bfc9 100644 >> --- a/libavutil/mem.c >> +++ b/libavutil/mem.c >> @@ -42,6 +42,7 @@ >> #include "dynarray.h" >> #include "intreadwrite.h" >> #include "mem.h" >> +#include "mem_internal.h" >> >> #ifdef MALLOC_PREFIX >> >> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, >> size_t min_size) >> ff_fast_malloc(ptr, size, min_size, 0); >> } >> >> +av_memcpynt_fn av_memcpynt_get_fn(void) >> +{ >> +#if ARCH_X86 >> +return ff_memcpynt_get_fn_x86(); >> +#else >> +return memcpy; >> +#endif >> +} >> diff --git a/libavutil/mem.h b/libavutil/mem.h >> index 2a1e36d..d9f1b7a 100644 >> --- a/libavutil/mem.h >> +++ b/libavutil/mem.h >> @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, >> size_t min_size); >> */ >> void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size); >> >> +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size); >> + >> +/** >> + * Return possibly optimized function to copy size bytes from from src >> + * to dst, using non-temporal copy. >> + * >> + * The returned function w
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
Hi, 2015-06-16 10:35 GMT+02:00 Stefano Sabatini : > On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded: >> On Mon, 15 Jun 2015 17:55:35 +0200 >> Stefano Sabatini wrote: >> >> > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded: >> > [...] >> > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001 >> > > From: Stefano Sabatini >> > > Date: Mon, 15 Jun 2015 11:02:50 +0200 >> > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 >> > > optimizations >> > > >> > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar >> > > . >> > > >> > > TODO: bump minor, update APIchanges >> > > --- >> > > libavutil/mem.c | 9 + >> > > libavutil/mem.h | 14 >> > > libavutil/mem_internal.h | 26 +++ >> > > libavutil/x86/Makefile | 1 + >> > > libavutil/x86/mem.c | 85 >> > > >> > > 5 files changed, 135 insertions(+) >> > > create mode 100644 libavutil/mem_internal.h >> > > create mode 100644 libavutil/x86/mem.c >> > > >> > > diff --git a/libavutil/mem.c b/libavutil/mem.c >> > > index da291fb..0e1eb01 100644 >> > > --- a/libavutil/mem.c >> > > +++ b/libavutil/mem.c >> > > @@ -42,6 +42,7 @@ >> > > #include "dynarray.h" >> > > #include "intreadwrite.h" >> > > #include "mem.h" >> > > +#include "mem_internal.h" >> > > >> > > #ifdef MALLOC_PREFIX >> > > >> > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, >> > > size_t min_size) >> > > ff_fast_malloc(ptr, size, min_size, 0); >> > > } >> > > >> > > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags) >> > > +{ >> > > +#if ARCH_X86 >> > > +ff_memcpynt_x86(dst, src, size, cpu_flags); >> > > +#else >> > > +memcpy(dst, src, size, cpu_flags); >> > > +#endif >> > > +} >> > >> > Alternatively, what about something like: >> > >> > av_memcpynt_fn av_memcpynt_get_fn(void); >> > >> > modeled after av_pixelutils_get_sad_fn()? This would skip the need for >> > a wrapper calling the right function. 
>> >
>> I don't see much value in this, unless determining the right function
>> causes too much overhead.
>
> I see two advantages, 1. no branch and function call when the function
> is called, 2. the cpu_flags need not be passed around, so it's somewhat
> safer.

Interesting approach. You could probably also use something similar to
the sws context, built up based on the surface size and other
characteristics (flags)?

Regards,
-- 
Gwenole Beauchesne
Intel Corporation SAS / 2 rue de Paris, 92196 Meudon Cedex, France
Registration Number (RCS): Nanterre B 302 456 199
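One possible reading of the context idea is sketched below: the copy
routine and the surface geometry are decided once, then reused for
every frame. Everything here is hypothetical (CopyContext,
copy_context_init(), copy_context_plane() and the flags argument are
invented for illustration), and plain memcpy() stands in for an
optimized routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t size);

/* Hypothetical context: filled in once at init time. */
typedef struct CopyContext {
    copy_fn copy;   /* routine selected at init time */
    size_t  width;  /* bytes to copy per row */
    size_t  height; /* number of rows */
} CopyContext;

static int copy_context_init(CopyContext *c, size_t width, size_t height,
                             int flags)
{
    if (!c || !width || !height)
        return -1;
    (void)flags;      /* a real implementation would pick a routine from these */
    c->copy   = memcpy; /* stand-in for an optimized copy */
    c->width  = width;
    c->height = height;
    return 0;
}

/* Copy one plane row by row, honouring separate strides. */
static void copy_context_plane(const CopyContext *c,
                               uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride)
{
    size_t y;

    for (y = 0; y < c->height; y++) {
        c->copy(dst, src, c->width);
        dst += dst_stride;
        src += src_stride;
    }
}
```

With this shape the per-call API carries neither cpu flags nor a
function pointer; callers only hold the context.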
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Tue, Jun 16, 2015 at 10:35:52AM +0200, Stefano Sabatini wrote: > On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded: > > On Mon, 15 Jun 2015 17:55:35 +0200 > > Stefano Sabatini wrote: > > > > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded: > > > [...] > > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001 > > > > From: Stefano Sabatini > > > > Date: Mon, 15 Jun 2015 11:02:50 +0200 > > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 > > > > optimizations > > > > > > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent > > > > Aimar > > > > . > > > > > > > > TODO: bump minor, update APIchanges > > > > --- > > > > libavutil/mem.c | 9 + > > > > libavutil/mem.h | 14 > > > > libavutil/mem_internal.h | 26 +++ > > > > libavutil/x86/Makefile | 1 + > > > > libavutil/x86/mem.c | 85 > > > > > > > > 5 files changed, 135 insertions(+) > > > > create mode 100644 libavutil/mem_internal.h > > > > create mode 100644 libavutil/x86/mem.c > > > > > > > > diff --git a/libavutil/mem.c b/libavutil/mem.c > > > > index da291fb..0e1eb01 100644 > > > > --- a/libavutil/mem.c > > > > +++ b/libavutil/mem.c > > > > @@ -42,6 +42,7 @@ > > > > #include "dynarray.h" > > > > #include "intreadwrite.h" > > > > #include "mem.h" > > > > +#include "mem_internal.h" > > > > > > > > #ifdef MALLOC_PREFIX > > > > > > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, > > > > size_t min_size) > > > > ff_fast_malloc(ptr, size, min_size, 0); > > > > } > > > > > > > > +void av_memcpynt(void *dst, const void *src, size_t size, int > > > > cpu_flags) > > > > +{ > > > > +#if ARCH_X86 > > > > +ff_memcpynt_x86(dst, src, size, cpu_flags); > > > > +#else > > > > +memcpy(dst, src, size, cpu_flags); > > > > +#endif > > > > +} > > > > > > Alternatively, what about something like: > > > > > > av_memcpynt_fn av_memcpynt_get_fn(void); > > > > > > modeled after av_pixelutils_get_sad_fn()? 
This would skip the need for > > > a wrapper calling the right function. > > > > > I don't see much value in this, unless determining the right function > > causes too much overhead. > > I see two advantages, 1. no branch and function call when the function > is called, 2. the cpu_flags must not be passed around, so it's somehow > safer. > > I have no strong preference though, updated (untested patch) in > attachment. > -- > FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle > mem.c |9 + > mem.h | 13 +++ > mem_internal.h | 26 +++ > x86/Makefile |1 > x86/mem.c | 98 > + > 5 files changed, 147 insertions(+) > f536b25834e0927b8cab5c996042aae697b8d773 > 0003-lavu-mem-add-av_memcpynt_get_fn.patch > From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001 > From: Stefano Sabatini > Date: Mon, 15 Jun 2015 11:02:50 +0200 > Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn() > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar > . > > TODO: remove use of inline assembly, bump minor, update APIchanges > --- > libavutil/mem.c | 9 + > libavutil/mem.h | 13 +++ > libavutil/mem_internal.h | 26 + > libavutil/x86/Makefile | 1 + > libavutil/x86/mem.c | 98 > > 5 files changed, 147 insertions(+) > create mode 100644 libavutil/mem_internal.h > create mode 100644 libavutil/x86/mem.c > > diff --git a/libavutil/mem.c b/libavutil/mem.c > index da291fb..325bfc9 100644 > --- a/libavutil/mem.c > +++ b/libavutil/mem.c > @@ -42,6 +42,7 @@ > #include "dynarray.h" > #include "intreadwrite.h" > #include "mem.h" > +#include "mem_internal.h" > > #ifdef MALLOC_PREFIX > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, > size_t min_size) > ff_fast_malloc(ptr, size, min_size, 0); > } > > +av_memcpynt_fn av_memcpynt_get_fn(void) > +{ > +#if ARCH_X86 > +return ff_memcpynt_get_fn_x86(); > +#else > +return memcpy; > +#endif > +} > diff --git a/libavutil/mem.h b/libavutil/mem.h > index 2a1e36d..d9f1b7a 100644 > --- 
a/libavutil/mem.h > +++ b/libavutil/mem.h > @@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, > size_t min_size); > */ > void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size); > > +typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size); > + > +/** > + * Return possibly optimized function to copy size bytes from from src > + * to dst, using non-temporal copy. > + * > + * The returned function works as memcpy, but adopts non-temporal > + * instructios when available. This can lead to better performances > + * when transferring data from source to destination is e
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Tuesday 2015-06-16 10:20:31 +0200, wm4 encoded: > On Mon, 15 Jun 2015 17:55:35 +0200 > Stefano Sabatini wrote: > > > On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded: > > [...] > > > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001 > > > From: Stefano Sabatini > > > Date: Mon, 15 Jun 2015 11:02:50 +0200 > > > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 > > > optimizations > > > > > > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar > > > . > > > > > > TODO: bump minor, update APIchanges > > > --- > > > libavutil/mem.c | 9 + > > > libavutil/mem.h | 14 > > > libavutil/mem_internal.h | 26 +++ > > > libavutil/x86/Makefile | 1 + > > > libavutil/x86/mem.c | 85 > > > > > > 5 files changed, 135 insertions(+) > > > create mode 100644 libavutil/mem_internal.h > > > create mode 100644 libavutil/x86/mem.c > > > > > > diff --git a/libavutil/mem.c b/libavutil/mem.c > > > index da291fb..0e1eb01 100644 > > > --- a/libavutil/mem.c > > > +++ b/libavutil/mem.c > > > @@ -42,6 +42,7 @@ > > > #include "dynarray.h" > > > #include "intreadwrite.h" > > > #include "mem.h" > > > +#include "mem_internal.h" > > > > > > #ifdef MALLOC_PREFIX > > > > > > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, > > > size_t min_size) > > > ff_fast_malloc(ptr, size, min_size, 0); > > > } > > > > > > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags) > > > +{ > > > +#if ARCH_X86 > > > +ff_memcpynt_x86(dst, src, size, cpu_flags); > > > +#else > > > +memcpy(dst, src, size, cpu_flags); > > > +#endif > > > +} > > > > Alternatively, what about something like: > > > > av_memcpynt_fn av_memcpynt_get_fn(void); > > > > modeled after av_pixelutils_get_sad_fn()? This would skip the need for > > a wrapper calling the right function. > > I don't see much value in this, unless determining the right function > causes too much overhead. I see two advantages, 1. 
no branch and function call when the function is called, 2. the cpu_flags
need not be passed around, so it's somewhat safer.

I have no strong preference though, updated (untested) patch in
attachment.
-- 
FFmpeg = Fierce and Forgiving Merciless Powered Extroverse Gargoyle

From c005ff5405dd48e6b0fed24ed94947f69bfe2783 Mon Sep 17 00:00:00 2001
From: Stefano Sabatini
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt_get_fn()

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar.

TODO: remove use of inline assembly, bump minor, update APIchanges
---
 libavutil/mem.c          |  9 +
 libavutil/mem.h          | 13 +++
 libavutil/mem_internal.h | 26 +
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c      | 98
 5 files changed, 147 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..325bfc9 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"

 #ifdef MALLOC_PREFIX

@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
     ff_fast_malloc(ptr, size, min_size, 0);
 }

+av_memcpynt_fn av_memcpynt_get_fn(void)
+{
+#if ARCH_X86
+    return ff_memcpynt_get_fn_x86();
+#else
+    return memcpy;
+#endif
+}
diff --git a/libavutil/mem.h b/libavutil/mem.h
index 2a1e36d..d9f1b7a 100644
--- a/libavutil/mem.h
+++ b/libavutil/mem.h
@@ -382,6 +382,19 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
  */
 void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);

+typedef void* (*av_memcpynt_fn)(void *dst, const void *src, size_t size);
+
+/**
+ * Return possibly optimized function to copy size bytes from src
+ * to dst, using non-temporal copy.
+ *
+ * The returned function works as memcpy, but adopts non-temporal
+ * instructions when available. This can lead to better performance
+ * when transferring data from source to destination is expensive, for
+ * example when reading from GPU memory.
+ */
+av_memcpynt_fn av_memcpynt_get_fn(void);
+
 /**
  * @}
  */
diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
new file mode 100644
index 000..de61cba
--- /dev/null
+++ b/libavutil/mem_internal.h
@@ -0,0 +1,26 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Mon, 15 Jun 2015 17:55:35 +0200
Stefano Sabatini wrote:

> On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
> [...]
> > From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> > From: Stefano Sabatini
> > Date: Mon, 15 Jun 2015 11:02:50 +0200
> > Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
> >
> > Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar.
> >
> > TODO: bump minor, update APIchanges
> > ---
> >  libavutil/mem.c          |  9 +
> >  libavutil/mem.h          | 14
> >  libavutil/mem_internal.h | 26 +++
> >  libavutil/x86/Makefile   |  1 +
> >  libavutil/x86/mem.c      | 85
> >  5 files changed, 135 insertions(+)
> >  create mode 100644 libavutil/mem_internal.h
> >  create mode 100644 libavutil/x86/mem.c
> >
> > diff --git a/libavutil/mem.c b/libavutil/mem.c
> > index da291fb..0e1eb01 100644
> > --- a/libavutil/mem.c
> > +++ b/libavutil/mem.c
> > @@ -42,6 +42,7 @@
> >  #include "dynarray.h"
> >  #include "intreadwrite.h"
> >  #include "mem.h"
> > +#include "mem_internal.h"
> >
> >  #ifdef MALLOC_PREFIX
> >
> > @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
> >      ff_fast_malloc(ptr, size, min_size, 0);
> >  }
> >
> > +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
> > +{
> > +#if ARCH_X86
> > +    ff_memcpynt_x86(dst, src, size, cpu_flags);
> > +#else
> > +    memcpy(dst, src, size);
> > +#endif
> > +}
>
> Alternatively, what about something like:
>
> av_memcpynt_fn av_memcpynt_get_fn(void);
>
> modeled after av_pixelutils_get_sad_fn()? This would skip the need for
> a wrapper calling the right function.

I don't see much value in this, unless determining the right function
causes too much overhead.
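To make the overhead in question concrete, here is a minimal sketch
of the wrapper style, where the choice is made on every call. The
flag bit SKETCH_CPU_FLAG_FAST and the functions copy_fast_stub() and
copy_with_flags() are invented for this illustration, and the "fast"
path is just memcpy() plus a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CPU_FLAG_FAST 0x1 /* hypothetical flag bit, not an FFmpeg constant */

static int fast_calls; /* counts how often the stand-in fast path ran */

static void copy_fast_stub(void *dst, const void *src, size_t size)
{
    fast_calls++;
    memcpy(dst, src, size);
}

/* Wrapper style: one branch and one extra call per copy. */
static void copy_with_flags(void *dst, const void *src, size_t size,
                            int cpu_flags)
{
    if (cpu_flags & SKETCH_CPU_FLAG_FAST)
        copy_fast_stub(dst, src, size);
    else
        memcpy(dst, src, size); /* generic fallback; the flags are not forwarded */
}
```

For large per-row or per-plane copies the extra branch is likely
negligible next to the copy itself, which matches the doubt expressed
above.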
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Monday 2015-06-15 11:56:13 +0200, Stefano Sabatini encoded:
[...]
> From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
> From: Stefano Sabatini
> Date: Mon, 15 Jun 2015 11:02:50 +0200
> Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations
>
> Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar.
>
> TODO: bump minor, update APIchanges
> ---
>  libavutil/mem.c          |  9 +
>  libavutil/mem.h          | 14
>  libavutil/mem_internal.h | 26 +++
>  libavutil/x86/Makefile   |  1 +
>  libavutil/x86/mem.c      | 85
>  5 files changed, 135 insertions(+)
>  create mode 100644 libavutil/mem_internal.h
>  create mode 100644 libavutil/x86/mem.c
>
> diff --git a/libavutil/mem.c b/libavutil/mem.c
> index da291fb..0e1eb01 100644
> --- a/libavutil/mem.c
> +++ b/libavutil/mem.c
> @@ -42,6 +42,7 @@
>  #include "dynarray.h"
>  #include "intreadwrite.h"
>  #include "mem.h"
> +#include "mem_internal.h"
>
>  #ifdef MALLOC_PREFIX
>
> @@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
>      ff_fast_malloc(ptr, size, min_size, 0);
>  }
>
> +void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
> +{
> +#if ARCH_X86
> +    ff_memcpynt_x86(dst, src, size, cpu_flags);
> +#else
> +    memcpy(dst, src, size);
> +#endif
> +}

Alternatively, what about something like:

av_memcpynt_fn av_memcpynt_get_fn(void);

modeled after av_pixelutils_get_sad_fn()? This would skip the need for
a wrapper calling the right function.
-- 
FFmpeg = Frightening and Fantastic Murdering Portentous Erratic Guru
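A minimal sketch of the getter style proposed above: the selection is
done once and the caller keeps the returned pointer. The names
copy_fn, copy_fast_stub() and copy_get_fn(), as well as the
have_fast_path argument, are hypothetical; a real getter would query
the CPU itself instead of taking a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as memcpy(), so memcpy itself can serve as the fallback. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t size);

/* Stand-in for an optimized routine. */
static void *copy_fast_stub(void *dst, const void *src, size_t size)
{
    return memcpy(dst, src, size);
}

/* Selection happens here, once, rather than inside every copy call. */
static copy_fn copy_get_fn(int have_fast_path)
{
    return have_fast_path ? copy_fast_stub : memcpy;
}
```

A caller would typically fetch the pointer at init time, for example
copy_fn copy = copy_get_fn(1);, and then call copy(dst, src, size)
for each row without touching any cpu flags.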
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Saturday 2015-06-13 14:20:07 +0200, Hendrik Leppkes encoded:
> On Thu, Jun 11, 2015 at 8:54 PM, wm4 wrote:
> > On Thu, 11 Jun 2015 17:24:45 +0200
> > Stefano Sabatini wrote:
> >
> >> Next step would be the use of YASM, but I only want to test if the
> >> general approach is fine (and if the API is not too specific). Also if
> >> someone wants to step up and port it to YASM I'm all for it, since
> >> ASM/YASM is far from being my area of expertise.
> >
> > Personally, I'd probably just
> > 1. export the GPU memcpy function, and
> > 2. export a function to copy AVFrames using this function
>
> I concur. A basic optimized memcpy with specific constraints (i.e.
> requires aligned input/output, always copies in 16-byte chunks, so
> in/out buffers need to be padded appropriately), to keep the required
> ASM code simple.
> These constraints are generally always fulfilled if you have a GPU
> frame on the input, since they will have appropriate strides (and if
> in question, we control allocation of the GPU surfaces as well), and
> we control the output memory buffer anyway.
>
> On top of that a convenience function that deals with pixel formats,
> strides, planes, and whatnot, and then uses this function.
> A generic C version of the basic copy function shouldn't be needed, we
> could just use memcpy for that.. or a tiny wrapper that calls memcpy,
> anyway.

This is my first attempt. The added function is named av_memcpynt(); it
uses inline assembly, which should be replaced by yasm once I or
someone else figures out how to do it.

An av_image_copynt_plane() function can be built on top of that (but in
this case it would be better to inline the av_memcpynt() function).

BTW I dropped the requirement of 16-byte alignment on the size variable,
which is required by the VLC code but looks unnecessary to me.
--
FFmpeg = Furious and Foolish Marvellous Pacific Egregious Ghost

From 3a75ef1e86360cd6f30b8e550307404d0d1c1dba Mon Sep 17 00:00:00 2001
From: Stefano Sabatini
Date: Mon, 15 Jun 2015 11:02:50 +0200
Subject: [PATCH] lavu/mem: add av_memcpynt() function with x86 optimizations

Assembly based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar.

TODO: bump minor, update APIchanges
---
 libavutil/mem.c          |  9 +
 libavutil/mem.h          | 14
 libavutil/mem_internal.h | 26 +++
 libavutil/x86/Makefile   |  1 +
 libavutil/x86/mem.c      | 85
 5 files changed, 135 insertions(+)
 create mode 100644 libavutil/mem_internal.h
 create mode 100644 libavutil/x86/mem.c

diff --git a/libavutil/mem.c b/libavutil/mem.c
index da291fb..0e1eb01 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -42,6 +42,7 @@
 #include "dynarray.h"
 #include "intreadwrite.h"
 #include "mem.h"
+#include "mem_internal.h"

 #ifdef MALLOC_PREFIX

@@ -515,3 +516,11 @@ void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size)
     ff_fast_malloc(ptr, size, min_size, 0);
 }

+void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags)
+{
+#if ARCH_X86
+    ff_memcpynt_x86(dst, src, size, cpu_flags);
+#else
+    memcpy(dst, src, size);
+#endif
+}
diff --git a/libavutil/mem.h b/libavutil/mem.h
index 2a1e36d..bbad313 100644
--- a/libavutil/mem.h
+++ b/libavutil/mem.h
@@ -383,6 +383,20 @@ void *av_fast_realloc(void *ptr, unsigned int *size, size_t min_size);
 void av_fast_malloc(void *ptr, unsigned int *size, size_t min_size);

 /**
+ * Copy size bytes from src to dst, using non-temporal copy
+ * functions when available.
+ *
+ * This function works as memcpy, but adopts non-temporal instructions
+ * when available. This can lead to better performance when
+ * transferring data from source to destination is expensive, for
+ * example when reading from GPU memory.
+ *
+ * @param dst destination memory pointer, must be aligned to 16 bytes
+ * @param cpu_flags as returned by av_get_cpu_flags()
+ */
+void av_memcpynt(void *dst, const void *src, size_t size, int cpu_flags);
+
+/**
  * @}
  */

diff --git a/libavutil/mem_internal.h b/libavutil/mem_internal.h
new file mode 100644
index 000..371be31
--- /dev/null
+++ b/libavutil/mem_internal.h
@@ -0,0 +1,26 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Thu, Jun 11, 2015 at 8:54 PM, wm4 wrote:
> On Thu, 11 Jun 2015 17:24:45 +0200
> Stefano Sabatini wrote:
>
>> Next step would be the use of YASM, but I only want to test if the
>> general approach is fine (and if the API is not too specific). Also if
>> someone wants to step up and port it to YASM I'm all for it, since
>> ASM/YASM is far from being my area of expertise.
>
> Personally, I'd probably just
> 1. export the GPU memcpy function, and
> 2. export a function to copy AVFrames using this function

I concur. A basic optimized memcpy with specific constraints (i.e.
requires aligned input/output, always copies in 16-byte chunks, so
in/out buffers need to be padded appropriately), to keep the required
ASM code simple.
These constraints are generally always fulfilled if you have a GPU
frame on the input, since they will have appropriate strides (and if
in question, we control allocation of the GPU surfaces as well), and
we control the output memory buffer anyway.

On top of that a convenience function that deals with pixel formats,
strides, planes, and whatnot, and then uses this function.
A generic C version of the basic copy function shouldn't be needed, we
could just use memcpy for that.. or a tiny wrapper that calls memcpy,
anyway.

- Hendrik
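The constraints described above (aligned pointers, whole 16-byte chunks, padded buffers) can be sketched in portable C as follows. This is only an illustration under stated assumptions: padded_size() and copy_chunks16() are hypothetical names, and the per-chunk memcpy merely stands in for the SIMD load/store pair an assembly version would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16

/* Round size up to the next multiple of CHUNK. This is how many bytes
 * both buffers must really hold, i.e. the padding the caller provides. */
static size_t padded_size(size_t size)
{
    return (size + CHUNK - 1) & ~(size_t)(CHUNK - 1);
}

/* Copy size bytes in whole 16-byte chunks.
 * Assumes both pointers are 16-byte aligned and both buffers are padded
 * to padded_size(size); up to 15 bytes past size are read and written. */
static void copy_chunks16(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t n = padded_size(size);

    assert(((uintptr_t)dst & (CHUNK - 1)) == 0);
    assert(((uintptr_t)src & (CHUNK - 1)) == 0);

    for (size_t x = 0; x < n; x += CHUNK)
        memcpy(dst + x, src + x, CHUNK); /* stand-in for one SIMD load + store */
}
```

Because the loop never handles a partial chunk, there is no head or tail special case, which is what keeps the corresponding assembly simple.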
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Thu, 11 Jun 2015 17:24:45 +0200
Stefano Sabatini wrote:

> Next step would be the use of YASM, but I only want to test if the
> general approach is fine (and if the API is not too specific). Also if
> someone wants to step up and port it to YASM I'm all for it, since
> ASM/YASM is far from being my area of expertise.

Personally, I'd probably just
1. export the GPU memcpy function, and
2. export a function to copy AVFrames using this function
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Friday 2015-05-29 09:47:58 -0700, Timothy Gu encoded: > On Fri, May 29, 2015 at 03:49:22PM +0200, Stefano Sabatini wrote: [...] > > OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o \ > > diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c > > new file mode 100644 > > index 000..8b3ed0f > > --- /dev/null > > +++ b/libavutil/x86/imgutils.c > > @@ -0,0 +1,95 @@ > > +/* > > + * This file is part of FFmpeg. > > + * > > + * FFmpeg is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU Lesser General Public > > + * License as published by the Free Software Foundation; either > > + * version 2.1 of the License, or (at your option) any later version. > > + * > > + * FFmpeg is distributed in the hope that it will be useful, > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + * Lesser General Public License for more details. > > + * > > + * You should have received a copy of the GNU Lesser General Public > > + * License along with FFmpeg; if not, write to the Free Software > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > > 02110-1301 USA > > + */ > > + > > +#include > > +#include "config.h" > > +#include "libavutil/avassert.h" > > +#include "libavutil/imgutils.h" > > +#include "libavutil/imgutils_internal.h" > > + > > +#if HAVE_SSE2 > > +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 > > instruction > > + * load and storing data with the SSE>=2 instruction store. 
> > + */ > > +#define COPY16(dstp, srcp, load, store) \ > > +__asm__ volatile ( \ > > +load " 0(%[src]), %%xmm1\n"\ > > +store " %%xmm1,0(%[dst])\n" \ > > +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1") > > + > > +#define COPY64(dstp, srcp, load, store) \ > > +__asm__ volatile ( \ > > +load " 0(%[src]), %%xmm1\n"\ > > +load " 16(%[src]), %%xmm2\n"\ > > +load " 32(%[src]), %%xmm3\n"\ > > +load " 48(%[src]), %%xmm4\n"\ > > +store " %%xmm1,0(%[dst])\n" \ > > +store " %%xmm2, 16(%[dst])\n" \ > > +store " %%xmm3, 32(%[dst])\n" \ > > +store " %%xmm4, 48(%[dst])\n" \ > > +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", > > "xmm3", "xmm4") > > +#endif > > + > > +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize, > > + const uint8_t *src, size_t src_linesize, > > + unsigned bytewidth, unsigned height, > > + int cpu_flags) > > +{ > > +#if !HAVE_SSSE3 > > Are any SSSE3 instructions used? No. I re-checked, MOVDQA/MOVDQU were introduced in SSE2, MOVNTDQA in SSE4. > > +return av_image_copy_plane(dst, dst_linesize, src, src_linesize, > > bytewidth, height); > > +#endif > > + > > +av_assert0(((intptr_t)dst & 0x0f) == 0 && (dst_linesize & 0x0f) == 0); > > + > > +__asm__ volatile ("mfence"); > > + > > +for (unsigned y = 0; y < height; y++) { > > +const unsigned unaligned = (-(uintptr_t)src) & 0x0f; > > +unsigned x = unaligned; > > + > > > +#if HAVE_SSE42 > > +if (cpu_flags & AV_CPU_FLAG_SSE4) { > > movntdqa is an SSE4.1 instruction, so this should work better: > > if (INLINE_SSE4(cpu_flags)) > > That checks both HAVE_SSE4_INLINE and cpu_flags for AV_CPU_FLAG_SSE4. > > (But then like others have said new inline asm code shouldn't be added in the > first place) Next step would be the use of YASM, but I only want to test if the general approach is fine (and if the API is not too specific). Also if someone wants to step up and port it to YASM I'm all for it, since ASM/YASM is far from being my area of expertise. 
-- FFmpeg = Fiendish Fabulous Most Pure Evangelical God >From ec96aee1930247248a5e438171c120ea3f5dbbea Mon Sep 17 00:00:00 2001 From: Stefano Sabatini Date: Fri, 15 May 2015 18:58:17 +0200 Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function. This function allows support to optimized GPU to CPU. Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar . TODO: fix integration with the build system, update APIchanges and bump minor once ready --- libavutil/imgutils.c | 13 + libavutil/imgutils.h | 18 ++ libavutil/imgutils_internal.h | 29 ++ libavutil/x86/Makefile| 1 + libavutil/x86/imgutils.c | 126 ++ 5 files changed, 187 insertions(+) create mode 100644 libavutil/imgutils_internal.h create mode 100644 libavutil/x86/imgutils.c diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c index ef0e671..59a0054 100644 --- a/libavutil/imgutils.c +++ b/libavutil/imgutils.c @@ -30,6 +30,7 @@ #include "mathematics.h" #include "pixdesc.h" #include "r
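On the `(-(uintptr_t)src) & 0x0f` expression in the quoted patch: negating the address modulo 16 gives the number of bytes up to the next 16-byte boundary, or 0 if the pointer is already aligned. The sketch below shows that arithmetic and one possible way to split a line into an unaligned head, a body made of 64-byte blocks, and a tail. The helper names and the exact split are assumptions for illustration, not the layout used by the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes from p up to the next 16-byte boundary (0 if already aligned). */
static unsigned bytes_to_align16(const void *p)
{
    return (unsigned)((-(uintptr_t)p) & 0x0f);
}

/* Hypothetical split of one line of `width` bytes:
 *   head: bytes before the first 16-byte-aligned source address
 *   body: largest multiple of 64 bytes that fits after the head
 *   tail: whatever is left, to be copied by a slower path */
struct line_split {
    size_t head, body, tail;
};

static struct line_split split_line(const void *src, size_t width)
{
    struct line_split s;

    s.head = bytes_to_align16(src);
    if (s.head > width)
        s.head = width;                      /* very short line */
    s.body = (width - s.head) & ~(size_t)63; /* whole 64-byte blocks */
    s.tail = width - s.head - s.body;
    return s;
}
```

For a source address ending in 0x3 the head is 13 bytes, since 3 + 13 = 16.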
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Fri, May 29, 2015 at 03:49:22PM +0200, Stefano Sabatini wrote: > @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size, > > return size; > } > + > +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, > +const uint8_t *src, size_t src_linesize, > +unsigned bytewidth, unsigned height, > +int cpu_flags) > +{ > +#if !HAVE_SSSE3 > +av_unused(cpu_flags); av_used has a different definition than VLC_UNUSED. Just use a (void) cast. > +av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, > height); > +#else > +ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, > bytewidth, height, cpu_flags); > +#endif > +} > diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h > index 23282a3..184e1e7 100644 > --- a/libavutil/imgutils.h > +++ b/libavutil/imgutils.h > @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t *dst, int > dst_linesize, > int bytewidth, int height); > > /** > + * Copy image plane from src to dst, similar to av_image_copy_plane(). > + * src must be an USWC buffer. > + * It performs optimized copy from "Uncacheable Speculative Write > + * Combining" memory as used by some video surface. > + * It is really efficient only when SSE4.1 is available. > + * > + * In case the target CPU does not support USWC caching this function > + * will be equivalent to av_image_copy_plane(). > + * > + * @param cpu_flags as returned by av_get_cpu_flags() > + * @see av_image_copy_plane() > + */ > +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, > + const uint8_t *src, size_t src_linesize, > + unsigned bytewidth, unsigned height, > + int cpu_flags); > + > +/** > * Copy image in src_data to dst_data. 
> * > * @param dst_linesizes linesizes for the image in dst_data > diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h > new file mode 100644 > index 000..9576afe > --- /dev/null > +++ b/libavutil/imgutils_internal.h > @@ -0,0 +1,29 @@ > +/* > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 > USA > + */ > + > +#ifndef AVUTIL_IMGUTILS_INTERNAL_H > +#define AVUTIL_IMGUTILS_INTERNAL_H > + > +#include "imgutils.h" > + > +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize, > +const uint8_t *src, size_t src_linesize, > +unsigned bytewidth, unsigned height, > +int cpu_flags); > + > +#endif /* AVUTIL_IMGUTILS_INTERNAL_H */ > diff --git a/libavutil/x86/Makefile b/libavutil/x86/Makefile > index eb70a62..a719c00 100644 > --- a/libavutil/x86/Makefile > +++ b/libavutil/x86/Makefile > @@ -1,5 +1,6 @@ > OBJS += x86/cpu.o \ > x86/float_dsp_init.o\ > +x86/imgutils.o \ > x86/lls_init.o \ > > OBJS-$(CONFIG_PIXELUTILS) += x86/pixelutils_init.o \ > diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c > new file mode 100644 > index 000..8b3ed0f > --- /dev/null > +++ b/libavutil/x86/imgutils.c > @@ -0,0 +1,95 @@ > +/* > + * This file is part of FFmpeg. 
> + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51
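On the av_unused remark above: in libavutil, av_unused is an attribute marker placed on a declaration (it expands to `__attribute__((unused))` on GNU-compatible compilers), not a function-like macro, so it is not meant to be written as a statement taking the variable as an argument. The sketch below contrasts the two usual ways of silencing an unused-parameter warning. MY_UNUSED and both function names are hypothetical, chosen so the example compiles without FFmpeg headers.

```c
/* Attribute-style marker in the spirit of av_unused (hypothetical name). */
#if defined(__GNUC__)
#    define MY_UNUSED __attribute__((unused))
#else
#    define MY_UNUSED
#endif

/* Option 1: annotate the parameter declaration itself. */
static int ignore_flags_attr(int value, MY_UNUSED int cpu_flags)
{
    return value;
}

/* Option 2: a plain (void) cast inside the body, as suggested in the
 * review. It needs no macro and works with any C compiler. */
static int ignore_flags_cast(int value, int cpu_flags)
{
    (void)cpu_flags;
    return value;
}
```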
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Thursday 2015-05-28 18:02:34 -0300, James Almer encoded: > On 28/05/15 2:39 PM, Stefano Sabatini wrote: > > From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001 > > From: Stefano Sabatini > > Date: Fri, 15 May 2015 18:58:17 +0200 > > Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() > > function. > > > > This function allows support to optimized GPU to CPU. > > > > Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar > > . > > > > TODO: fix integration with the build system, bump micro > > > > Signed-off-by: Stefano Sabatini > > --- > > libavutil/imgutils.c | 14 ++ > > libavutil/imgutils.h | 18 +++ > > libavutil/imgutils_internal.h | 29 +++ > > libavutil/x86/Makefile| 1 + > > libavutil/x86/imgutils.c | 109 > > ++ > > 5 files changed, 171 insertions(+) > > create mode 100644 libavutil/imgutils_internal.h > > create mode 100644 libavutil/x86/imgutils.c > > > > diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c > > index ef0e671..e538c75 100644 > > --- a/libavutil/imgutils.c > > +++ b/libavutil/imgutils.c > > @@ -30,6 +30,7 @@ > > #include "mathematics.h" > > #include "pixdesc.h" > > #include "rational.h" > > +#include "imgutils_internal.h" > > > > void av_image_fill_max_pixsteps(int max_pixsteps[4], int > > max_pixstep_comps[4], > > const AVPixFmtDescriptor *pixdesc) > > @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size, > > > > return size; > > } > > + > > +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, > > + const uint8_t *src, size_t src_linesize, > > + unsigned bytewidth, unsigned height, > > + unsigned cpu_flags) > > +{ > > +#ifndef HAVE_SSSE3 > > All HAVE_ are always defined to either 0 or 1. Fixed. > Nonetheless, this kind of check does not belong outside of arch folders. You > should > check for ARCH_X86 to call functions in the x86/ folder. See lavc/lavfi for > examples. I see, but I think this use case is pretty different. 
We don't have a context where to set a function pointer, and I don't want to add a new context and API for such things (but I'm open to suggestions). A probably slightly ugly alternative could be to define a function such as: get_ff_image_copy_plane_from_uswc_fn() returning a pointer to the correct function. [...] > > diff --git a/libavutil/x86/imgutils.c b/libavutil/x86/imgutils.c > > new file mode 100644 > > index 000..91c7a42 > > --- /dev/null > > +++ b/libavutil/x86/imgutils.c > > @@ -0,0 +1,109 @@ > > +/* > > + * This file is part of FFmpeg. > > + * > > + * FFmpeg is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU Lesser General Public > > + * License as published by the Free Software Foundation; either > > + * version 2.1 of the License, or (at your option) any later version. > > + * > > + * FFmpeg is distributed in the hope that it will be useful, > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + * Lesser General Public License for more details. 
> > + * > > + * You should have received a copy of the GNU Lesser General Public > > + * License along with FFmpeg; if not, write to the Free Software > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > > 02110-1301 USA > > + */ > > + > > +#include > > +#include "config.h" > > +#include "libavutil/attributes.h" > > +#include "libavutil/avassert.h" > > +#include "libavutil/intreadwrite.h" > > +#include "libavutil/x86/asm.h" > > +#include "libavutil/x86/cpu.h" > > +#include "libavutil/cpu.h" > > +#include "libavutil/pixdesc.h" > > + > > +#include "libavutil/avassert.h" > > +#include "libavutil/x86/asm.h" > > +#include "libavutil/imgutils.h" > > +#include "libavutil/imgutils_internal.h" > > + > > +#ifdef HAVE_SSE2 > > +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 > > instruction > > + * load and storing data with the SSE>=2 instruction store. > > + */ > > +#define COPY16(dstp, srcp, load, store) \ > > +__asm__ volatile ( \ > > +load " 0(%[src]), %%xmm1\n"\ > > +store " %%xmm1,0(%[dst])\n" \ > > +: : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1") > > + > > +#define COPY64(dstp, srcp, load, store) \ > > +__asm__ volatile ( \ > > +load " 0(%[src]), %%xmm1\n"\ > > +load " 16(%[src]), %%xmm2\n"\ > > +load " 32(%[src]), %%xmm3\n"\ > > +load " 48(%[src]), %%xmm4\n"\ > > +store " %%xmm1,0(%[dst])\n" \ > > +store " %%xmm2, 16(%[dst])\n" \ > > +store " %%xmm3, 32(%[dst])\n" \ > > +store " %%xmm4, 48(%[dst])\n" \ > > +
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On 28/05/15 2:39 PM, Stefano Sabatini wrote: > From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001 > From: Stefano Sabatini > Date: Fri, 15 May 2015 18:58:17 +0200 > Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function. > > This function allows support to optimized GPU to CPU. > > Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar > . > > TODO: fix integration with the build system, bump micro > > Signed-off-by: Stefano Sabatini > --- > libavutil/imgutils.c | 14 ++ > libavutil/imgutils.h | 18 +++ > libavutil/imgutils_internal.h | 29 +++ > libavutil/x86/Makefile| 1 + > libavutil/x86/imgutils.c | 109 > ++ > 5 files changed, 171 insertions(+) > create mode 100644 libavutil/imgutils_internal.h > create mode 100644 libavutil/x86/imgutils.c > > diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c > index ef0e671..e538c75 100644 > --- a/libavutil/imgutils.c > +++ b/libavutil/imgutils.c > @@ -30,6 +30,7 @@ > #include "mathematics.h" > #include "pixdesc.h" > #include "rational.h" > +#include "imgutils_internal.h" > > void av_image_fill_max_pixsteps(int max_pixsteps[4], int > max_pixstep_comps[4], > const AVPixFmtDescriptor *pixdesc) > @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size, > > return size; > } > + > +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, > +const uint8_t *src, size_t src_linesize, > +unsigned bytewidth, unsigned height, > +unsigned cpu_flags) > +{ > +#ifndef HAVE_SSSE3 All HAVE_ are always defined to either 0 or 1. Nonetheless, this kind of check does not belong outside of arch folders. You should check for ARCH_X86 to call functions in the x86/ folder. See lavc/lavfi for examples. 
> +av_unused(cpu_flags); > +av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, > height); > +#else > +ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, > bytewidth, height, cpu_flags); > +#endif > +} > diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h > index 23282a3..82c3826 100644 > --- a/libavutil/imgutils.h > +++ b/libavutil/imgutils.h > @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t *dst, int > dst_linesize, > int bytewidth, int height); > > /** > + * Copy image plane from src to dst, similar to av_image_copy_plane(). > + * src must be an USWC buffer. > + * It performs optimized copy from "Uncacheable Speculative Write > + * Combining" memory as used by some video surface. > + * It is really efficient only when SSE4.1 is available. > + * > + * In case the target CPU does not support USWC caching this function > + * will be equivalent to av_image_copy_plane(). > + * > + * @param cpu_flags as returned by av_get_cpu_flags() > + * @see av_image_copy_plane() > + */ > +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, > + const uint8_t *src, size_t src_linesize, > + unsigned bytewidth, unsigned height, > + unsigned cpu_flags); > + > +/** > * Copy image in src_data to dst_data. > * > * @param dst_linesizes linesizes for the image in dst_data > diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h > new file mode 100644 > index 000..16ed977 > --- /dev/null > +++ b/libavutil/imgutils_internal.h > @@ -0,0 +1,29 @@ > +/* > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. 
> + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 > USA > + */ > + > +#ifndef AVUTIL_IMGUTILS_INTERNAL_H > +#define AVUTIL_IMGUTILS_INTERNAL_H > + > +#include "imgutils.h" > + > +void ff_image_copy_plane_from_uswc_x86(uint8_t *dst, size_t dst_linesize, > +const uint8_t *src, size_t src_linesize, > +unsigned bytewidth, unsigned height, > +unsigned cpu_flags); > + > +#endif /* AVUTIL_IMGUTILS_INTERNAL_H */ > diff --git a/libavutil/x86/Makefile b/libavutil/x
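The point about HAVE_ macros always being defined to 0 or 1 is why `#ifndef HAVE_SSSE3` never takes the fallback branch. A minimal sketch, using a hypothetical macro name in place of the configure-generated ones:

```c
/* Mimic a generated config.h entry: the feature is disabled, but the
 * macro is still defined (to 0). HAVE_FEATURE_B is a hypothetical name. */
#define HAVE_FEATURE_B 0

/* Tests only whether the macro exists. */
static int feature_b_seen_by_ifdef(void)
{
#ifdef HAVE_FEATURE_B
    return 1; /* taken: the macro is defined, even though its value is 0 */
#else
    return 0;
#endif
}

/* Tests the macro's value, which is what the config system intends. */
static int feature_b_seen_by_if(void)
{
#if HAVE_FEATURE_B
    return 1;
#else
    return 0; /* taken: the value is 0 */
#endif
}
```

With this style of config header, `#if HAVE_X` and `#if !HAVE_X` are the forms that behave as expected.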
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Thu, May 28, 2015 at 7:39 PM, Stefano Sabatini wrote:
> On date Monday 2015-05-18 13:26:56 +0200, Stefano Sabatini encoded:
>> On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes wrote:
>>
>> > On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini wrote:
>> >
>> [...]
>>
>> > >
>> > > I have a first hackish patch, performed some tests and I got some
>> > > significant performance gains, on my iCore5 with Intel Graphics HD4000 I
>> > > have now the same performance as the software decoder using DXVA2 for
>> > > decoding a H.264 1920x1080 video, but using only a single thread. The patch
>> > > as is is a hack, since I had to modify the compilation flags to enable
>> > > assembly compilation in the ffmpeg_dxva2.c file. I should probably create
>> > > an optimized copy function in libavutil, comments are welcome.
>> >
>> > FWIW, I never saw any benefits from using a small cache over simply
>> > copying directly to the destination memory, that could potentially
>> > simplify this a bit.
>> >
>> > And yeah, it's a huge hack, we don't want new inline assembly.
>> >
>>
>> The sanest approach is probably to add a function to libavutil. The
>> optimized copy would then be accessible to third-party library users, with
>> no assembly hacks involved.
>
> New patch attached, it's still somewhat hackish, please advise if you
> consider this approach acceptable.
>

The general concept is fine, but it should not use inline asm, and
someone will want to argue about the name and placement etc... :)
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Monday 2015-05-18 13:26:56 +0200, Stefano Sabatini encoded: > On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes > wrote: > > > On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini > > wrote: > > > [...] > > > > > > > I have a first hackish patch, performed some tests and I got some > > > significant performance gains, on my iCore5 with Intel Graphics HD4000 I > > > have now the same performance as the software decoder using DXVA2 for > > > decoding a H.264 1920x1080 video, but using only a single thread. The > > patch > > > as is is a hack, since I had to modify the compilation flags to enable > > > assembly compilation in the ffmpeg_dxva2.c file. I should probably create > > > an optimized copy function in libavutil, comments are welcome. > > > > FWIW, I never saw any benefits from using a small cache over simply > > copying directly to the destination memory, that could potentially > > simplify this a bit. > > > > > > And yeah, its a huge hack, we don't want new inline assembly. > > > > The sanest approach is probably to add a function to libavutil. The > optimized copy would then be accessible to third-party library users, with > no assembly hacks involved. New patch attached, it's still somehow hackish, please advice if you consider this approach acceptable. -- FFmpeg = Formidable and Friendly MultiPurpose Explosive Game >From f3b4e77dd9dd299aba8f4fa83625d2b61b243c3c Mon Sep 17 00:00:00 2001 From: Stefano Sabatini Date: Fri, 15 May 2015 18:58:17 +0200 Subject: [PATCH] lavu/imgutils: add av_image_copy_plane_from_uswc() function. This function allows support to optimized GPU to CPU. Based on code from vlc dxva2.c, commit 62107e56 by Laurent Aimar . 
TODO: fix integration with the build system, bump micro Signed-off-by: Stefano Sabatini --- libavutil/imgutils.c | 14 ++ libavutil/imgutils.h | 18 +++ libavutil/imgutils_internal.h | 29 +++ libavutil/x86/Makefile| 1 + libavutil/x86/imgutils.c | 109 ++ 5 files changed, 171 insertions(+) create mode 100644 libavutil/imgutils_internal.h create mode 100644 libavutil/x86/imgutils.c diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c index ef0e671..e538c75 100644 --- a/libavutil/imgutils.c +++ b/libavutil/imgutils.c @@ -30,6 +30,7 @@ #include "mathematics.h" #include "pixdesc.h" #include "rational.h" +#include "imgutils_internal.h" void av_image_fill_max_pixsteps(int max_pixsteps[4], int max_pixstep_comps[4], const AVPixFmtDescriptor *pixdesc) @@ -405,3 +406,16 @@ int av_image_copy_to_buffer(uint8_t *dst, int dst_size, return size; } + +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, + const uint8_t *src, size_t src_linesize, + unsigned bytewidth, unsigned height, + unsigned cpu_flags) +{ +#ifndef HAVE_SSSE3 +av_unused(cpu_flags); +av_image_copy_plane(dst, dst_linesize, src, src_linesize, bytewidth, height); +#else +ff_image_copy_plane_from_uswc_x86(dst, dst_linesize, src, src_linesize, bytewidth, height, cpu_flags); +#endif +} diff --git a/libavutil/imgutils.h b/libavutil/imgutils.h index 23282a3..82c3826 100644 --- a/libavutil/imgutils.h +++ b/libavutil/imgutils.h @@ -111,6 +111,24 @@ void av_image_copy_plane(uint8_t *dst, int dst_linesize, int bytewidth, int height); /** + * Copy image plane from src to dst, similar to av_image_copy_plane(). + * src must be an USWC buffer. + * It performs optimized copy from "Uncacheable Speculative Write + * Combining" memory as used by some video surface. + * It is really efficient only when SSE4.1 is available. + * + * In case the target CPU does not support USWC caching this function + * will be equivalent to av_image_copy_plane(). 
+ * + * @param cpu_flags as returned by av_get_cpu_flags() + * @see av_image_copy_plane() + */ +void av_image_copy_plane_from_uswc(uint8_t *dst, size_t dst_linesize, + const uint8_t *src, size_t src_linesize, + unsigned bytewidth, unsigned height, + unsigned cpu_flags); + +/** * Copy image in src_data to dst_data. * * @param dst_linesizes linesizes for the image in dst_data diff --git a/libavutil/imgutils_internal.h b/libavutil/imgutils_internal.h new file mode 100644 index 000..16ed977 --- /dev/null +++ b/libavutil/imgutils_internal.h @@ -0,0 +1,29 @@ +/* + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Mon, May 18, 2015 at 9:41 PM, Reimar Döffinger wrote:
>
> On 18.05.2015, at 12:37, Stefano Sabatini wrote:
>
>> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini wrote:
>>
>>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>>> [...]
>>>>> One limitation is as the manual said, it needs to be copied from the
>>>>> GPU to system memory. ffmpeg_dxva2.c does not implement an optimized
>>>>> copy function for this, it uses plain old memcpy.
>>>>> Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
>>>>> is optimized for copying from USWC memory (Uncacheable Speculative
>>>>> Write Combining) to system memory. Using this may help speed up the
>>>>> process significantly, and VLC probably uses it.
>>>>
>>>> Now the question is, how would it be possible to optimize GPU to CPU
>>>> copy to get an overall performance gain? At least VLC seems able to get
>>>> better performance when using HW decoding, but I'm not sure it is
>>>> copying decoded data back to the CPU (indeed it may perform direct
>>>> rendering).
>>>
>>> Self-reply:
>>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>>> Author: Laurent Aimar
>>> Date: Tue Nov 17 01:09:43 2009 +0100
>>>
>>>     Improved performance when copying video surface in dxva2.
>>>
>>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>>> instructions are available.
>>>
>>
>> I have a first hackish patch, performed some tests and I got some
>> significant performance gains, on my iCore5 with Intel Graphics HD4000 I
>> have now the same performance as the software decoder using DXVA2 for
>> decoding a H.264 1920x1080 video, but using only a single thread. The patch
>> as is is a hack, since I had to modify the compilation flags to enable
>> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
>> an optimized copy function in libavutil, comments are welcome.
>
> What exactly is SSE4 needed for?

MOVNTDQA; it's specifically designed for just this task.

> Both non-temporal movs and prefetches existed before it, so if that is
> critical for performance the fallback implementation is bad.

An SSE2 implementation may or may not be faster than plain memcpy; that
depends on memcpy. In my tests on Windows, an SSE2 implementation was
usually not worth it.

> However possibly more important: why is a memcpy needed at all?

For any further processing, you need the frame data. And trying to use
the frame data directly from the locked surfaces for e.g. an encoder is
very inefficient (possibly random access pattern), so it needs to be
copied into normal memory first.

- Hendrik
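To make the MOVNTDQA point concrete, here is a rough intrinsics sketch of a copy that uses the SSE4.1 streaming load when it can and plain memcpy otherwise. It is not the code from the patches in this thread, which use inline assembly and take cpu_flags from av_get_cpu_flags(). The function names are hypothetical, the GCC/Clang builtin `__builtin_cpu_supports()` and the `target("sse4.1")` attribute are used as stand-ins for FFmpeg's own CPU detection and build flags, and on ordinary cached memory the streaming load is not expected to be any faster than a normal load.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_STREAM_LOAD_SKETCH 1

/* Copy with MOVNTDQA loads and aligned stores.
 * Assumes dst and src are both 16-byte aligned. */
__attribute__((target("sse4.1")))
static void copy_stream_load(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t x = 0;

    for (; x + 16 <= size; x += 16) {
        /* _mm_stream_load_si128 maps to MOVNTDQA */
        __m128i v = _mm_stream_load_si128((__m128i *)(src + x));
        _mm_store_si128((__m128i *)(dst + x), v);
    }
    memcpy(dst + x, src + x, size - x); /* remaining 0..15 bytes */
}
#else
#define HAVE_STREAM_LOAD_SKETCH 0
#endif

/* Hypothetical entry point: streaming path when the CPU supports SSE4.1
 * and both pointers are 16-byte aligned, plain memcpy otherwise. */
void copy_from_uswc_sketch(uint8_t *dst, const uint8_t *src, size_t size)
{
#if HAVE_STREAM_LOAD_SKETCH
    int aligned = (((uintptr_t)dst | (uintptr_t)src) & 0x0f) == 0;

    if (aligned && __builtin_cpu_supports("sse4.1")) {
        copy_stream_load(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

The memcpy fallback mirrors the observation above that a dedicated SSE2 path was usually not worth it.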
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On 18.05.2015, at 12:37, Stefano Sabatini wrote:
> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini wrote:
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> [...]
>>>> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.
>>>
>>> Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).
>>
>> Self-reply:
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar
>> Date: Tue Nov 17 01:09:43 2009 +0100
>>
>>     Improved performance when copying video surface in dxva2.
>>
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2 instructions are available.
>
> I have a first hackish patch. I performed some tests and got significant performance gains: on my Core i5 with Intel HD Graphics 4000 I now get the same performance as the software decoder when using DXVA2 to decode an H.264 1920x1080 video, but using only a single thread. The patch as is is a hack, since I had to modify the compilation flags to enable assembly compilation in the ffmpeg_dxva2.c file. I should probably create an optimized copy function in libavutil; comments are welcome.

What exactly is SSE4 needed for? Both non-temporal movs and prefetches existed before it, so if that is critical for performance the fallback implementation is bad.

However, possibly more important: why is a memcpy needed at all?
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Mon, May 18, 2015 at 1:17 PM, Hendrik Leppkes wrote:
> On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini wrote:
> [...]
>> I have a first hackish patch. I performed some tests and got significant performance gains: on my Core i5 with Intel HD Graphics 4000 I now get the same performance as the software decoder when using DXVA2 to decode an H.264 1920x1080 video, but using only a single thread. The patch as is is a hack, since I had to modify the compilation flags to enable assembly compilation in the ffmpeg_dxva2.c file. I should probably create an optimized copy function in libavutil; comments are welcome.
>
> FWIW, I never saw any benefits from using a small cache over simply copying directly to the destination memory; that could potentially simplify this a bit.
>
> And yeah, it's a huge hack, we don't want new inline assembly.

The sanest approach is probably to add a function to libavutil. The optimized copy would then be accessible to third-party library users, with no assembly hacks involved.
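One way such a library function could be organized is a routine selected once at run time, with the SSE4.1 code confined to a single function so the rest of the file is still built for the baseline CPU. This is only a rough sketch under several assumptions: it uses compiler intrinsics and the GCC/Clang builtin __builtin_cpu_supports for brevity, and the names copy_fn, copy_c, copy_sse41, select_copy and HAVE_X86_SSE41_SKETCH are hypothetical, not an existing libavutil API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t size);

/* Portable fallback: plain memcpy. */
static void copy_c(uint8_t *dst, const uint8_t *src, size_t size)
{
    memcpy(dst, src, size);
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_SSE41_SKETCH 1  /* hypothetical macro for this sketch */

/* The target attribute limits SSE4.1 code generation to this function. */
__attribute__((target("sse4.1")))
static void copy_sse41(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t x = 0;

    /* MOVNTDQA requires a 16-byte aligned source address. */
    if (((uintptr_t)src & 15) == 0) {
        for (; x + 16 <= size; x += 16) {
            __m128i v = _mm_stream_load_si128((__m128i *)(src + x));
            _mm_storeu_si128((__m128i *)(dst + x), v);
        }
    }
    /* Tail bytes, or the whole buffer if the source was unaligned. */
    memcpy(dst + x, src + x, size - x);
}
#endif

/* Hypothetical selector: pick the best routine for the running CPU. */
static copy_fn select_copy(void)
{
#ifdef HAVE_X86_SSE41_SKETCH
    if (__builtin_cpu_supports("sse4.1"))
        return copy_sse41;
#endif
    return copy_c;
}
```

A caller would store the result of select_copy() once and invoke it per row or per plane. Inside FFmpeg the run-time check would more likely go through av_get_cpu_flags() and AV_CPU_FLAG_SSE4 from libavutil/cpu.h, and the streaming-load hint is only expected to matter when the source is write-combining memory; on ordinary cached memory the loop behaves like a regular copy.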
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Mon, May 18, 2015 at 12:37 PM, Stefano Sabatini wrote:
> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini wrote:
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> [...]
>>>> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.
>>>
>>> Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).
>>
>> Self-reply:
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar
>> Date: Tue Nov 17 01:09:43 2009 +0100
>>
>>     Improved performance when copying video surface in dxva2.
>>
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2 instructions are available.
>
> I have a first hackish patch. I performed some tests and got significant performance gains: on my Core i5 with Intel HD Graphics 4000 I now get the same performance as the software decoder when using DXVA2 to decode an H.264 1920x1080 video, but using only a single thread. The patch as is is a hack, since I had to modify the compilation flags to enable assembly compilation in the ffmpeg_dxva2.c file. I should probably create an optimized copy function in libavutil; comments are welcome.

FWIW, I never saw any benefits from using a small cache over simply copying directly to the destination memory; that could potentially simplify this a bit.

And yeah, it's a huge hack, we don't want new inline assembly.

- Hendrik
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini wrote:
> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
>>> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.
>>
>> Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar
> Date: Tue Nov 17 01:09:43 2009 +0100
>
>     Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2 instructions are available.

I have a first hackish patch. I performed some tests and got significant performance gains: on my Core i5 with Intel HD Graphics 4000 I now get the same performance as the software decoder when using DXVA2 to decode an H.264 1920x1080 video, but using only a single thread. The patch as is is a hack, since I had to modify the compilation flags to enable assembly compilation in the ffmpeg_dxva2.c file. I should probably create an optimized copy function in libavutil; comments are welcome.

The IDirect3D9_CreateDevice(... GetShellWindow ...) -> ..GetDesktopWindow change is required to make it compile under MinGW (with MinGW64 it is probably not required; I still have to switch to MinGW64, but allowing MinGW compilation is still worthwhile).

0001-ffmpeg_dxva.c-add-support-to-optimized-GPU-to-CPU-co.patch
Description: Binary data
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini wrote:
> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
>>> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.
>>
>> Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar
> Date: Tue Nov 17 01:09:43 2009 +0100
>
>     Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2 instructions are available.

Actually, the proper instructions for this are SSE4.1; using SSE2 would only give a small advantage over memcpy.

- Hendrik
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Thu, 14 May 2015 14:52:29 +0200 Stefano Sabatini wrote:
> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
>>> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.
>>
>> Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar
> Date: Tue Nov 17 01:09:43 2009 +0100
>
>     Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2 instructions are available.

Here's what lavfilters appears to use:
http://git.1f0.de/gitweb?p=lavfsplitter.git;a=blob;f=common/DSUtilLite/gpu_memcpy_sse4.h
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
[...]
>> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.
>
> Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).

Self-reply:
commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
Author: Laurent Aimar
Date: Tue Nov 17 01:09:43 2009 +0100

    Improved performance when copying video surface in dxva2.

That is, VLC is using optimized GPU->CPU copy when the relevant SSE2 instructions are available.
--
FFmpeg = Fundamental & Frightening Mean Peaceful EniGma
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini wrote:
[...]
>> Are there cases where DXVA2 (or in general HW decoding) can be used effectively in ffmpeg? Can you tell if there is something which could be improved in the current ffmpeg_dxva2.c implementation? (My guess is that this code is somehow based on the VLC code).
>
> It's not based on the VLC code; it's roughly based on code from my own project that uses ffmpeg for DXVA2, but really, the workflow is going to be pretty similar in any implementation either way, since the MS API dictates that, more or less.
>
> DXVA2 decoding can be faster than software decoding, depending on your hardware.
>
> If you use a low-end Intel CPU, say a Pentium or i3 (Ivy or Haswell), or a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2 decoding on the GPU can potentially give you ~400 fps for 1080p, while the CPU will likely not manage that. On a high-end CPU, the software decoder can potentially exceed that, however.
>
> One limitation is as the manual said, it needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this, it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.

Now the question is, how would it be possible to optimize the GPU to CPU copy to get an overall performance gain? At least VLC seems able to get better performance when using HW decoding, but I'm not sure it is copying decoded data back to the CPU (indeed it may perform direct rendering).

> The original primary goal of this code was however to be able to test and debug the hwaccels much more easily, and not directly to provide a playback/transcoding feature, so such optimizations were not performed for brevity.
[...]

Thanks.
--
FFmpeg = Fanciful & Faithless Merciless Powerful EntanGlement
Re: [FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
On Tue, May 12, 2015 at 3:33 PM, Stefano Sabatini wrote:
> Hi guys,
>
> I'm playing with DXVA2 hardware decoding on Windows, and these are my findings.
>
> DXVA2 decoding was enabled in avconv/ffmpeg through the commit:
>
> commit 35177ba77ff60a8b8839783f57e44bcc4214507a
> Author: Hendrik Leppkes
> Date: Tue Apr 22 15:22:53 2014 +0200
>
>     avconv: add support for DXVA2 decoding
>
>     Signed-off-by: Anton Khirnov
>
> DXVA2 decoding is enabled when a dxva2api.h header is found in the path. From my understanding the header is provided by VLC:
> http://download.videolan.org/pub/contrib/dxva2api.h
>
> (I suppose the header was created in order to make compilation work with MinGW). When compiling with MinGW from mingw.org I had to change the GetShellWindow call in the line:
>
> hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
>                              D3DCREATE_SOFTWARE_VERTEXPROCESSING | D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
>                              &d3dpp, &ctx->d3d9device);
>
> to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix suggested here:
> http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

You should use mingw-w64; it provides a dxva2api.h and can compile the code without any modifications. Using the "original" mingw32 is not recommended, and barely supported.

> Then I performed some tests with the command:
> ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -
>
> The -threads 1 option seems required or ffmpeg will fail with decoding errors.

Indeed, multi-threading with hwaccel is not something that should be used, as it will break, although the API allows it for BS reasons. There wouldn't be a performance improvement either way.

> In the ffmpeg(1) manual I can read this big warning:
>     Note that most acceleration methods are intended for playback and will not be faster than software decoding on modern CPUs. Additionally, ffmpeg will usually need to copy the decoded frames from the GPU memory into the system memory, resulting in further performance loss. This option is thus mainly useful for testing.
>
> I tested with several HW combinations, and I consistently find that pure software decoding is several times faster than DXVA2 decoding. In some cases I got invalid output (same with VLC) which may be related to a problem in the graphics card or driver (a VIA VX900).

I don't think I've ever tested on such a chip. I didn't even know VIA still made PC hardware. Therefore, I have no idea how fast/slow or compatible it is.

> On the other hand when testing with VLC I noticed better performance (in general, a significantly reduced usage of the CPU, usually by a factor of about 3), so I have to conclude that at least VLC is able to make good use of DXVA2 hardware acceleration.
>
> I'm aware that the need to copy GPU data back to the CPU memory as required by ffmpeg defeats the advantage (if any) of hardware decoding, especially given that multithreaded decoding cannot be used with DXVA2.
>
> My questions are:
>
> Are there cases where DXVA2 (or in general HW decoding) can be used effectively in ffmpeg? Can you tell if there is something which could be improved in the current ffmpeg_dxva2.c implementation? (My guess is that this code is somehow based on the VLC code).

It's not based on the VLC code; it's roughly based on code from my own project that uses ffmpeg for DXVA2, but really, the workflow is going to be pretty similar in any implementation either way, since the MS API dictates that, more or less.

DXVA2 decoding can be faster than software decoding, depending on your hardware.

If you use a low-end Intel CPU, say a Pentium or i3 (Ivy or Haswell), or a recent NVIDIA GPU (Kepler or Maxwell), then DXVA2 decoding on the GPU can potentially give you ~400 fps for 1080p, while the CPU will likely not manage that. On a high-end CPU, the software decoder can potentially exceed that, however.

One limitation is, as the manual said, that the decoded data needs to be copied from the GPU to system memory. ffmpeg_dxva2.c does not implement an optimized copy function for this; it uses plain old memcpy. Intel introduced a new instruction for this in SSE4, MOVNTDQA, which is optimized for copying from USWC memory (Uncacheable Speculative Write Combining) to system memory. Using this may help speed up the process significantly, and VLC probably uses it.

The original primary goal of this code was however to be able to test and debug the hwaccels much more easily, and not directly to provide a playback/transcoding feature, so such optimizations were not performed for brevity.

- Hendrik
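For a sense of scale, the GPU-to-system-memory traffic implied by the ~400 fps figure can be estimated with a little arithmetic. The sketch below assumes the decoded surfaces are 8-bit NV12 (a full-resolution luma plane plus half as many bytes of interleaved chroma) and ignores any row padding or height alignment the driver may add; both function names are hypothetical.

```c
#include <stdint.h>

/* Bytes in one 8-bit NV12 frame: width*height luma bytes plus
 * width*height/2 interleaved chroma bytes (padding ignored). */
static uint64_t nv12_frame_bytes(uint32_t width, uint32_t height)
{
    return (uint64_t)width * height * 3 / 2;
}

/* Bytes per second that have to be copied into system memory
 * when every decoded frame is read back. */
static uint64_t copy_bytes_per_second(uint32_t width, uint32_t height,
                                      uint32_t fps)
{
    return nv12_frame_bytes(width, height) * fps;
}
```

Under these assumptions a 1920x1080 frame is 3,110,400 bytes, so reading back 400 frames per second means copying about 1.24 GB/s, which is why the speed of the copy routine can dominate the overall result.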
[FFmpeg-devel] [RFC] DXVA2 decoding and FFmpeg
Hi guys,

I'm playing with DXVA2 hardware decoding on Windows, and these are my findings.

DXVA2 decoding was enabled in avconv/ffmpeg through the commit:

commit 35177ba77ff60a8b8839783f57e44bcc4214507a
Author: Hendrik Leppkes
Date: Tue Apr 22 15:22:53 2014 +0200

    avconv: add support for DXVA2 decoding

    Signed-off-by: Anton Khirnov

DXVA2 decoding is enabled when a dxva2api.h header is found in the path. From my understanding the header is provided by VLC:
http://download.videolan.org/pub/contrib/dxva2api.h

(I suppose the header was created in order to make compilation work with MinGW). When compiling with MinGW from mingw.org I had to change the GetShellWindow call in the line:

hr = IDirect3D9_CreateDevice(ctx->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
                             D3DCREATE_SOFTWARE_VERTEXPROCESSING | D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
                             &d3dpp, &ctx->d3d9device);

to GetDesktopWindow in the ffmpeg_dxva2.c file. I applied the fix suggested here:
http://ffmpeg.org/pipermail/libav-user/2014-December/007673.html

Then I performed some tests with the command:
ffmpeg -hwaccel dxva2 INPUT -threads 1 -f null -

The -threads 1 option seems required or ffmpeg will fail with decoding errors.

In the ffmpeg(1) manual I can read this big warning:
    Note that most acceleration methods are intended for playback and will not be faster than software decoding on modern CPUs. Additionally, ffmpeg will usually need to copy the decoded frames from the GPU memory into the system memory, resulting in further performance loss. This option is thus mainly useful for testing.

I tested with several HW combinations, and I consistently find that pure software decoding is several times faster than DXVA2 decoding. In some cases I got invalid output (same with VLC) which may be related to a problem in the graphics card or driver (a VIA VX900).

On the other hand, when testing with VLC I noticed better performance (in general, a significantly reduced usage of the CPU, usually by a factor of about 3), so I have to conclude that at least VLC is able to make good use of DXVA2 hardware acceleration.

I'm aware that the need to copy GPU data back to the CPU memory as required by ffmpeg defeats the advantage (if any) of hardware decoding, especially given that multithreaded decoding cannot be used with DXVA2.

My questions are:

Are there cases where DXVA2 (or in general HW decoding) can be used effectively in ffmpeg? Can you tell if there is something which could be improved in the current ffmpeg_dxva2.c implementation? (My guess is that this code is somehow based on the VLC code).

Would it make sense to integrate DXVA2 decoding in ffplay.c, assuming it would be worth the effort, at least for testing/didactic purposes?

Related resources:
https://trac.ffmpeg.org/ticket/604
https://ffmpeg.org/pipermail/ffmpeg-user/2012-May/006600.html
http://forum.doom9.org/showthread.php?t=170793

TIA for any comments.
--
FFmpeg = Fostering and Fantastic Maxi Picky Erudite God