On Thu, Jan 07, 2016 at 05:20:55PM -0800, Ganesh Ajjanagadde wrote:
> On Thu, Jan 7, 2016 at 4:48 PM, Michael Niedermayer
> <mich...@niedermayer.cc> wrote:
> > On Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde wrote:
> >> This exploits an approach based on the sieve of Eratosthenes, a popular
> >> method for generating prime numbers.
> >>
> >> Tables are identical to previous ones.
> >>
> >> Tested with FATE with/without --enable-hardcoded-tables.
> >>
> >> Sample benchmark (Haswell, GNU/Linux+gcc):
> >> prev:
> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips
> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips
> >> [...]
> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips
> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips
> >>
> >> new:
> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips
> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips
> >> [...]
> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips
> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips
> >>
> >> Both small and large run count given as this is called once so small run
> >> count may give a better picture, small numbers are fairly consistent,
> >> and there is a consistent downward trend from small to large runs,
> >> at which point it stabilizes to a new value.
> >>
> >> Signed-off-by: Ganesh Ajjanagadde <gajjanaga...@gmail.com>
> >> ---
> >>  libavcodec/aacdec_fixed.c           |  4 +--
> >>  libavcodec/aacdec_template.c        |  2 +-
> >>  libavcodec/cbrt_tablegen.h          | 53 ++++++++++++++++++++++++++-----------
> >>  libavcodec/cbrt_tablegen_template.c | 12 ++++++++-
> >>  4 files changed, 51 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/libavcodec/aacdec_fixed.c b/libavcodec/aacdec_fixed.c
> >> index 396a874..f7b882b 100644
> >> --- a/libavcodec/aacdec_fixed.c
> >> +++ b/libavcodec/aacdec_fixed.c
> >> @@ -155,9 +155,9 @@ static void vector_pow43(int *coefs, int len)
> >>      for (i=0; i<len; i++) {
> >>          coef = coefs[i];
> >>          if (coef < 0)
> >> -            coef = -(int)cbrt_tab[-coef];
> >> +            coef = -(int)cbrt_tab[-coef].i;
> >>          else
> >> -            coef = (int)cbrt_tab[coef];
> >> +            coef = (int)cbrt_tab[coef].i;
> >>          coefs[i] = coef;
> >>      }
> >>  }
> >> diff --git a/libavcodec/aacdec_template.c b/libavcodec/aacdec_template.c
> >> index d819958..1380510 100644
> >> --- a/libavcodec/aacdec_template.c
> >> +++ b/libavcodec/aacdec_template.c
> >> @@ -1791,7 +1791,7 @@ static int decode_spectrum_and_dequant(AACContext *ac, INTFLOAT coef[1024],
> >>                                  v = -v;
> >>                              *icf++ = v;
> >>  #else
> >> -                            *icf++ = cbrt_tab[n] | (bits & 1U<<31);
> >> +                            *icf++ = cbrt_tab[n].i | (bits & 1U<<31);
> >>  #endif /* USE_FIXED */
> >>                              bits <<= 1;
> >>                          } else {
> >> diff --git a/libavcodec/cbrt_tablegen.h b/libavcodec/cbrt_tablegen.h
> >> index 59b5a1d..e3d6634 100644
> >> --- a/libavcodec/cbrt_tablegen.h
> >> +++ b/libavcodec/cbrt_tablegen.h
> >> @@ -26,14 +26,13 @@
> >>  #include <stdint.h>
> >>  #include <math.h>
> >>  #include "libavutil/attributes.h"
> >> +#include "libavutil/intfloat.h"
> >>  #include "libavcodec/aac_defines.h"
> >>
> >> -#if USE_FIXED
> >> -#define CBRT(x) lrint((x).f * 8192)
> >> -#else
> >> -#define CBRT(x) x.i
> >> -#endif
> >> -
> >
> >> +union ff_int32float64 {
> >> +    uint32_t i;
> >> +    double f;
> >> +};
> >>  #if CONFIG_HARDCODED_TABLES
> >>  #if USE_FIXED
> >>  #define cbrt_tableinit_fixed()
> >> @@ -43,20 +42,42 @@
> >>  #include "libavcodec/cbrt_tables.h"
> >>  #endif
> >>  #else
> >> -static uint32_t cbrt_tab[1 << 13];
> >> +static union ff_int32float64 cbrt_tab[1 << 13];
> >
> > this doubles the size of the cpu cache needed at runtime to store
> > the same number of elements
>
> Yes, it does, and it was a tradeoff I made that I forgot to list. One
> can of course use floats; but this loses accuracy at significant
> levels.
>
> So one could malloc and free a double precision array (for temporary
> storage) at costs of some code complexity, possible heap
> fragmentation, and the problem of possible failure (may be ok since
> anyway aac_decode_init is not guaranteed to succeed; it allocates
> memory for the dsp context). Malloc/free is AFAIK ~ 100's of cycles,
> dwarfed by the table generation cost.
>
> The problem is that it is impossible to give an answer as to precisely
> what impact that will have on decoding/encoding performance, and
> results of course vary based on hardware. This is the same problem
> that plagues static/dynamic table performance analysis.
>
> I don't have a measurable performance regression on my machine for aac
> decoding because of this. But then, my Haswell setup is not exactly
> representative.
you can use 2 separate arrays without union or maybe make the arrays
part of the union instead of the array elements

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

He who knows, does not speak. He who speaks, does not know. -- Lao Tsu
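
To make the two suggestions above concrete, the following sketch compares the storage that runtime lookups touch under each layout. It is only an illustration of the layouts as described in this thread. All names (elem_u, tab_scratch, tab_final, table_u and the hot_bytes_* helpers) are hypothetical and do not come from FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAB_SIZE (1 << 13)

/* Layout from the quoted patch: every element carries a double. */
union elem_u {
    uint32_t i;
    double   f;
};

/* Suggestion 1: two separate arrays, no union. */
static double   tab_scratch[TAB_SIZE];  /* used during table init only */
static uint32_t tab_final[TAB_SIZE];    /* the only array read at runtime */

/* Suggestion 2: the arrays, not the elements, share storage. */
union table_u {
    uint32_t i[TAB_SIZE];
    double   f[TAB_SIZE];
};

/* Bytes spanned by runtime lookups with a union per element. */
static size_t hot_bytes_element_union(void)
{
    /* A union is at least as large as its largest member. */
    assert(sizeof(union elem_u) >= sizeof(double));
    return TAB_SIZE * sizeof(union elem_u);
}

/* Bytes spanned by runtime lookups with separate arrays. */
static size_t hot_bytes_separate(void)
{
    (void)tab_scratch;  /* not touched after init */
    return sizeof(tab_final);
}

/* Bytes spanned by runtime lookups with arrays inside one union:
 * the uint32_t values sit contiguously at the start of the storage. */
static size_t hot_bytes_array_union(void)
{
    return sizeof(((union table_u *)0)->i);
}
```

On a typical platform with 8-byte double, the per-element union spans 64 KiB for lookups, while both suggested layouts keep the looked-up values within 32 KiB. The separate-array variant still reserves the scratch array statically unless it is allocated temporarily.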
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel