On Fri, Jan 8, 2016 at 6:52 AM, Michael Niedermayer <mich...@niedermayer.cc> wrote:
> On Thu, Jan 07, 2016 at 05:20:55PM -0800, Ganesh Ajjanagadde wrote:
>> On Thu, Jan 7, 2016 at 4:48 PM, Michael Niedermayer
>> <mich...@niedermayer.cc> wrote:
>> > On Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde wrote:
>> >> This exploits an approach based on the sieve of Eratosthenes, a popular
>> >> method for generating prime numbers.
>> >>
>> >> Tables are identical to previous ones.
>> >>
>> >> Tested with FATE with/without --enable-hardcoded-tables.
>> >>
>> >> Sample benchmark (Haswell, GNU/Linux+gcc):
>> >> prev:
>> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> >> [...]
>> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips
>> >>
>> >> new:
>> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips
>> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips
>> >> [...]
>> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips
>> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips
>> >>
>> >> Both small and large run count given as this is called once so small run
>> >> count may give a better picture, small numbers are fairly consistent,
>> >> and there is a consistent downward trend from small to large runs,
>> >> at which point it stabilizes to a new value.
>> >>
>> >> Signed-off-by: Ganesh Ajjanagadde <gajjanaga...@gmail.com>
>> >> ---
>> >>  libavcodec/aacdec_fixed.c           |  4 +--
>> >>  libavcodec/aacdec_template.c        |  2 +-
>> >>  libavcodec/cbrt_tablegen.h          | 53
>> >> ++++++++++++++++++++++++++-----------
>> >>  libavcodec/cbrt_tablegen_template.c | 12 ++++++++-
>> >>  4 files changed, 51 insertions(+), 20 deletions(-)
>> >>
>> >> diff --git a/libavcodec/aacdec_fixed.c b/libavcodec/aacdec_fixed.c
>> >> index 396a874..f7b882b 100644
>> >> --- a/libavcodec/aacdec_fixed.c
>> >> +++ b/libavcodec/aacdec_fixed.c
>> >> @@ -155,9 +155,9 @@ static void vector_pow43(int *coefs, int len)
>> >>      for (i=0; i<len; i++) {
>> >>          coef = coefs[i];
>> >>          if (coef < 0)
>> >> -            coef = -(int)cbrt_tab[-coef];
>> >> +            coef = -(int)cbrt_tab[-coef].i;
>> >>          else
>> >> -            coef = (int)cbrt_tab[coef];
>> >> +            coef = (int)cbrt_tab[coef].i;
>> >>          coefs[i] = coef;
>> >>      }
>> >>  }
>> >> diff --git a/libavcodec/aacdec_template.c b/libavcodec/aacdec_template.c
>> >> index d819958..1380510 100644
>> >> --- a/libavcodec/aacdec_template.c
>> >> +++ b/libavcodec/aacdec_template.c
>> >> @@ -1791,7 +1791,7 @@ static int decode_spectrum_and_dequant(AACContext
>> >> *ac, INTFLOAT coef[1024],
>> >>                                  v = -v;
>> >>                              *icf++ = v;
>> >>  #else
>> >> -                            *icf++ = cbrt_tab[n] | (bits &
>> >> 1U<<31);
>> >> +                            *icf++ = cbrt_tab[n].i | (bits &
>> >> 1U<<31);
>> >>  #endif /* USE_FIXED */
>> >>                              bits <<= 1;
>> >>                          } else {
>> >> diff --git a/libavcodec/cbrt_tablegen.h b/libavcodec/cbrt_tablegen.h
>> >> index 59b5a1d..e3d6634 100644
>> >> --- a/libavcodec/cbrt_tablegen.h
>> >> +++ b/libavcodec/cbrt_tablegen.h
>> >> @@ -26,14 +26,13 @@
>> >>  #include <stdint.h>
>> >>  #include <math.h>
>> >>  #include "libavutil/attributes.h"
>> >> +#include "libavutil/intfloat.h"
>> >>  #include "libavcodec/aac_defines.h"
>> >>
>> >> -#if USE_FIXED
>> >> -#define CBRT(x) lrint((x).f * 8192)
>> >> -#else
>> >> -#define CBRT(x) x.i
>> >> -#endif
>> >> -
>> >
>> >> +union ff_int32float64 {
>> >> +    uint32_t i;
>> >> +    double f;
>> >> +};
>> >>  #if CONFIG_HARDCODED_TABLES
>> >>  #if USE_FIXED
>> >>  #define cbrt_tableinit_fixed()
>> >> @@ -43,20 +42,42 @@
>> >>  #include "libavcodec/cbrt_tables.h"
>> >>  #endif
>> >>  #else
>> >> -static uint32_t cbrt_tab[1 << 13];
>> >> +static union ff_int32float64 cbrt_tab[1 << 13];
>> >
>> > this doubles the size of the cpu cache needed at runtime to store
>> > the same number of elements
>>
>> Yes, it does, and it was a tradeoff I made that I forgot to list. One
>> can of course use floats; but this loses accuracy at significant
>> levels.
>>
>> So one could malloc and free a double precision array (for temporary
>> storage) at costs of some code complexity, possible heap
>> fragmentation, and the problem of possible failure (may be ok since
>> anyway aac_decode_init is not guaranteed to succeed; it allocates
>> memory for the dsp context). Malloc/free is AFAIK ~ 100's of cycles,
>> dwarfed by the table generation cost.
>>
>> The problem is that it is impossible to give an answer as to precisely
>> what impact that will have on decoding/encoding performance, and
>> results of course vary based on hardware. This is the same problem
>> that plagues static/dynamic table performance analysis.
>>
>> I don't have a measurable performance regression on my machine for aac
>> decoding because of this. But then, my Haswell setup is not exactly
>> representative.
>
> you can use 2 separate arrays without union or maybe make the arrays
> part of the union instead of the array elements

Chose the first for lower code complexity; this is what I meant by a
static double array. Pushed, thanks.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel