Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
One minor nitpick about commit message. You could mention which compiler was used to generate code for benchmark. For example Clang 3.7 replaces pow(2,...) with exp2(...) call by itself. So you probably did use gcc. Anyway since it is already merged I guess take my reply as a hint for next time :) Regards, Kacper 17 gru 2015 5:14 PM "Ganesh Ajjanagadde"napisał(a): > On Tue, Dec 15, 2015 at 6:40 PM, Ganesh Ajjanagadde > wrote: > > On Tue, Dec 15, 2015 at 5:25 PM, Ganesh Ajjanagadde > wrote: > >> On Tue, Dec 15, 2015 at 2:23 AM, Michael Niedermayer > wrote: > >>> On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: > > [...] > > diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c > index d998dba..e6023e3 100644 > --- a/libavcodec/nellymoserenc.c > +++ b/libavcodec/nellymoserenc.c > @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext > *avctx) > > /* Generate overlap window */ > ff_init_ff_sine_windows(7); > -for (i = 0; i < POW_TABLE_SIZE; i++) > -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); > +pow_table[0] = 1; > +pow_table[1024] = M_SQRT1_2; > +for (i = 1; i < 513; i++) { > +double tmp = exp2(-i / 2048.0); > +pow_table[i] = tmp; > +pow_table[1024-i] = M_SQRT1_2 / tmp; > +pow_table[1024+i] = tmp * M_SQRT1_2; > +pow_table[2048-i] = 0.5 / tmp; > >>> > >>> how much overall init time is gained by this ? > >>> that is time in ffmpeg main() from start to finish when just opening > >>> the file with no decoding aka ./ffmpeg -i somefile > >> > >> Don't know, all I know is cycles are unnecessarily wasted. Will put in > >> cycle numbers. > >> > > > > Here they are: > > proposed: 424160 decicycles in pow_table, 512 runs, 0 skips > > exp2 only: 1262093 decicycles in pow_table, 512 runs, 0 skips > > old: 2849085 decicycles in pow_table, 512 runs, 0 skips > > > > Thus old to exp2 is roughly 2.25x speedup, exp2 to proposed roughly 3x > > speedup, net ~ 6.7x speedup. > > took Michael's comment as a general ack, so pushed with addition of a > comment and cycle numbers. > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
18 gru 2015 10:06 AM "Kacper Michajlow"napisał(a): > > One minor nitpick about commit message. You could mention which compiler was used to generate code for benchmark. For example Clang 3.7 replaces pow(2,...) with exp2(...) call by itself. So you probably did use gcc. Anyway since it is already merged I guess take my reply as a hint for next time :) > > Regards, > Kacper > > 17 gru 2015 5:14 PM "Ganesh Ajjanagadde" napisał(a): >> >> On Tue, Dec 15, 2015 at 6:40 PM, Ganesh Ajjanagadde wrote: >> > On Tue, Dec 15, 2015 at 5:25 PM, Ganesh Ajjanagadde wrote: >> >> On Tue, Dec 15, 2015 at 2:23 AM, Michael Niedermayer wrote: >> >>> On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: >> > [...] >> >> diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c >> index d998dba..e6023e3 100644 >> --- a/libavcodec/nellymoserenc.c >> +++ b/libavcodec/nellymoserenc.c >> @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) >> >> /* Generate overlap window */ >> ff_init_ff_sine_windows(7); >> -for (i = 0; i < POW_TABLE_SIZE; i++) >> -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); >> +pow_table[0] = 1; >> +pow_table[1024] = M_SQRT1_2; >> +for (i = 1; i < 513; i++) { >> +double tmp = exp2(-i / 2048.0); >> +pow_table[i] = tmp; >> +pow_table[1024-i] = M_SQRT1_2 / tmp; >> +pow_table[1024+i] = tmp * M_SQRT1_2; >> +pow_table[2048-i] = 0.5 / tmp; >> >>> >> >>> how much overall init time is gained by this ? >> >>> that is time in ffmpeg main() from start to finish when just opening >> >>> the file with no decoding aka ./ffmpeg -i somefile >> >> >> >> Don't know, all I know is cycles are unnecessarily wasted. Will put in >> >> cycle numbers. >> >> >> > >> > Here they are: >> > proposed: 424160 decicycles in pow_table, 512 runs, 0 skips >> > exp2 only: 1262093 decicycles in pow_table, 512 runs, 0 skips >> > old: 2849085 decicycles in pow_table, 512 runs, 0 skips >> > >> > Thus old to exp2 is roughly 2.25x speedup, exp2 to proposed roughly 3x >> > speedup, net ~ 6.7x speedup. >> >> took Michael's comment as a general ack, so pushed with addition of a >> comment and cycle numbers. >> ___ >> ffmpeg-devel mailing list >> ffmpeg-devel@ffmpeg.org >> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel Sorry for top post. -Kacper ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
On Fri, Dec 18, 2015 at 1:11 AM, Kacper Michajlowwrote: > 18 gru 2015 10:06 AM "Kacper Michajlow" napisał(a): >> >> One minor nitpick about commit message. You could mention which compiler > was used to generate code for benchmark. For example Clang 3.7 replaces > pow(2,...) with exp2(...) call by itself. So you probably did use gcc. > Anyway since it is already merged I guess take my reply as a hint for next > time :) Thanks: yes, I have been sloppy about this. >> >> Regards, >> Kacper >> >> 17 gru 2015 5:14 PM "Ganesh Ajjanagadde" napisał(a): >>> >>> On Tue, Dec 15, 2015 at 6:40 PM, Ganesh Ajjanagadde > wrote: >>> > On Tue, Dec 15, 2015 at 5:25 PM, Ganesh Ajjanagadde > wrote: >>> >> On Tue, Dec 15, 2015 at 2:23 AM, Michael Niedermayer > wrote: >>> >>> On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: >>> > [...] >>> >>> diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c >>> index d998dba..e6023e3 100644 >>> --- a/libavcodec/nellymoserenc.c >>> +++ b/libavcodec/nellymoserenc.c >>> @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext > *avctx) >>> >>> /* Generate overlap window */ >>> ff_init_ff_sine_windows(7); >>> -for (i = 0; i < POW_TABLE_SIZE; i++) >>> -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + > POW_TABLE_OFFSET); >>> +pow_table[0] = 1; >>> +pow_table[1024] = M_SQRT1_2; >>> +for (i = 1; i < 513; i++) { >>> +double tmp = exp2(-i / 2048.0); >>> +pow_table[i] = tmp; >>> +pow_table[1024-i] = M_SQRT1_2 / tmp; >>> +pow_table[1024+i] = tmp * M_SQRT1_2; >>> +pow_table[2048-i] = 0.5 / tmp; >>> >>> >>> >>> how much overall init time is gained by this ? >>> >>> that is time in ffmpeg main() from start to finish when just opening >>> >>> the file with no decoding aka ./ffmpeg -i somefile >>> >> >>> >> Don't know, all I know is cycles are unnecessarily wasted. Will put in >>> >> cycle numbers. >>> >> >>> > >>> > Here they are: >>> > proposed: 424160 decicycles in pow_table, 512 runs, 0 skips >>> > exp2 only: 1262093 decicycles in pow_table, 512 runs, 0 skips >>> > old: 2849085 decicycles in pow_table, 512 runs, 0 skips >>> > >>> > Thus old to exp2 is roughly 2.25x speedup, exp2 to proposed roughly 3x >>> > speedup, net ~ 6.7x speedup. >>> >>> took Michael's comment as a general ack, so pushed with addition of a >>> comment and cycle numbers. >>> ___ >>> ffmpeg-devel mailing list >>> ffmpeg-devel@ffmpeg.org >>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > Sorry for top post. No problem. > > -Kacper > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
On Tue, Dec 15, 2015 at 6:40 PM, Ganesh Ajjanagaddewrote: > On Tue, Dec 15, 2015 at 5:25 PM, Ganesh Ajjanagadde wrote: >> On Tue, Dec 15, 2015 at 2:23 AM, Michael Niedermayer >> wrote: >>> On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: > [...] diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c index d998dba..e6023e3 100644 --- a/libavcodec/nellymoserenc.c +++ b/libavcodec/nellymoserenc.c @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) /* Generate overlap window */ ff_init_ff_sine_windows(7); -for (i = 0; i < POW_TABLE_SIZE; i++) -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); +pow_table[0] = 1; +pow_table[1024] = M_SQRT1_2; +for (i = 1; i < 513; i++) { +double tmp = exp2(-i / 2048.0); +pow_table[i] = tmp; +pow_table[1024-i] = M_SQRT1_2 / tmp; +pow_table[1024+i] = tmp * M_SQRT1_2; +pow_table[2048-i] = 0.5 / tmp; >>> >>> how much overall init time is gained by this ? >>> that is time in ffmpeg main() from start to finish when just opening >>> the file with no decoding aka ./ffmpeg -i somefile >> >> Don't know, all I know is cycles are unnecessarily wasted. Will put in >> cycle numbers. >> > > Here they are: > proposed: 424160 decicycles in pow_table, 512 runs, 0 skips > exp2 only: 1262093 decicycles in pow_table, 512 runs, 0 skips > old: 2849085 decicycles in pow_table, 512 runs, 0 skips > > Thus old to exp2 is roughly 2.25x speedup, exp2 to proposed roughly 3x > speedup, net ~ 6.7x speedup. took Michael's comment as a general ack, so pushed with addition of a comment and cycle numbers. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
On Tue, Dec 15, 2015 at 2:23 AM, Michael Niedermayerwrote: > On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: >> exp2 suffices here. Some trivial (~ 4x) speedup is done in addition here by >> reusing results. >> > >> This is bit-identical to the old values. > > That may be true with a specific compiler and version but > C makes no such guarantee unless iam missing something Indeed, fpu behavior is largely uncontrolled. What I meant was that under reasonable settings, there is no accuracy loss (tested with clang and gcc). Basically, a priori it is not obvious that this change is ok, but I tested and the values don't change. For that matter, this is a rather minor point - many pow implementations are anyway broken wrt precise < 1 ulp or better yet < 0.5 ulp rounding. > > >> >> Signed-off-by: Ganesh Ajjanagadde >> --- >> libavcodec/nellymoserenc.c | 11 +-- >> 1 file changed, 9 insertions(+), 2 deletions(-) >> >> diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c >> index d998dba..e6023e3 100644 >> --- a/libavcodec/nellymoserenc.c >> +++ b/libavcodec/nellymoserenc.c >> @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) >> >> /* Generate overlap window */ >> ff_init_ff_sine_windows(7); >> -for (i = 0; i < POW_TABLE_SIZE; i++) >> -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); >> +pow_table[0] = 1; >> +pow_table[1024] = M_SQRT1_2; >> +for (i = 1; i < 513; i++) { >> +double tmp = exp2(-i / 2048.0); >> +pow_table[i] = tmp; >> +pow_table[1024-i] = M_SQRT1_2 / tmp; >> +pow_table[1024+i] = tmp * M_SQRT1_2; >> +pow_table[2048-i] = 0.5 / tmp; > > how much overall init time is gained by this ? > that is time in ffmpeg main() from start to finish when just opening > the file with no decoding aka ./ffmpeg -i somefile Don't know, all I know is cycles are unnecessarily wasted. Will put in cycle numbers. > ? > > iam asking as this makes the intend of the loop harder to > see/understand > if this change is done then please add a comment explaining what the > code does or maybe leave the unoptimized code in a comment Comment will be added, sorry. Note that the trick may be employed to a finer resolution, ie by storing exp2(3/4) etc and utilizing in simple multiplications and divisions. There will come a point where the overhead of the binary size increase will dominate. I chose this as this is reasonably readable and improves speed at minimal binary size impact. > > either way replacing pow -> exp2 LGTM > > [...] > -- > Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB > > Why not whip the teacher when the pupil misbehaves? -- Diogenes of Sinope > > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
On Tue, Dec 15, 2015 at 5:25 PM, Ganesh Ajjanagaddewrote: > On Tue, Dec 15, 2015 at 2:23 AM, Michael Niedermayer wrote: >> On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: [...] >>> >>> diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c >>> index d998dba..e6023e3 100644 >>> --- a/libavcodec/nellymoserenc.c >>> +++ b/libavcodec/nellymoserenc.c >>> @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) >>> >>> /* Generate overlap window */ >>> ff_init_ff_sine_windows(7); >>> -for (i = 0; i < POW_TABLE_SIZE; i++) >>> -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); >>> +pow_table[0] = 1; >>> +pow_table[1024] = M_SQRT1_2; >>> +for (i = 1; i < 513; i++) { >>> +double tmp = exp2(-i / 2048.0); >>> +pow_table[i] = tmp; >>> +pow_table[1024-i] = M_SQRT1_2 / tmp; >>> +pow_table[1024+i] = tmp * M_SQRT1_2; >>> +pow_table[2048-i] = 0.5 / tmp; >> >> how much overall init time is gained by this ? >> that is time in ffmpeg main() from start to finish when just opening >> the file with no decoding aka ./ffmpeg -i somefile > > Don't know, all I know is cycles are unnecessarily wasted. Will put in > cycle numbers. > Here they are: proposed: 424160 decicycles in pow_table, 512 runs, 0 skips exp2 only: 1262093 decicycles in pow_table, 512 runs, 0 skips old: 2849085 decicycles in pow_table, 512 runs, 0 skips Thus old to exp2 is roughly 2.25x speedup, exp2 to proposed roughly 3x speedup, net ~ 6.7x speedup. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
On Wed, Dec 9, 2015 at 6:55 PM, Ganesh Ajjanagaddewrote: > exp2 suffices here. Some trivial (~ 4x) speedup is done in addition here by > reusing results. > > This is bit-identical to the old values. > > Signed-off-by: Ganesh Ajjanagadde > --- > libavcodec/nellymoserenc.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c > index d998dba..e6023e3 100644 > --- a/libavcodec/nellymoserenc.c > +++ b/libavcodec/nellymoserenc.c > @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) > > /* Generate overlap window */ > ff_init_ff_sine_windows(7); > -for (i = 0; i < POW_TABLE_SIZE; i++) > -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); > +pow_table[0] = 1; > +pow_table[1024] = M_SQRT1_2; > +for (i = 1; i < 513; i++) { > +double tmp = exp2(-i / 2048.0); > +pow_table[i] = tmp; > +pow_table[1024-i] = M_SQRT1_2 / tmp; > +pow_table[1024+i] = tmp * M_SQRT1_2; > +pow_table[2048-i] = 0.5 / tmp; > +} > > if (s->avctx->trellis) { > s->opt = av_malloc(NELLY_BANDS * OPT_SIZE * sizeof(float )); > -- > 2.6.3 > going in tomorrow in the absence of comments since it lacks a maintainer. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
On Wed, Dec 09, 2015 at 06:55:25PM -0500, Ganesh Ajjanagadde wrote: > exp2 suffices here. Some trivial (~ 4x) speedup is done in addition here by > reusing results. > > This is bit-identical to the old values. That may be true with a specific compiler and version but C makes no such guarantee unless iam missing something > > Signed-off-by: Ganesh Ajjanagadde> --- > libavcodec/nellymoserenc.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c > index d998dba..e6023e3 100644 > --- a/libavcodec/nellymoserenc.c > +++ b/libavcodec/nellymoserenc.c > @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) > > /* Generate overlap window */ > ff_init_ff_sine_windows(7); > -for (i = 0; i < POW_TABLE_SIZE; i++) > -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); > +pow_table[0] = 1; > +pow_table[1024] = M_SQRT1_2; > +for (i = 1; i < 513; i++) { > +double tmp = exp2(-i / 2048.0); > +pow_table[i] = tmp; > +pow_table[1024-i] = M_SQRT1_2 / tmp; > +pow_table[1024+i] = tmp * M_SQRT1_2; > +pow_table[2048-i] = 0.5 / tmp; how much overall init time is gained by this ? that is time in ffmpeg main() from start to finish when just opening the file with no decoding aka ./ffmpeg -i somefile ? iam asking as this makes the intend of the loop harder to see/understand if this change is done then please add a comment explaining what the code does or maybe leave the unoptimized code in a comment either way replacing pow -> exp2 LGTM [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB Why not whip the teacher when the pupil misbehaves? -- Diogenes of Sinope signature.asc Description: Digital signature ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH] avcodec/nellymoserenc: avoid wasteful pow
exp2 suffices here. Some trivial (~ 4x) speedup is done in addition here by reusing results. This is bit-identical to the old values. Signed-off-by: Ganesh Ajjanagadde--- libavcodec/nellymoserenc.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/libavcodec/nellymoserenc.c b/libavcodec/nellymoserenc.c index d998dba..e6023e3 100644 --- a/libavcodec/nellymoserenc.c +++ b/libavcodec/nellymoserenc.c @@ -179,8 +179,15 @@ static av_cold int encode_init(AVCodecContext *avctx) /* Generate overlap window */ ff_init_ff_sine_windows(7); -for (i = 0; i < POW_TABLE_SIZE; i++) -pow_table[i] = pow(2, -i / 2048.0 - 3.0 + POW_TABLE_OFFSET); +pow_table[0] = 1; +pow_table[1024] = M_SQRT1_2; +for (i = 1; i < 513; i++) { +double tmp = exp2(-i / 2048.0); +pow_table[i] = tmp; +pow_table[1024-i] = M_SQRT1_2 / tmp; +pow_table[1024+i] = tmp * M_SQRT1_2; +pow_table[2048-i] = 0.5 / tmp; +} if (s->avctx->trellis) { s->opt = av_malloc(NELLY_BANDS * OPT_SIZE * sizeof(float )); -- 2.6.3 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel