Re: [FFmpeg-devel] [PATCH] avcodec/proresenc_anatoliy: change quantization scaling to floating point to utilize vectorization
On 27 February 2018 at 21:22, David Murmannwrote: > > On 2/27/2018 9:58 PM, Hendrik Leppkes wrote: > > On Tue, Feb 27, 2018 at 9:35 PM, David Murmann > wrote: > >> Quantization scaling seems to be a slight bottleneck, > >> this change allows the compiler to more easily vectorize > >> the loop. This improves total encoding performance in my > >> tests by about 10-20%. > >> > >> Signed-off-by: David Murmann > >> --- > >> libavcodec/proresenc_anatoliy.c | 12 > >> 1 file changed, 8 insertions(+), 4 deletions(-) > >> > [...] > >> +for (j = 0; j < blocks_per_slice; j++) { > >> +for (i = 0; i < 64; i++) { > >> +block[i] = (float)in[(j << 6) + i] / (float)qmat[i]; > >> +} > >> + > >> +for (i = 1; i < 64; i++) { > >> +int val = block[progressive_scan[i]]; > >> if (val) { > >> encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, > 15)]); > > > > Usually, using float is best avoided. Did you test re-factoring the > > loop structure without changing it to float? > > Yes, the vector instructions don't have integer division, AFAIK, and the > compiler just generates a loop with idivs. This is quite a bit slower > than converting to float, dividing and converting back, if the compiler > uses vector instructions. In the general case this wouldn't be exact, > but since the input values are int16 they should losslessly fit into > float32. On platforms where this auto-vectorization fails this might > actually be quite a bit slower, but I have not seen that in my tests > (though I have only tested on x86_64). > > -- > David Murmann > > da...@btf.de > Telefon +49 (0) 221 82008710 > Fax +49 (0) 221 82008799 > > http://btf.de/ > > -- > btf GmbH | Leyendeckerstr. 27, 50825 Köln | +49 (0) 221 82 00 87 10 > Geschäftsführer: Philipp Käßbohrer & Matthias Murmann | HR Köln | HRB 74707 > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > No, you're going about it the wrong way. Floats should most definitely be avoided in encoders/decoders. Non-deterministic output on platforms is a smaller issue to how they can obliterate performance if compilers emit an actual div instruction. Instead, here's what you can do to make it even faster: replace the division with a multiply + a shift. Keeps the output identical too. I've just sent an old patch of mine (for a different but similar codec) you can work off of - just take the last bit of code there, run it at init to generate the LUTs for all quantizers and then just multiply and shift by looking into the tables you generate. Here's the link: http://ffmpeg.org/pipermail/ffmpeg-devel/2018-February/225867.html ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/proresenc_anatoliy: change quantization scaling to floating point to utilize vectorization
On 2/27/2018 9:58 PM, Hendrik Leppkes wrote: > On Tue, Feb 27, 2018 at 9:35 PM, David Murmannwrote: >> Quantization scaling seems to be a slight bottleneck, >> this change allows the compiler to more easily vectorize >> the loop. This improves total encoding performance in my >> tests by about 10-20%. >> >> Signed-off-by: David Murmann >> --- >> libavcodec/proresenc_anatoliy.c | 12 >> 1 file changed, 8 insertions(+), 4 deletions(-) >> [...] >> +for (j = 0; j < blocks_per_slice; j++) { >> +for (i = 0; i < 64; i++) { >> +block[i] = (float)in[(j << 6) + i] / (float)qmat[i]; >> +} >> + >> +for (i = 1; i < 64; i++) { >> +int val = block[progressive_scan[i]]; >> if (val) { >> encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]); > > Usually, using float is best avoided. Did you test re-factoring the > loop structure without changing it to float? Yes, the vector instructions don't have integer division, AFAIK, and the compiler just generates a loop with idivs. This is quite a bit slower than converting to float, dividing and converting back, if the compiler uses vector instructions. In the general case this wouldn't be exact, but since the input values are int16 they should losslessly fit into float32. On platforms where this auto-vectorization fails this might actually be quite a bit slower, but I have not seen that in my tests (though I have only tested on x86_64). -- David Murmann da...@btf.de Telefon +49 (0) 221 82008710 Fax +49 (0) 221 82008799 http://btf.de/ -- btf GmbH | Leyendeckerstr. 27, 50825 Köln | +49 (0) 221 82 00 87 10 Geschäftsführer: Philipp Käßbohrer & Matthias Murmann | HR Köln | HRB 74707 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH] avcodec/proresenc_anatoliy: change quantization scaling to floating point to utilize vectorization
On Tue, Feb 27, 2018 at 9:35 PM, David Murmannwrote: > Quantization scaling seems to be a slight bottleneck, > this change allows the compiler to more easily vectorize > the loop. This improves total encoding performance in my > tests by about 10-20%. > > Signed-off-by: David Murmann > --- > libavcodec/proresenc_anatoliy.c | 12 > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/libavcodec/proresenc_anatoliy.c > b/libavcodec/proresenc_anatoliy.c > index 0516066163..8b296f6f1b 100644 > --- a/libavcodec/proresenc_anatoliy.c > +++ b/libavcodec/proresenc_anatoliy.c > @@ -232,14 +232,18 @@ static const uint8_t lev_to_cb[10] = { 0x04, 0x0A, > 0x05, 0x06, 0x04, 0x28, > static void encode_ac_coeffs(AVCodecContext *avctx, PutBitContext *pb, > int16_t *in, int blocks_per_slice, int *qmat) > { > +int16_t block[64]; > int prev_run = 4; > int prev_level = 2; > int run = 0, level, code, i, j; > -for (i = 1; i < 64; i++) { > -int indp = progressive_scan[i]; > -for (j = 0; j < blocks_per_slice; j++) { > -int val = QSCALE(qmat, indp, in[(j << 6) + indp]); > +for (j = 0; j < blocks_per_slice; j++) { > +for (i = 0; i < 64; i++) { > +block[i] = (float)in[(j << 6) + i] / (float)qmat[i]; > +} > + > +for (i = 1; i < 64; i++) { > +int val = block[progressive_scan[i]]; > if (val) { > encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]); Usually, using float is best avoided. Did you test re-factoring the loop structure without changing it to float? - Hendrik ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] [PATCH] avcodec/proresenc_anatoliy: change quantization scaling to floating point to utilize vectorization
Quantization scaling seems to be a slight bottleneck, this change allows the compiler to more easily vectorize the loop. This improves total encoding performance in my tests by about 10-20%. Signed-off-by: David Murmann--- libavcodec/proresenc_anatoliy.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/libavcodec/proresenc_anatoliy.c b/libavcodec/proresenc_anatoliy.c index 0516066163..8b296f6f1b 100644 --- a/libavcodec/proresenc_anatoliy.c +++ b/libavcodec/proresenc_anatoliy.c @@ -232,14 +232,18 @@ static const uint8_t lev_to_cb[10] = { 0x04, 0x0A, 0x05, 0x06, 0x04, 0x28, static void encode_ac_coeffs(AVCodecContext *avctx, PutBitContext *pb, int16_t *in, int blocks_per_slice, int *qmat) { +int16_t block[64]; int prev_run = 4; int prev_level = 2; int run = 0, level, code, i, j; -for (i = 1; i < 64; i++) { -int indp = progressive_scan[i]; -for (j = 0; j < blocks_per_slice; j++) { -int val = QSCALE(qmat, indp, in[(j << 6) + indp]); +for (j = 0; j < blocks_per_slice; j++) { +for (i = 0; i < 64; i++) { +block[i] = (float)in[(j << 6) + i] / (float)qmat[i]; +} + +for (i = 1; i < 64; i++) { +int val = block[progressive_scan[i]]; if (val) { encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]); -- 2.16.2 -- btf GmbH | Leyendeckerstr. 27, 50825 Köln | +49 (0) 221 82 00 87 10 Geschäftsführer: Philipp Käßbohrer & Matthias Murmann | HR Köln | HRB 74707 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel