On Sat, Dec 31, 2011 at 03:09:44PM +0100, Christophe Gisquet wrote:
> Hi,
>
> first of all, a disclaimer: git is not my cup of tea. While this is
> presented as a series, I'll get back to working serially on each patch
> afterwards.
It would be easier to review as a lesser number of patches though.
> In addition, I'd like to focus on algorithm changes first and once
> this has stabilized, consider the rest.
>
> Here is a series of patches related to dequantization in RV 30/40:
> 0) MMX2 and SSE2 versions of the 4x4 dequant function
> Given the DCT and dequant formulas, I assumed the dequant was ok with
> 16bits intermediate (the neon code does not).
It's better to verify analytically though.
> This patch also adds the build system
> C is around 77 cycles, MMX is 31 and SSE2 is around 30 (using
> START/STOP_TIMER). If no further optimization are possible with the
> later, I don't see its point.
> This is around a 4% speed improvement here.
> 1) Move QP look-up out of loop
> This is in fact mostly cosmetical as I guess the compiler already does that.
> 2) Check for DC-only blocks by modifying how the first 2x2 subblock is handled
> This is a preliminary work for DC-only optimizations. It does not feel
> natural so I may have missed other opportunities.
> 3) DC-only dequantization+iDCT
> Classical optimization already present in H.264 and VC1 decoders at least
> The dequantization (of only the DC coeff) actually happens outside of
> the DSP function to help factorizing code
> 4) MMX2 optimization for the former
> This is hardly worth it: this goes down from around 29 cycles to 25.
>
> The whole provides a 7% speed improvement with unoptimized iDCT.
>
> Best regards,
> Christophe
> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 1a126be..7cbd798 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -1145,7 +1145,9 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> blkoff = ((i & 1) << 2) + ((i & 4) << 3);
> if(cbp & 1)
> rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs,
> r->luma_vlc, 0);
> + //{START_TIMER
> r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff,
> rv34_qscale_tab[s->qscale],rv34_qscale_tab[s->qscale]);
> + //STOP_TIMER("first")}
> if(r->is16) //FIXME: optimize
> s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 1)];
> r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> @@ -1157,7 +1159,9 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> blknum = ((i & 4) >> 2) + 4;
> blkoff = ((i & 1) << 2) + ((i & 2) << 4);
> rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs,
> r->chroma_vlc, 1);
> + //{START_TIMER
> r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff,
> rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]],rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]]);
> + //STOP_TIMER("second")}
> r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> }
> if (IS_INTRA(s->current_picture_ptr->f.mb_type[mb_pos]))
This patch is definitely unneeded.
> diff --git a/libavcodec/rv34dsp.c b/libavcodec/rv34dsp.c
> index 974bf9e..302cb17 100644
> --- a/libavcodec/rv34dsp.c
> +++ b/libavcodec/rv34dsp.c
> @@ -122,4 +122,8 @@ av_cold void ff_rv34dsp_init(RV34DSPContext *c,
> DSPContext* dsp) {
>
> if (HAVE_NEON)
> ff_rv34dsp_init_neon(c, dsp);
> +#ifdef HAVE_YASM
> + if (HAVE_MMX)
> + ff_rv34dsp_init_x86(c, dsp);
> +#endif
> }
Sorry, this one should be done differently - drop #ifdef here and add it to
x86/rv34dsp.c. The logic is that only code in libavcodec/x86 should care about
Yasm present or not (there's one exception but it's really ugly).
> diff --git a/libavcodec/rv34dsp.h b/libavcodec/rv34dsp.h
> index 01352ea..b4de5f3 100644
> --- a/libavcodec/rv34dsp.h
> +++ b/libavcodec/rv34dsp.h
> @@ -67,6 +67,7 @@ void ff_rv34dsp_init(RV34DSPContext *c, DSPContext* dsp);
> void ff_rv40dsp_init(RV34DSPContext *c, DSPContext* dsp);
>
> void ff_rv34dsp_init_neon(RV34DSPContext *c, DSPContext *dsp);
> +void ff_rv34dsp_init_x86(RV34DSPContext *c, DSPContext *dsp);
>
> void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp);
> void ff_rv40dsp_init_neon(RV34DSPContext *c, DSPContext *dsp);
trivial, so OK
> diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
> index aa97942..40b3457 100644
> --- a/libavcodec/x86/Makefile
> +++ b/libavcodec/x86/Makefile
> @@ -24,7 +24,11 @@ YASM-OBJS-$(CONFIG_H264PRED) +=
> x86/h264_intrapred.o \
> x86/h264_intrapred_10bit.o
> MMX-OBJS-$(CONFIG_H264PRED) += x86/h264_intrapred_init.o
>
> -MMX-OBJS-$(CONFIG_RV40_DECODER) += x86/rv40dsp.o \
> +YASM-OBJS-$(CONFIG_RV30_DECODER) += x86/rv34dsp_init.o \
> + x86/rv34dsp.o
>
> +MMX-OBJS-$(CONFIG_RV40_DECODER) += x86/rv40dsp.o
> +YASM-OBJS-$(CONFIG_RV40_DECODER) += x86/rv34dsp_init.o \
> + x86/rv34dsp.o
>
>
> YASM-OBJS-$(CONFIG_VC1_DECODER) += x86/vc1dsp_yasm.o
>
Should be changed to have ff_rv34dsp_init_x86() compiled unded HAVE_MMX
> diff --git a/libavcodec/x86/rv34dsp.asm b/libavcodec/x86/rv34dsp.asm
> new file mode 100644
> index 0000000..7d51a55
> --- /dev/null
> +++ b/libavcodec/x86/rv34dsp.asm
> @@ -0,0 +1,114 @@
> +;******************************************************************************
> +;* MMX/SSE2-optimized functions for the RV30 and RV40 decoders
> +;* Copyright (C) 2012 <[email protected]>
Your name here wouldn't hurt either.
I don't review that since there are more experienced people.
[...]
> diff --git a/libavcodec/x86/rv34dsp_init.c b/libavcodec/x86/rv34dsp_init.c
> new file mode 100644
> index 0000000..2d4f8f3
> --- /dev/null
> +++ b/libavcodec/x86/rv34dsp_init.c
> @@ -0,0 +1,43 @@
> +/*
> + * RV30/40 MMX/SSE2 optimizations
> + * Copyright (C) 2012 Christophe Gisquet <[email protected]>
> + *
> + * This file is part of Libav.
> + *
> + * Libav is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * Libav is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with Libav; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301
> USA
> + */
> +
> +#include "libavutil/cpu.h"
> +#include "libavutil/x86_cpu.h"
> +#include "libavcodec/dsputil.h"
> +#include "libavcodec/rv34dsp.h"
> +
> +void ff_rv34_dequant4x4_mmx2(DCTELEM *block, int Qdc, int Q);
> +void ff_rv34_dequant4x4_sse2(DCTELEM *block, int Qdc, int Q);
> +
> +av_cold void ff_rv34dsp_init_x86(RV34DSPContext* c, DSPContext *dsp)
> +{
> +#if HAVE_YASM
> + int mm_flags = av_get_cpu_flags();
> +
> + if (mm_flags & AV_CPU_FLAG_MMX2) {
> + c->rv34_dequant4x4 = ff_rv34_dequant4x4_mmx2;
> + }
> +
> + if (mm_flags & AV_CPU_FLAG_SSE2) {
> + c->rv34_dequant4x4 = ff_rv34_dequant4x4_sse2;
> + }
> +#endif
> +}
> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 7cbd798..6f3c363 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -1097,6 +1097,7 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> MpegEncContext *s = &r->s;
> GetBitContext *gb = &s->gb;
> int cbp, cbp2;
> + int q_ac, q_dc;
> int i, blknum, blkoff;
> LOCAL_ALIGNED_16(DCTELEM, block16, [64]);
> int luma_dc_quant;
> @@ -1132,10 +1133,11 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> return -1;
>
> luma_dc_quant = r->block_type == RV34_MB_P_MIX16x16 ?
> r->luma_dc_quant_p[s->qscale] : r->luma_dc_quant_i[s->qscale];
> + q_ac = rv34_qscale_tab[s->qscale];
> if(r->is16){
> memset(block16, 0, 64 * sizeof(*block16));
> rv34_decode_block(block16, gb, r->cur_vlcs, 3, 0);
> - rv34_dequant4x4_16x16(block16,
> rv34_qscale_tab[luma_dc_quant],rv34_qscale_tab[s->qscale]);
> + rv34_dequant4x4_16x16(block16, rv34_qscale_tab[luma_dc_quant], q_ac);
> r->rdsp.rv34_inv_transform_tab[1](block16);
> }
>
> @@ -1146,7 +1148,7 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> if(cbp & 1)
> rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs,
> r->luma_vlc, 0);
> //{START_TIMER
> - r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff,
> rv34_qscale_tab[s->qscale],rv34_qscale_tab[s->qscale]);
> + r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_ac);
> //STOP_TIMER("first")}
> if(r->is16) //FIXME: optimize
> s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 1)];
> @@ -1154,13 +1156,15 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> }
> if(r->block_type == RV34_MB_P_MIX16x16)
> r->cur_vlcs = choose_vlc_set(r->si.quant, r->si.vlc_set, 1);
> + q_dc = rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]];
> + q_ac = rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]];
> for(; i < 24; i++, cbp >>= 1){
> if(!(cbp & 1)) continue;
> blknum = ((i & 4) >> 2) + 4;
> blkoff = ((i & 1) << 2) + ((i & 2) << 4);
> rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs,
> r->chroma_vlc, 1);
> //{START_TIMER
> - r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff,
> rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]],rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]]);
> + r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_dc);
> //STOP_TIMER("second")}
> r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> }
This change looks OK
> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 6f3c363..6cf839e 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -230,6 +230,15 @@ static inline void decode_coeff(DCTELEM *dst, int coef,
> int esc, GetBitContext *
> }
>
> /**
> + * Decode a single coefficient.
> + */
> +static inline void decode_subblock1(DCTELEM *dst, int code, GetBitContext
> *gb, VLC *vlc)
> +{
> + int coeff = modulo_three_table[code][0];
> + decode_coeff(dst, coeff, 3, gb, vlc);
> +}
> +
> +/**
> * Decode 2x2 subblock of coefficients.
> */
> static inline void decode_subblock(DCTELEM *dst, int code, const int
> is_block2, GetBitContext *gb, VLC *vlc)
> @@ -262,16 +271,22 @@ static inline void decode_subblock(DCTELEM *dst, int
> code, const int is_block2,
> * o--o
> */
>
> -static inline void rv34_decode_block(DCTELEM *dst, GetBitContext *gb,
> RV34VLC *rvlc, int fc, int sc)
> +static inline int rv34_decode_block(DCTELEM *dst, GetBitContext *gb, RV34VLC
> *rvlc, int fc, int sc)
> {
> - int code, pattern;
> + int code, pattern, has_ac = 1;
>
> code = get_vlc2(gb, rvlc->first_pattern[fc].table, 9, 2);
>
> pattern = code & 0x7;
>
> code >>= 3;
> - decode_subblock(dst, code, 0, gb, &rvlc->coefficient);
> + if (!code || code==27 || code==54 || code==81) {
> + decode_subblock1(dst, code, gb, &rvlc->coefficient);
> + if (!pattern)
> + return 0;
> + has_ac = 0;
> + } else
> + decode_subblock(dst, code, 0, gb, &rvlc->coefficient);
>
> if(pattern & 4){
> code = get_vlc2(gb, rvlc->second_pattern[sc].table, 9, 2);
> @@ -285,7 +300,7 @@ static inline void rv34_decode_block(DCTELEM *dst,
> GetBitContext *gb, RV34VLC *r
> code = get_vlc2(gb, rvlc->third_pattern[sc].table, 9, 2);
> decode_subblock(dst + 8*2+2, code, 0, gb, &rvlc->coefficient);
> }
> -
> + return pattern || has_ac;
> }
>
> /**
weird indentation and does it really make any speed difference?
> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 6cf839e..4123c8c 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -284,6 +284,7 @@ static inline int rv34_decode_block(DCTELEM *dst,
> GetBitContext *gb, RV34VLC *rv
> decode_subblock1(dst, code, gb, &rvlc->coefficient);
> if (!pattern)
> return 0;
> + // Does happen
//So what?
> has_ac = 0;
> } else
> decode_subblock(dst, code, 0, gb, &rvlc->coefficient);
> @@ -1160,28 +1161,41 @@ static int rv34_decode_macroblock(RV34DecContext *r,
> int8_t *intra_types)
> if(!r->is16 && !(cbp & 1)) continue;
> blknum = ((i & 2) >> 1) + ((i & 8) >> 2);
> blkoff = ((i & 1) << 2) + ((i & 4) << 3);
> - if(cbp & 1)
> + if(cbp & 1) {
> rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs,
> r->luma_vlc, 0);
> - //{START_TIMER
> - r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_ac);
> - //STOP_TIMER("first")}
> - if(r->is16) //FIXME: optimize
> - s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 1)];
> - r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> + //START_TIMER
> + r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_ac);
> + //STOP_TIMER("first")
> + if(r->is16) //FIXME: optimize
> + s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) <<
> 1)];
> + r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> + } else if(r->is16) {
> + // This does happen
//Indeed it does, so what?
> + r->rdsp.rv34_idct_dequant4x4_dc(s->block[blknum] + blkoff,
> + block16[(i & 3) | ((i & 0xC) <<
> 1)]);
> + } else {
> + DCTELEM *ptr = s->block[blknum] + blkoff;
> + r->rdsp.rv34_idct_dequant4x4_dc(ptr, (ptr[0] * q_ac + 8) >> 4);
> + }
> }
> if(r->block_type == RV34_MB_P_MIX16x16)
> r->cur_vlcs = choose_vlc_set(r->si.quant, r->si.vlc_set, 1);
> - q_dc = rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]];
> - q_ac = rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]];
> + q_ac = rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]];
> + q_dc = rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]];
> for(; i < 24; i++, cbp >>= 1){
> if(!(cbp & 1)) continue;
> blknum = ((i & 4) >> 2) + 4;
> blkoff = ((i & 1) << 2) + ((i & 2) << 4);
> has_ac = rv34_decode_block(s->block[blknum] + blkoff, gb,
> r->cur_vlcs, r->chroma_vlc, 1);
> - //{START_TIMER
> - r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_dc);
> - //STOP_TIMER("second")}
> - r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> + if (has_ac) {
> + //START_TIMER
> + r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_dc, q_ac);
> + //STOP_TIMER("second")
> + r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> + } else {
> + DCTELEM *ptr = s->block[blknum] + blkoff;
> + r->rdsp.rv34_idct_dequant4x4_dc(ptr, (ptr[0] * q_dc + 8) >> 4);
> + }
> }
> if (IS_INTRA(s->current_picture_ptr->f.mb_type[mb_pos]))
> rv34_output_macroblock(r, intra_types, cbp2, r->is16);
Approach seems good though.
> diff --git a/libavcodec/rv34dsp.c b/libavcodec/rv34dsp.c
> index 302cb17..3fcf4ae 100644
> --- a/libavcodec/rv34dsp.c
> +++ b/libavcodec/rv34dsp.c
> @@ -114,11 +114,22 @@ static void rv34_dequant4x4_c(DCTELEM *block, int Qdc,
> int Q)
> block[j + i*8] = (block[j + i*8] * Q + 8) >> 4;
> }
>
> +static void rv34_idct_dequant4x4_dc_c(DCTELEM *block, int dc)
> +{
> + int i, j;
> +
> + dc = ( 13*13*dc + 0x200 ) >> 10;
weird formatting
> + for (i = 0; i < 4; i++, block += 8)
> + for (j = 0; j < 4; j++)
> + block[j] = dc;
> +}
> +
> av_cold void ff_rv34dsp_init(RV34DSPContext *c, DSPContext* dsp) {
> c->rv34_inv_transform_tab[0] = rv34_inv_transform_c;
> c->rv34_inv_transform_tab[1] = rv34_inv_transform_noround_c;
>
> c->rv34_dequant4x4 = rv34_dequant4x4_c;
> + c->rv34_idct_dequant4x4_dc = rv34_idct_dequant4x4_dc_c;
>
> if (HAVE_NEON)
> ff_rv34dsp_init_neon(c, dsp);
the rest seems passable.
P.S. Maybe you'll send it next time at least in a separate batches, tracking
different patch changes would be hard for us both.
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel