Re: [libav-devel] [PATCH] Various dequant optimizations for RV30/40

Kostya Shishkov Sat, 31 Dec 2011 09:05:22 -0800

On Sat, Dec 31, 2011 at 03:09:44PM +0100, Christophe Gisquet wrote:
> Hi,
> 
> first of all, a disclaimer: git is not my cup of tea. While this is
> presented as a series, I'll get back to working serially on each patch
> afterwards.


It would be easier to review as a lesser number of patches though.

> In addition, I'd like to focus on algorithm changes first and once
> this has stabilized, consider the rest.
> 
> Here is a series of patches related to dequantization in RV 30/40:
> 0) MMX2 and SSE2 versions of the 4x4 dequant function
> Given the DCT and dequant formulas, I assumed the dequant was ok with
> 16bits intermediate (the neon code does not).

It's better to verify analytically though.

> This patch also adds the build system
> C is around 77 cycles, MMX is 31 and SSE2 is around 30 (using
> START/STOP_TIMER). If no further optimization are possible with the
> later, I don't see its point.
> This is around a 4% speed improvement here.
> 1) Move QP look-up out of loop
> This is in fact mostly cosmetical as I guess the compiler already does that.
> 2) Check for DC-only blocks by modifying how the first 2x2 subblock is handled
> This is a preliminary work for DC-only optimizations. It does not feel
> natural so I may have missed other opportunities.
> 3) DC-only dequantization+iDCT
> Classical optimization already present in H.264 and VC1 decoders at least
> The dequantization (of only the DC coeff) actually happens outside of
> the DSP function to help factorizing code
> 4) MMX2 optimization for the former
> This is hardly worth it: this goes down from around 29 cycles to 25.
> 
> The whole provides a 7% speed improvement with unoptimized iDCT.
> 
> Best regards,
> Christophe

> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 1a126be..7cbd798 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -1145,7 +1145,9 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>          blkoff = ((i & 1) << 2) + ((i & 4) << 3);
>          if(cbp & 1)
>              rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs, 
> r->luma_vlc, 0);
> +        //{START_TIMER
>          r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, 
> rv34_qscale_tab[s->qscale],rv34_qscale_tab[s->qscale]);
> +        //STOP_TIMER("first")}
>          if(r->is16) //FIXME: optimize
>              s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 1)];
>          r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> @@ -1157,7 +1159,9 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>          blknum = ((i & 4) >> 2) + 4;
>          blkoff = ((i & 1) << 2) + ((i & 2) << 4);
>          rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs, 
> r->chroma_vlc, 1);
> +        //{START_TIMER
>          r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, 
> rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]],rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]]);
> +        //STOP_TIMER("second")}
>          r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
>      }
>      if (IS_INTRA(s->current_picture_ptr->f.mb_type[mb_pos]))

This patch is definitely unneeded.

> diff --git a/libavcodec/rv34dsp.c b/libavcodec/rv34dsp.c
> index 974bf9e..302cb17 100644
> --- a/libavcodec/rv34dsp.c
> +++ b/libavcodec/rv34dsp.c
> @@ -122,4 +122,8 @@ av_cold void ff_rv34dsp_init(RV34DSPContext *c, 
> DSPContext* dsp) {
>  
>      if (HAVE_NEON)
>          ff_rv34dsp_init_neon(c, dsp);
> +#ifdef HAVE_YASM
> +    if (HAVE_MMX)
> +        ff_rv34dsp_init_x86(c, dsp);
> +#endif
>  }

Sorry, this one should be done differently - drop #ifdef here and add it to
x86/rv34dsp.c. The logic is that only code in libavcodec/x86 should care about
Yasm present or not (there's one exception but it's really ugly).

> diff --git a/libavcodec/rv34dsp.h b/libavcodec/rv34dsp.h
> index 01352ea..b4de5f3 100644
> --- a/libavcodec/rv34dsp.h
> +++ b/libavcodec/rv34dsp.h
> @@ -67,6 +67,7 @@ void ff_rv34dsp_init(RV34DSPContext *c, DSPContext* dsp);
>  void ff_rv40dsp_init(RV34DSPContext *c, DSPContext* dsp);
>  
>  void ff_rv34dsp_init_neon(RV34DSPContext *c, DSPContext *dsp);
> +void ff_rv34dsp_init_x86(RV34DSPContext *c, DSPContext *dsp);
>  
>  void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp);
>  void ff_rv40dsp_init_neon(RV34DSPContext *c, DSPContext *dsp);

trivial, so OK

> diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
> index aa97942..40b3457 100644
> --- a/libavcodec/x86/Makefile
> +++ b/libavcodec/x86/Makefile
> @@ -24,7 +24,11 @@ YASM-OBJS-$(CONFIG_H264PRED)           += 
> x86/h264_intrapred.o          \
>                                            x86/h264_intrapred_10bit.o
>  MMX-OBJS-$(CONFIG_H264PRED)            += x86/h264_intrapred_init.o
>  
> -MMX-OBJS-$(CONFIG_RV40_DECODER)        += x86/rv40dsp.o                 \
> +YASM-OBJS-$(CONFIG_RV30_DECODER)       += x86/rv34dsp_init.o            \
> +                                          x86/rv34dsp.o                      
>                     
> +MMX-OBJS-$(CONFIG_RV40_DECODER)        += x86/rv40dsp.o
> +YASM-OBJS-$(CONFIG_RV40_DECODER)       += x86/rv34dsp_init.o            \
> +                                          x86/rv34dsp.o                      
>                     
>  
>  YASM-OBJS-$(CONFIG_VC1_DECODER)        += x86/vc1dsp_yasm.o
>  

Should be changed to have ff_rv34dsp_init_x86() compiled unded HAVE_MMX

> diff --git a/libavcodec/x86/rv34dsp.asm b/libavcodec/x86/rv34dsp.asm
> new file mode 100644
> index 0000000..7d51a55
> --- /dev/null
> +++ b/libavcodec/x86/rv34dsp.asm
> @@ -0,0 +1,114 @@
> +;******************************************************************************
> +;* MMX/SSE2-optimized functions for the RV30 and RV40 decoders
> +;* Copyright (C) 2012 <[email protected]>

Your name here wouldn't hurt either.
I don't review that since there are more experienced people.
[...]

> diff --git a/libavcodec/x86/rv34dsp_init.c b/libavcodec/x86/rv34dsp_init.c
> new file mode 100644
> index 0000000..2d4f8f3
> --- /dev/null
> +++ b/libavcodec/x86/rv34dsp_init.c
> @@ -0,0 +1,43 @@
> +/*
> + * RV30/40 MMX/SSE2 optimizations
> + * Copyright (C) 2012 Christophe Gisquet <[email protected]>
> + *
> + * This file is part of Libav.
> + *
> + * Libav is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * Libav is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with Libav; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> + */
> +
> +#include "libavutil/cpu.h"
> +#include "libavutil/x86_cpu.h"
> +#include "libavcodec/dsputil.h"
> +#include "libavcodec/rv34dsp.h"
> +
> +void ff_rv34_dequant4x4_mmx2(DCTELEM *block, int Qdc, int Q);
> +void ff_rv34_dequant4x4_sse2(DCTELEM *block, int Qdc, int Q);
> +
> +av_cold void ff_rv34dsp_init_x86(RV34DSPContext* c, DSPContext *dsp)
> +{
> +#if HAVE_YASM
> +    int mm_flags = av_get_cpu_flags();
> +
> +    if (mm_flags & AV_CPU_FLAG_MMX2) {
> +        c->rv34_dequant4x4 = ff_rv34_dequant4x4_mmx2;
> +    }
> +
> +    if (mm_flags & AV_CPU_FLAG_SSE2) {
> +        c->rv34_dequant4x4 = ff_rv34_dequant4x4_sse2;
> +    }
> +#endif
> +}

> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 7cbd798..6f3c363 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -1097,6 +1097,7 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>      MpegEncContext *s = &r->s;
>      GetBitContext *gb = &s->gb;
>      int cbp, cbp2;
> +    int q_ac, q_dc;
>      int i, blknum, blkoff;
>      LOCAL_ALIGNED_16(DCTELEM, block16, [64]);
>      int luma_dc_quant;
> @@ -1132,10 +1133,11 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>          return -1;
>  
>      luma_dc_quant = r->block_type == RV34_MB_P_MIX16x16 ? 
> r->luma_dc_quant_p[s->qscale] : r->luma_dc_quant_i[s->qscale];
> +    q_ac = rv34_qscale_tab[s->qscale];
>      if(r->is16){
>          memset(block16, 0, 64 * sizeof(*block16));
>          rv34_decode_block(block16, gb, r->cur_vlcs, 3, 0);
> -        rv34_dequant4x4_16x16(block16, 
> rv34_qscale_tab[luma_dc_quant],rv34_qscale_tab[s->qscale]);
> +        rv34_dequant4x4_16x16(block16, rv34_qscale_tab[luma_dc_quant], q_ac);
>          r->rdsp.rv34_inv_transform_tab[1](block16);
>      }
>  
> @@ -1146,7 +1148,7 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>          if(cbp & 1)
>              rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs, 
> r->luma_vlc, 0);
>          //{START_TIMER
> -        r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, 
> rv34_qscale_tab[s->qscale],rv34_qscale_tab[s->qscale]);
> +        r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_ac);
>          //STOP_TIMER("first")}
>          if(r->is16) //FIXME: optimize
>              s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 1)];
> @@ -1154,13 +1156,15 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>      }
>      if(r->block_type == RV34_MB_P_MIX16x16)
>          r->cur_vlcs = choose_vlc_set(r->si.quant, r->si.vlc_set, 1);
> +    q_dc = rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]];
> +    q_ac = rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]];
>      for(; i < 24; i++, cbp >>= 1){
>          if(!(cbp & 1)) continue;
>          blknum = ((i & 4) >> 2) + 4;
>          blkoff = ((i & 1) << 2) + ((i & 2) << 4);
>          rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs, 
> r->chroma_vlc, 1);
>          //{START_TIMER
> -        r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, 
> rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]],rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]]);
> +        r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_dc);
>          //STOP_TIMER("second")}
>          r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
>      }

This change looks OK

> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 6f3c363..6cf839e 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -230,6 +230,15 @@ static inline void decode_coeff(DCTELEM *dst, int coef, 
> int esc, GetBitContext *
>  }
>  
>  /**
> + * Decode a single coefficient.
> + */
> +static inline void decode_subblock1(DCTELEM *dst, int code, GetBitContext 
> *gb, VLC *vlc)
> +{
> +  int coeff = modulo_three_table[code][0];
> +  decode_coeff(dst, coeff, 3, gb, vlc);
> +}
> +
> +/**
>   * Decode 2x2 subblock of coefficients.
>   */
>  static inline void decode_subblock(DCTELEM *dst, int code, const int 
> is_block2, GetBitContext *gb, VLC *vlc)
> @@ -262,16 +271,22 @@ static inline void decode_subblock(DCTELEM *dst, int 
> code, const int is_block2,
>   *  o--o
>   */
>  
> -static inline void rv34_decode_block(DCTELEM *dst, GetBitContext *gb, 
> RV34VLC *rvlc, int fc, int sc)
> +static inline int rv34_decode_block(DCTELEM *dst, GetBitContext *gb, RV34VLC 
> *rvlc, int fc, int sc)
>  {
> -    int code, pattern;
> +    int code, pattern, has_ac = 1;
>  
>      code = get_vlc2(gb, rvlc->first_pattern[fc].table, 9, 2);
>  
>      pattern = code & 0x7;
>  
>      code >>= 3;
> -    decode_subblock(dst, code, 0, gb, &rvlc->coefficient);
> +    if (!code || code==27 || code==54 || code==81) {   
> +      decode_subblock1(dst, code, gb, &rvlc->coefficient);
> +      if (!pattern)
> +        return 0;
> +      has_ac = 0;
> +    } else
> +      decode_subblock(dst, code, 0, gb, &rvlc->coefficient);
>  
>      if(pattern & 4){
>          code = get_vlc2(gb, rvlc->second_pattern[sc].table, 9, 2);
> @@ -285,7 +300,7 @@ static inline void rv34_decode_block(DCTELEM *dst, 
> GetBitContext *gb, RV34VLC *r
>          code = get_vlc2(gb, rvlc->third_pattern[sc].table, 9, 2);
>          decode_subblock(dst + 8*2+2, code, 0, gb, &rvlc->coefficient);
>      }
> -
> +    return pattern || has_ac;
>  }
>  
>  /**

weird indentation and does it really make any speed difference?

> diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
> index 6cf839e..4123c8c 100644
> --- a/libavcodec/rv34.c
> +++ b/libavcodec/rv34.c
> @@ -284,6 +284,7 @@ static inline int rv34_decode_block(DCTELEM *dst, 
> GetBitContext *gb, RV34VLC *rv
>        decode_subblock1(dst, code, gb, &rvlc->coefficient);
>        if (!pattern)
>          return 0;
> +      // Does happen

//So what?

>        has_ac = 0;
>      } else
>        decode_subblock(dst, code, 0, gb, &rvlc->coefficient);
> @@ -1160,28 +1161,41 @@ static int rv34_decode_macroblock(RV34DecContext *r, 
> int8_t *intra_types)
>          if(!r->is16 && !(cbp & 1)) continue;
>          blknum = ((i & 2) >> 1) + ((i & 8) >> 2);
>          blkoff = ((i & 1) << 2) + ((i & 4) << 3);
> -        if(cbp & 1)
> +        if(cbp & 1) {
>              rv34_decode_block(s->block[blknum] + blkoff, gb, r->cur_vlcs, 
> r->luma_vlc, 0);
> -        //{START_TIMER
> -        r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_ac);
> -        //STOP_TIMER("first")}
> -        if(r->is16) //FIXME: optimize
> -            s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 1)];
> -        r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> +            //START_TIMER
> +            r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_ac);
> +            //STOP_TIMER("first")
> +            if(r->is16) //FIXME: optimize
> +                s->block[blknum][blkoff] = block16[(i & 3) | ((i & 0xC) << 
> 1)];
> +            r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> +        } else if(r->is16) {
> +            // This does happen

//Indeed it does, so what?

> +            r->rdsp.rv34_idct_dequant4x4_dc(s->block[blknum] + blkoff,
> +                                            block16[(i & 3) | ((i & 0xC) << 
> 1)]);
> +        } else {
> +            DCTELEM *ptr = s->block[blknum] + blkoff;
> +            r->rdsp.rv34_idct_dequant4x4_dc(ptr, (ptr[0] * q_ac + 8) >> 4);
> +        }
>      }
>      if(r->block_type == RV34_MB_P_MIX16x16)
>          r->cur_vlcs = choose_vlc_set(r->si.quant, r->si.vlc_set, 1);
> -    q_dc = rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]];
> -    q_ac = rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]];
> +    q_ac = rv34_qscale_tab[rv34_chroma_quant[0][s->qscale]];
> +    q_dc = rv34_qscale_tab[rv34_chroma_quant[1][s->qscale]];
>      for(; i < 24; i++, cbp >>= 1){
>          if(!(cbp & 1)) continue;
>          blknum = ((i & 4) >> 2) + 4;
>          blkoff = ((i & 1) << 2) + ((i & 2) << 4);
>          has_ac = rv34_decode_block(s->block[blknum] + blkoff, gb, 
> r->cur_vlcs, r->chroma_vlc, 1);
> -        //{START_TIMER
> -        r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_ac, q_dc);
> -        //STOP_TIMER("second")}
> -        r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> +        if (has_ac) {
> +            //START_TIMER
> +            r->rdsp.rv34_dequant4x4(s->block[blknum] + blkoff, q_dc, q_ac);
> +            //STOP_TIMER("second")
> +            r->rdsp.rv34_inv_transform_tab[0](s->block[blknum] + blkoff);
> +        } else {
> +            DCTELEM *ptr = s->block[blknum] + blkoff;
> +            r->rdsp.rv34_idct_dequant4x4_dc(ptr, (ptr[0] * q_dc + 8) >> 4);
> +        }
>      }
>      if (IS_INTRA(s->current_picture_ptr->f.mb_type[mb_pos]))
>          rv34_output_macroblock(r, intra_types, cbp2, r->is16);

Approach seems good though.

> diff --git a/libavcodec/rv34dsp.c b/libavcodec/rv34dsp.c
> index 302cb17..3fcf4ae 100644
> --- a/libavcodec/rv34dsp.c
> +++ b/libavcodec/rv34dsp.c
> @@ -114,11 +114,22 @@ static void rv34_dequant4x4_c(DCTELEM *block, int Qdc, 
> int Q)
>              block[j + i*8] = (block[j + i*8] * Q + 8) >> 4;
>  }
>  
> +static void rv34_idct_dequant4x4_dc_c(DCTELEM *block, int dc)
> +{
> +    int i, j;
> +
> +    dc = ( 13*13*dc  + 0x200 ) >> 10;

weird formatting

> +    for (i = 0; i < 4; i++, block += 8)
> +        for (j = 0; j < 4; j++)
> +            block[j] = dc;
> +}
> +
>  av_cold void ff_rv34dsp_init(RV34DSPContext *c, DSPContext* dsp) {
>      c->rv34_inv_transform_tab[0] = rv34_inv_transform_c;
>      c->rv34_inv_transform_tab[1] = rv34_inv_transform_noround_c;
>  
>      c->rv34_dequant4x4 = rv34_dequant4x4_c;
> +    c->rv34_idct_dequant4x4_dc = rv34_idct_dequant4x4_dc_c;
>  
>      if (HAVE_NEON)
>          ff_rv34dsp_init_neon(c, dsp);

the rest seems passable.


P.S. Maybe you'll send it next time at least in a separate batches, tracking
different patch changes would be hard for us both.
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel

Re: [libav-devel] [PATCH] Various dequant optimizations for RV30/40

Reply via email to