Re: [FFmpeg-devel] [PATCH 1/3] sbc: implement SBC codec (low-complexity subband codec)

Aurelien Jacobs Sun, 17 Dec 2017 13:45:07 -0800

On Mon, Nov 06, 2017 at 04:40:56AM +0000, Rostislav Pehlivanov wrote:
> On 5 November 2017 at 23:35, Aurelien Jacobs <au...@gnuage.org> wrote:
> 
> > This was originally based on libsbc, and was fully integrated into ffmpeg.
> > ---
> >  doc/general.texi                 |   2 +
> >  libavcodec/Makefile              |   4 +
> >  libavcodec/allcodecs.c           |   2 +
> >  libavcodec/arm/Makefile          |   3 +
> >  libavcodec/arm/sbcdsp_armv6.S    | 245 ++++++++++++++
> >  libavcodec/arm/sbcdsp_init_arm.c | 105 ++++++
> >  libavcodec/arm/sbcdsp_neon.S     | 714 ++++++++++++++++++++++++++++++
> > +++++++++
> >  libavcodec/avcodec.h             |   2 +
> >  libavcodec/codec_desc.c          |  12 +
> >  libavcodec/sbc.c                 | 316 +++++++++++++++++
> >  libavcodec/sbc.h                 | 121 +++++++
> >  libavcodec/sbcdec.c              | 469 +++++++++++++++++++++++++
> >  libavcodec/sbcdec_data.c         | 127 +++++++
> >  libavcodec/sbcdec_data.h         |  44 +++
> >  libavcodec/sbcdsp.c              | 569 +++++++++++++++++++++++++++++++
> >  libavcodec/sbcdsp.h              |  86 +++++
> >  libavcodec/sbcdsp_data.c         | 335 ++++++++++++++++++
> >  libavcodec/sbcdsp_data.h         |  57 ++++
> >  libavcodec/sbcenc.c              | 461 +++++++++++++++++++++++++
> >  libavcodec/x86/Makefile          |   2 +
> >  libavcodec/x86/sbcdsp.asm        | 290 ++++++++++++++++
> >  libavcodec/x86/sbcdsp_init.c     |  51 +++
> >  22 files changed, 4017 insertions(+)
> >  create mode 100644 libavcodec/arm/sbcdsp_armv6.S
> >  create mode 100644 libavcodec/arm/sbcdsp_init_arm.c
> >  create mode 100644 libavcodec/arm/sbcdsp_neon.S
> >  create mode 100644 libavcodec/sbc.c
> >  create mode 100644 libavcodec/sbc.h
> >  create mode 100644 libavcodec/sbcdec.c
> >  create mode 100644 libavcodec/sbcdec_data.c
> >  create mode 100644 libavcodec/sbcdec_data.h
> >  create mode 100644 libavcodec/sbcdsp.c
> >  create mode 100644 libavcodec/sbcdsp.h
> >  create mode 100644 libavcodec/sbcdsp_data.c
> >  create mode 100644 libavcodec/sbcdsp_data.h
> >  create mode 100644 libavcodec/sbcenc.c
> >  create mode 100644 libavcodec/x86/sbcdsp.asm
> >  create mode 100644 libavcodec/x86/sbcdsp_init.c
> >
> > diff --git a/libavcodec/avcodec.h b/libavcodec/avcodec.h
> > index c4134424f0..2d541bf64a 100644
> > --- a/libavcodec/avcodec.h
> > +++ b/libavcodec/avcodec.h
> > @@ -632,6 +632,8 @@ enum AVCodecID {
> >      AV_CODEC_ID_ATRAC3AL,
> >      AV_CODEC_ID_ATRAC3PAL,
> >      AV_CODEC_ID_DOLBY_E,
> > +    AV_CODEC_ID_SBC,
> > +    AV_CODEC_ID_MSBC,
> >
> >
> See below.
> 
> 
> >      /* subtitle codecs */
> >      AV_CODEC_ID_FIRST_SUBTITLE = 0x17000,          ///< A dummy ID
> > pointing at the start of subtitle codecs.
> > diff --git a/libavcodec/codec_desc.c b/libavcodec/codec_desc.c
> > index 92bf1d2681..8d613507e0 100644
> > --- a/libavcodec/codec_desc.c
> > +++ b/libavcodec/codec_desc.c
> > @@ -2859,6 +2859,18 @@ static const AVCodecDescriptor codec_descriptors[]
> > = {
> >          .long_name = NULL_IF_CONFIG_SMALL("ADPCM MTAF"),
> >          .props     = AV_CODEC_PROP_LOSSY,
> >      },
> > +    {
> > +        .id        = AV_CODEC_ID_SBC,
> > +        .type      = AVMEDIA_TYPE_AUDIO,
> > +        .name      = "sbc",
> > +        .long_name = NULL_IF_CONFIG_SMALL("SBC (low-complexity subband
> > codec)"),
> > +    },
> > +    {
> > +        .id        = AV_CODEC_ID_MSBC,
> > +        .type      = AVMEDIA_TYPE_AUDIO,
> > +        .name      = "msbc",
> > +        .long_name = NULL_IF_CONFIG_SMALL("mSBC (wideband speech mono
> > SBC)"),
> > +    },
> >
> 
> Is there a bitstream difference between the two? I don't think so, so you
> should instead define FF_PROFILE_SBC_WB and use a single codec ID.


SBC support various samplerates while mSBC is limited to 16 kHz.
I think the only way to declare this properly and to get automatic
conversion to 16 kHz when encoding to mSBC is to have 2 separate
codec ID.
So I kept the 2 separate codec ID.

> > diff --git a/libavcodec/sbc.c b/libavcodec/sbc.c
> > new file mode 100644
> > index 0000000000..99d02cc56a
> > --- /dev/null
> > +++ b/libavcodec/sbc.c
> > @@ -0,0 +1,316 @@
> > +/*
> > + * Bluetooth low-complexity, subband codec (SBC)
> > + *
> > + * Copyright (C) 2017  Aurelien Jacobs <au...@gnuage.org>
> > + * Copyright (C) 2012-2013  Intel Corporation
> > + * Copyright (C) 2008-2010  Nokia Corporation
> > + * Copyright (C) 2004-2010  Marcel Holtmann <mar...@holtmann.org>
> > + * Copyright (C) 2004-2005  Henryk Ploetz <hen...@ploetzli.ch>
> > + * Copyright (C) 2005-2008  Brad Midgley <bmidg...@xmission.com>
> > + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> > 02110-1301 USA
> > + */
> > +
> > +/**
> > + * @file
> > + * SBC common functions for the encoder and decoder
> > + */
> > +
> > +#include "avcodec.h"
> > +#include "sbc.h"
> > +
> > +/*
> > + * Calculates the CRC-8 of the first len bits in data
> > + */
> > +static const uint8_t crc_table[256] = {
> > +    0x00, 0x1D, 0x3A, 0x27, 0x74, 0x69, 0x4E, 0x53,
> > +    0xE8, 0xF5, 0xD2, 0xCF, 0x9C, 0x81, 0xA6, 0xBB,
> > +    0xCD, 0xD0, 0xF7, 0xEA, 0xB9, 0xA4, 0x83, 0x9E,
> > +    0x25, 0x38, 0x1F, 0x02, 0x51, 0x4C, 0x6B, 0x76,
> > +    0x87, 0x9A, 0xBD, 0xA0, 0xF3, 0xEE, 0xC9, 0xD4,
> > +    0x6F, 0x72, 0x55, 0x48, 0x1B, 0x06, 0x21, 0x3C,
> > +    0x4A, 0x57, 0x70, 0x6D, 0x3E, 0x23, 0x04, 0x19,
> > +    0xA2, 0xBF, 0x98, 0x85, 0xD6, 0xCB, 0xEC, 0xF1,
> > +    0x13, 0x0E, 0x29, 0x34, 0x67, 0x7A, 0x5D, 0x40,
> > +    0xFB, 0xE6, 0xC1, 0xDC, 0x8F, 0x92, 0xB5, 0xA8,
> > +    0xDE, 0xC3, 0xE4, 0xF9, 0xAA, 0xB7, 0x90, 0x8D,
> > +    0x36, 0x2B, 0x0C, 0x11, 0x42, 0x5F, 0x78, 0x65,
> > +    0x94, 0x89, 0xAE, 0xB3, 0xE0, 0xFD, 0xDA, 0xC7,
> > +    0x7C, 0x61, 0x46, 0x5B, 0x08, 0x15, 0x32, 0x2F,
> > +    0x59, 0x44, 0x63, 0x7E, 0x2D, 0x30, 0x17, 0x0A,
> > +    0xB1, 0xAC, 0x8B, 0x96, 0xC5, 0xD8, 0xFF, 0xE2,
> > +    0x26, 0x3B, 0x1C, 0x01, 0x52, 0x4F, 0x68, 0x75,
> > +    0xCE, 0xD3, 0xF4, 0xE9, 0xBA, 0xA7, 0x80, 0x9D,
> > +    0xEB, 0xF6, 0xD1, 0xCC, 0x9F, 0x82, 0xA5, 0xB8,
> > +    0x03, 0x1E, 0x39, 0x24, 0x77, 0x6A, 0x4D, 0x50,
> > +    0xA1, 0xBC, 0x9B, 0x86, 0xD5, 0xC8, 0xEF, 0xF2,
> > +    0x49, 0x54, 0x73, 0x6E, 0x3D, 0x20, 0x07, 0x1A,
> > +    0x6C, 0x71, 0x56, 0x4B, 0x18, 0x05, 0x22, 0x3F,
> > +    0x84, 0x99, 0xBE, 0xA3, 0xF0, 0xED, 0xCA, 0xD7,
> > +    0x35, 0x28, 0x0F, 0x12, 0x41, 0x5C, 0x7B, 0x66,
> > +    0xDD, 0xC0, 0xE7, 0xFA, 0xA9, 0xB4, 0x93, 0x8E,
> > +    0xF8, 0xE5, 0xC2, 0xDF, 0x8C, 0x91, 0xB6, 0xAB,
> > +    0x10, 0x0D, 0x2A, 0x37, 0x64, 0x79, 0x5E, 0x43,
> > +    0xB2, 0xAF, 0x88, 0x95, 0xC6, 0xDB, 0xFC, 0xE1,
> > +    0x5A, 0x47, 0x60, 0x7D, 0x2E, 0x33, 0x14, 0x09,
> > +    0x7F, 0x62, 0x45, 0x58, 0x0B, 0x16, 0x31, 0x2C,
> > +    0x97, 0x8A, 0xAD, 0xB0, 0xE3, 0xFE, 0xD9, 0xC4
> > +};
> > +
> > +uint8_t ff_sbc_crc8(const uint8_t *data, size_t len)
> > +{
> > +    uint8_t crc = 0x0f;
> > +    size_t i;
> > +    uint8_t octet;
> > +
> > +    for (i = 0; i < len / 8; i++)
> > +        crc = crc_table[crc ^ data[i]];
> > +
> > +    octet = data[i];
> > +    for (i = 0; i < len % 8; i++) {
> > +        char bit = ((octet ^ crc) & 0x80) >> 7;
> > +
> > +        crc = ((crc & 0x7f) << 1) ^ (bit ? 0x1d : 0);
> > +
> > +        octet = octet << 1;
> > +    }
> > +
> > +    return crc;
> > +}
> >
> 
> 
> We have CRC functions already, look in libavutil/crc.h

I know this and I tried to use them but I couldn't get them to behave
the same as this ff_sbc_crc8.

> > +                        if (subbands == 4)
> > +                            loudness = frame->scale_factor[ch][sb] -
> > sbc_offset4[sf][sb];
> > +                        else
> > +                            loudness = frame->scale_factor[ch][sb] -
> > sbc_offset8[sf][sb];
> > +                        if (loudness > 0)
> > +                            bitneed[ch][sb] = loudness / 2;
> > +                        else
> > +                            bitneed[ch][sb] = loudness;
> >
> 
> 
> bitneed[ch][sb] = loudness >> (loudness > 0);

OK.

> > +
> > +static int sbc_decode_frame(AVCodecContext *avctx,
> > +                            void *data, int *got_frame_ptr,
> > +                            AVPacket *avpkt)
> > +{
> > +    SBCDecContext *sbc = avctx->priv_data;
> > +    int i, ch, samples, ret;
> > +    AVFrame *frame = data;
> > +    int16_t *ptr;
> > +
> > +    if (!sbc)
> > +        return AVERROR(EIO);
> > +
> > +    sbc->frame.length = sbc->unpack_frame(avpkt->data, &sbc->frame,
> > avpkt->size);
> > +    if (sbc->frame.length <= 0)
> > +        return sbc->frame.length;
> > +
> > +    samples = sbc_synthesize_audio(&sbc->dsp, &sbc->frame);
> > +
> > +    frame->nb_samples = samples;
> > +    if ((ret = ff_get_buffer(avctx, frame, 0)) < 0)
> > +        return ret;
> > +    ptr = (int16_t *)frame->data[0];
> > +
> > +    for (i = 0; i < samples; i++)
> > +        for (ch = 0; ch < sbc->frame.channels; ch++)
> > +            *ptr++ = sbc->frame.pcm_sample[ch][i];
> > +
> >
> 
> Once again, use planar sample formats

Done.

> > +    *got_frame_ptr = 1;
> > +
> > +    return sbc->frame.length;
> > +}
> > +
> > +#if CONFIG_SBC_DECODER
> > +AVCodec ff_sbc_decoder = {
> > +    .name                  = "sbc",
> > +    .long_name             = NULL_IF_CONFIG_SMALL("SBC (low-complexity
> > subband codec)"),
> > +    .type                  = AVMEDIA_TYPE_AUDIO,
> > +    .id                    = AV_CODEC_ID_SBC,
> > +    .priv_data_size        = sizeof(SBCDecContext),
> > +    .init                  = sbc_decode_init,
> > +    .decode                = sbc_decode_frame,
> > +    .capabilities          = AV_CODEC_CAP_DR1,
> > +    .channel_layouts       = (const uint64_t[]) { AV_CH_LAYOUT_MONO,
> > +                                                  AV_CH_LAYOUT_STEREO, 0},
> > +    .sample_fmts           = (const enum AVSampleFormat[]) {
> > AV_SAMPLE_FMT_S16,
> >
> 
> 
> AV_SAMPLE_FMT_S16P

Done.

> > +
> >  AV_SAMPLE_FMT_NONE },
> > +    .supported_samplerates = (const int[]) { 16000, 32000, 44100, 48000,
> > 0 },
> > +};
> > +#endif
> > +
> > +#if CONFIG_MSBC_DECODER
> > +AVCodec ff_msbc_decoder = {
> > +    .name                  = "msbc",
> > +    .long_name             = NULL_IF_CONFIG_SMALL("mSBC (wideband speech
> > mono SBC)"),
> > +    .type                  = AVMEDIA_TYPE_AUDIO,
> > +    .id                    = AV_CODEC_ID_MSBC,
> > +    .priv_data_size        = sizeof(SBCDecContext),
> > +    .init                  = msbc_decode_init,
> > +    .decode                = sbc_decode_frame,
> > +    .capabilities          = AV_CODEC_CAP_DR1,
> > +    .channel_layouts       = (const uint64_t[]) { AV_CH_LAYOUT_MONO, 0},
> > +    .sample_fmts           = (const enum AVSampleFormat[]) {
> > AV_SAMPLE_FMT_S16,
> >
> 
> 
> AV_SAMPLE_FMT_S16P

Done.

> > +
> > +/*
> > + * A reference C code of analysis filter with SIMD-friendly tables
> > + * reordering and code layout. This code can be used to develop platform
> > + * specific SIMD optimizations. Also it may be used as some kind of test
> > + * for compiler autovectorization capabilities (who knows, if the compiler
> > + * is very good at this stuff, hand optimized assembly may be not strictly
> > + * needed for some platform).
> > + *
> > + * Note: It is also possible to make a simple variant of analysis filter,
> > + * which needs only a single constants table without taking care about
> > + * even/odd cases. This simple variant of filter can be implemented
> > without
> > + * input data permutation. The only thing that would be lost is the
> > + * possibility to use pairwise SIMD multiplications. But for some simple
> > + * CPU cores without SIMD extensions it can be useful. If anybody is
> > + * interested in implementing such variant of a filter, sourcecode from
> > + * bluez versions 4.26/4.27 can be used as a reference and the history of
> > + * the changes in git repository done around that time may be worth
> > checking.
> > + */
> > +
> > +static void sbc_analyze_4_simd(const int16_t *in, int32_t *out,
> > +                               const int16_t *consts)
> > +{
> > +    int32_t t1[4];
> > +    int16_t t2[4];
> > +    int hop = 0;
> > +
> > +    /* rounding coefficient */
> > +    t1[0] = t1[1] = t1[2] = t1[3] =
> > +        (int32_t) 1 << (SBC_PROTO_FIXED4_SCALE - 1);
> > +
> > +    /* low pass polyphase filter */
> > +    for (hop = 0; hop < 40; hop += 8) {
> > +        t1[0] += (int32_t) in[hop] * consts[hop];
> > +        t1[0] += (int32_t) in[hop + 1] * consts[hop + 1];
> > +        t1[1] += (int32_t) in[hop + 2] * consts[hop + 2];
> > +        t1[1] += (int32_t) in[hop + 3] * consts[hop + 3];
> > +        t1[2] += (int32_t) in[hop + 4] * consts[hop + 4];
> > +        t1[2] += (int32_t) in[hop + 5] * consts[hop + 5];
> > +        t1[3] += (int32_t) in[hop + 6] * consts[hop + 6];
> > +        t1[3] += (int32_t) in[hop + 7] * consts[hop + 7];
> > +    }
> > +
> > +    /* scaling */
> > +    t2[0] = t1[0] >> SBC_PROTO_FIXED4_SCALE;
> > +    t2[1] = t1[1] >> SBC_PROTO_FIXED4_SCALE;
> > +    t2[2] = t1[2] >> SBC_PROTO_FIXED4_SCALE;
> > +    t2[3] = t1[3] >> SBC_PROTO_FIXED4_SCALE;
> > +
> > +    /* do the cos transform */
> > +    t1[0]  = (int32_t) t2[0] * consts[40 + 0];
> > +    t1[0] += (int32_t) t2[1] * consts[40 + 1];
> > +    t1[1]  = (int32_t) t2[0] * consts[40 + 2];
> > +    t1[1] += (int32_t) t2[1] * consts[40 + 3];
> > +    t1[2]  = (int32_t) t2[0] * consts[40 + 4];
> > +    t1[2] += (int32_t) t2[1] * consts[40 + 5];
> > +    t1[3]  = (int32_t) t2[0] * consts[40 + 6];
> > +    t1[3] += (int32_t) t2[1] * consts[40 + 7];
> > +
> > +    t1[0] += (int32_t) t2[2] * consts[40 + 8];
> > +    t1[0] += (int32_t) t2[3] * consts[40 + 9];
> > +    t1[1] += (int32_t) t2[2] * consts[40 + 10];
> > +    t1[1] += (int32_t) t2[3] * consts[40 + 11];
> > +    t1[2] += (int32_t) t2[2] * consts[40 + 12];
> > +    t1[2] += (int32_t) t2[3] * consts[40 + 13];
> > +    t1[3] += (int32_t) t2[2] * consts[40 + 14];
> > +    t1[3] += (int32_t) t2[3] * consts[40 + 15];
> > +
> > +    out[0] = t1[0] >>
> > +        (SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
> > +    out[1] = t1[1] >>
> > +        (SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
> > +    out[2] = t1[2] >>
> > +        (SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
> > +    out[3] = t1[3] >>
> > +        (SBC_COS_TABLE_FIXED4_SCALE - SCALE_OUT_BITS);
> > +}
> > +
> > +static void sbc_analyze_8_simd(const int16_t *in, int32_t *out,
> > +                               const int16_t *consts)
> > +{
> > +    int32_t t1[8];
> > +    int16_t t2[8];
> > +    int i, hop;
> > +
> > +    /* rounding coefficient */
> > +    t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] =
> > +        (int32_t) 1 << (SBC_PROTO_FIXED8_SCALE-1);
> > +
> > +    /* low pass polyphase filter */
> > +    for (hop = 0; hop < 80; hop += 16) {
> > +        t1[0] += (int32_t) in[hop] * consts[hop];
> > +        t1[0] += (int32_t) in[hop + 1] * consts[hop + 1];
> > +        t1[1] += (int32_t) in[hop + 2] * consts[hop + 2];
> > +        t1[1] += (int32_t) in[hop + 3] * consts[hop + 3];
> > +        t1[2] += (int32_t) in[hop + 4] * consts[hop + 4];
> > +        t1[2] += (int32_t) in[hop + 5] * consts[hop + 5];
> > +        t1[3] += (int32_t) in[hop + 6] * consts[hop + 6];
> > +        t1[3] += (int32_t) in[hop + 7] * consts[hop + 7];
> > +        t1[4] += (int32_t) in[hop + 8] * consts[hop + 8];
> > +        t1[4] += (int32_t) in[hop + 9] * consts[hop + 9];
> > +        t1[5] += (int32_t) in[hop + 10] * consts[hop + 10];
> > +        t1[5] += (int32_t) in[hop + 11] * consts[hop + 11];
> > +        t1[6] += (int32_t) in[hop + 12] * consts[hop + 12];
> > +        t1[6] += (int32_t) in[hop + 13] * consts[hop + 13];
> > +        t1[7] += (int32_t) in[hop + 14] * consts[hop + 14];
> > +        t1[7] += (int32_t) in[hop + 15] * consts[hop + 15];
> > +    }
> > +
> > +    /* scaling */
> > +    t2[0] = t1[0] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[1] = t1[1] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[2] = t1[2] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[3] = t1[3] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[4] = t1[4] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[5] = t1[5] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[6] = t1[6] >> SBC_PROTO_FIXED8_SCALE;
> > +    t2[7] = t1[7] >> SBC_PROTO_FIXED8_SCALE;
> > +
> > +
> > +    /* do the cos transform */
> > +    t1[0] = t1[1] = t1[2] = t1[3] = t1[4] = t1[5] = t1[6] = t1[7] = 0;
> > +
> > +    for (i = 0; i < 4; i++) {
> > +        t1[0] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 0];
> > +        t1[0] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 1];
> > +        t1[1] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 2];
> > +        t1[1] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 3];
> > +        t1[2] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 4];
> > +        t1[2] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 5];
> > +        t1[3] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 6];
> > +        t1[3] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 7];
> > +        t1[4] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 8];
> > +        t1[4] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 9];
> > +        t1[5] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 10];
> > +        t1[5] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 11];
> > +        t1[6] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 12];
> > +        t1[6] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 13];
> > +        t1[7] += (int32_t) t2[i * 2 + 0] * consts[80 + i * 16 + 14];
> > +        t1[7] += (int32_t) t2[i * 2 + 1] * consts[80 + i * 16 + 15];
> > +    }
> > +
> > +    for (i = 0; i < 8; i++)
> > +        out[i] = t1[i] >>
> > +            (SBC_COS_TABLE_FIXED8_SCALE - SCALE_OUT_BITS);
> > +}
> > +
> >
> 
> 
> What does it do here? A PQF into an FFT?
> I might investigate using lavc's fixed point mdct for this maybe, I hate
> custom fixed-point analysis transforms.

Sure, have a go. It would be great if it could be used.

> > +static inline void sbc_analyze_4b_4s_simd(SBCDSPContext *s,
> > +                                          int16_t *x, int32_t *out, int
> > out_stride)
> > +{
> > +    /* Analyze blocks */
> > +    s->sbc_analyze_4(x + 12, out, ff_sbcdsp_analysis_consts_
> > fixed4_simd_odd);
> > +    out += out_stride;
> > +    s->sbc_analyze_4(x + 8, out, ff_sbcdsp_analysis_consts_
> > fixed4_simd_even);
> > +    out += out_stride;
> > +    s->sbc_analyze_4(x + 4, out, ff_sbcdsp_analysis_consts_
> > fixed4_simd_odd);
> > +    out += out_stride;
> > +    s->sbc_analyze_4(x + 0, out, ff_sbcdsp_analysis_consts_
> > fixed4_simd_even);
> > +
> > +    emms_c();
> > +}
> > +
> > +static inline void sbc_analyze_4b_8s_simd(SBCDSPContext *s,
> > +                                          int16_t *x, int32_t *out, int
> > out_stride)
> > +{
> > +    /* Analyze blocks */
> > +    s->sbc_analyze_8(x + 24, out, ff_sbcdsp_analysis_consts_
> > fixed8_simd_odd);
> > +    out += out_stride;
> > +    s->sbc_analyze_8(x + 16, out, ff_sbcdsp_analysis_consts_
> > fixed8_simd_even);
> > +    out += out_stride;
> > +    s->sbc_analyze_8(x + 8, out, ff_sbcdsp_analysis_consts_
> > fixed8_simd_odd);
> > +    out += out_stride;
> > +    s->sbc_analyze_8(x + 0, out, ff_sbcdsp_analysis_consts_
> > fixed8_simd_even);
> > +
> > +    emms_c();
> > +}
> > +
> > +static inline void sbc_analyze_1b_8s_simd_even(SBCDSPContext *s,
> > +                                               int16_t *x, int32_t *out,
> > +                                               int out_stride);
> > +
> > +static inline void sbc_analyze_1b_8s_simd_odd(SBCDSPContext *s,
> > +                                              int16_t *x, int32_t *out,
> > +                                              int out_stride)
> > +{
> > +    s->sbc_analyze_8(x, out, ff_sbcdsp_analysis_consts_fixed8_simd_odd);
> > +    s->sbc_analyze_8s = sbc_analyze_1b_8s_simd_even;
> > +
> > +    emms_c();
> > +}
> > +
> > +static inline void sbc_analyze_1b_8s_simd_even(SBCDSPContext *s,
> > +                                               int16_t *x, int32_t *out,
> > +                                               int out_stride)
> > +{
> > +    s->sbc_analyze_8(x, out, ff_sbcdsp_analysis_consts_fixed8_simd_even);
> > +    s->sbc_analyze_8s = sbc_analyze_1b_8s_simd_odd;
> > +
> > +    emms_c();
> > +}
> > +
> > +#define PCM(i)  AV_RN16(pcm + 2*(i))
> >
> 
> Don't use a define, just substitute it directly.

OK.

> > +
> > +/*
> > + * Internal helper functions for input data processing. In order to get
> > + * optimal performance, it is important to have "nsamples" and "nchannels"
> > + * arguments used with this inline function as compile time constants.
> > + */
> > +
> > +static av_always_inline int sbc_encoder_process_input_s4_internal(
> > +    int position, const uint8_t *pcm, int16_t X[2][SBC_X_BUFFER_SIZE],
> > +    int nsamples, int nchannels)
> > +{
> > +    /* handle X buffer wraparound */
> > +    if (position < nsamples) {
> > +        if (nchannels > 0)
> > +            memcpy(&X[0][SBC_X_BUFFER_SIZE - 40], &X[0][position],
> > +                            36 * sizeof(int16_t));
> > +        if (nchannels > 1)
> > +            memcpy(&X[1][SBC_X_BUFFER_SIZE - 40], &X[1][position],
> > +                            36 * sizeof(int16_t));
> > +        position = SBC_X_BUFFER_SIZE - 40;
> > +    }
> > +
> > +    /* copy/permutate audio samples */
> > +    while ((nsamples -= 8) >= 0) {
> > +        position -= 8;
> > +        if (nchannels > 0) {
> > +            int16_t *x = &X[0][position];
> > +            x[0]  = PCM(0 + 7 * nchannels);
> > +            x[1]  = PCM(0 + 3 * nchannels);
> > +            x[2]  = PCM(0 + 6 * nchannels);
> > +            x[3]  = PCM(0 + 4 * nchannels);
> > +            x[4]  = PCM(0 + 0 * nchannels);
> > +            x[5]  = PCM(0 + 2 * nchannels);
> > +            x[6]  = PCM(0 + 1 * nchannels);
> > +            x[7]  = PCM(0 + 5 * nchannels);
> > +        }
> > +        if (nchannels > 1) {
> > +            int16_t *x = &X[1][position];
> > +            x[0]  = PCM(1 + 7 * nchannels);
> > +            x[1]  = PCM(1 + 3 * nchannels);
> > +            x[2]  = PCM(1 + 6 * nchannels);
> > +            x[3]  = PCM(1 + 4 * nchannels);
> > +            x[4]  = PCM(1 + 0 * nchannels);
> > +            x[5]  = PCM(1 + 2 * nchannels);
> > +            x[6]  = PCM(1 + 1 * nchannels);
> > +            x[7]  = PCM(1 + 5 * nchannels);
> > +        }
> > +        pcm += 16 * nchannels;
> > +    }
> > +
> > +    return position;
> > +}
> > +
> > +static av_always_inline int sbc_encoder_process_input_s8_internal(
> > +    int position, const uint8_t *pcm, int16_t X[2][SBC_X_BUFFER_SIZE],
> > +    int nsamples, int nchannels)
> > +{
> > +    /* handle X buffer wraparound */
> > +    if (position < nsamples) {
> > +        if (nchannels > 0)
> > +            memcpy(&X[0][SBC_X_BUFFER_SIZE - 72], &X[0][position],
> > +                            72 * sizeof(int16_t));
> > +        if (nchannels > 1)
> > +            memcpy(&X[1][SBC_X_BUFFER_SIZE - 72], &X[1][position],
> > +                            72 * sizeof(int16_t));
> > +        position = SBC_X_BUFFER_SIZE - 72;
> > +    }
> > +
> > +    if (position % 16 == 8) {
> > +        position -= 8;
> > +        nsamples -= 8;
> > +        if (nchannels > 0) {
> > +            int16_t *x = &X[0][position];
> > +            x[0]  = PCM(0 + (15-8) * nchannels);
> > +            x[2]  = PCM(0 + (14-8) * nchannels);
> > +            x[3]  = PCM(0 + (8-8) * nchannels);
> > +            x[4]  = PCM(0 + (13-8) * nchannels);
> > +            x[5]  = PCM(0 + (9-8) * nchannels);
> > +            x[6]  = PCM(0 + (12-8) * nchannels);
> > +            x[7]  = PCM(0 + (10-8) * nchannels);
> > +            x[8]  = PCM(0 + (11-8) * nchannels);
> > +        }
> > +        if (nchannels > 1) {
> > +            int16_t *x = &X[1][position];
> > +            x[0]  = PCM(1 + (15-8) * nchannels);
> > +            x[2]  = PCM(1 + (14-8) * nchannels);
> > +            x[3]  = PCM(1 + (8-8) * nchannels);
> > +            x[4]  = PCM(1 + (13-8) * nchannels);
> > +            x[5]  = PCM(1 + (9-8) * nchannels);
> > +            x[6]  = PCM(1 + (12-8) * nchannels);
> > +            x[7]  = PCM(1 + (10-8) * nchannels);
> > +            x[8]  = PCM(1 + (11-8) * nchannels);
> > +        }
> > +
> > +        pcm += 16 * nchannels;
> > +    }
> > +
> > +    /* copy/permutate audio samples */
> > +    while (nsamples >= 16) {
> > +        position -= 16;
> > +        if (nchannels > 0) {
> > +            int16_t *x = &X[0][position];
> > +            x[0]  = PCM(0 + 15 * nchannels);
> > +            x[1]  = PCM(0 + 7 * nchannels);
> > +            x[2]  = PCM(0 + 14 * nchannels);
> > +            x[3]  = PCM(0 + 8 * nchannels);
> > +            x[4]  = PCM(0 + 13 * nchannels);
> > +            x[5]  = PCM(0 + 9 * nchannels);
> > +            x[6]  = PCM(0 + 12 * nchannels);
> > +            x[7]  = PCM(0 + 10 * nchannels);
> > +            x[8]  = PCM(0 + 11 * nchannels);
> > +            x[9]  = PCM(0 + 3 * nchannels);
> > +            x[10] = PCM(0 + 6 * nchannels);
> > +            x[11] = PCM(0 + 0 * nchannels);
> > +            x[12] = PCM(0 + 5 * nchannels);
> > +            x[13] = PCM(0 + 1 * nchannels);
> > +            x[14] = PCM(0 + 4 * nchannels);
> > +            x[15] = PCM(0 + 2 * nchannels);
> > +        }
> > +        if (nchannels > 1) {
> > +            int16_t *x = &X[1][position];
> > +            x[0]  = PCM(1 + 15 * nchannels);
> > +            x[1]  = PCM(1 + 7 * nchannels);
> > +            x[2]  = PCM(1 + 14 * nchannels);
> > +            x[3]  = PCM(1 + 8 * nchannels);
> > +            x[4]  = PCM(1 + 13 * nchannels);
> > +            x[5]  = PCM(1 + 9 * nchannels);
> > +            x[6]  = PCM(1 + 12 * nchannels);
> > +            x[7]  = PCM(1 + 10 * nchannels);
> > +            x[8]  = PCM(1 + 11 * nchannels);
> > +            x[9]  = PCM(1 + 3 * nchannels);
> > +            x[10] = PCM(1 + 6 * nchannels);
> > +            x[11] = PCM(1 + 0 * nchannels);
> > +            x[12] = PCM(1 + 5 * nchannels);
> > +            x[13] = PCM(1 + 1 * nchannels);
> > +            x[14] = PCM(1 + 4 * nchannels);
> > +            x[15] = PCM(1 + 2 * nchannels);
> > +        }
> > +        pcm += 32 * nchannels;
> > +        nsamples -= 16;
> > +    }
> > +
> > +    if (nsamples == 8) {
> > +        position -= 8;
> > +        if (nchannels > 0) {
> > +            int16_t *x = &X[0][position];
> > +            x[-7] = PCM(0 + 7 * nchannels);
> > +            x[1]  = PCM(0 + 3 * nchannels);
> > +            x[2]  = PCM(0 + 6 * nchannels);
> > +            x[3]  = PCM(0 + 0 * nchannels);
> > +            x[4]  = PCM(0 + 5 * nchannels);
> > +            x[5]  = PCM(0 + 1 * nchannels);
> > +            x[6]  = PCM(0 + 4 * nchannels);
> > +            x[7]  = PCM(0 + 2 * nchannels);
> > +        }
> > +        if (nchannels > 1) {
> > +            int16_t *x = &X[1][position];
> > +            x[-7] = PCM(1 + 7 * nchannels);
> > +            x[1]  = PCM(1 + 3 * nchannels);
> > +            x[2]  = PCM(1 + 6 * nchannels);
> > +            x[3]  = PCM(1 + 0 * nchannels);
> > +            x[4]  = PCM(1 + 5 * nchannels);
> > +            x[5]  = PCM(1 + 1 * nchannels);
> > +            x[6]  = PCM(1 + 4 * nchannels);
> > +            x[7]  = PCM(1 + 2 * nchannels);
> > +        }
> > +    }
> > +
> > +    return position;
> > +}
> > +
> > +/*
> > + * Input data processing functions. The data is endian converted if
> > needed,
> > + * channels are deintrleaved and audio samples are reordered for use in
> > + * SIMD-friendly analysis filter function. The results are put into "X"
> > + * array, getting appended to the previous data (or it is better to say
> > + * prepended, as the buffer is filled from top to bottom). Old data is
> > + * discarded when neededed, but availability of (10 * nrof_subbands)
> > + * contiguous samples is always guaranteed for the input to the analysis
> > + * filter. This is achieved by copying a sufficient part of old data
> > + * to the top of the buffer on buffer wraparound.
> > + */
> > +
> > +static int sbc_enc_process_input_4s(int position, const uint8_t *pcm,
> > +                                    int16_t X[2][SBC_X_BUFFER_SIZE],
> > +                                    int nsamples, int nchannels)
> > +{
> > +    if (nchannels > 1)
> > +        return sbc_encoder_process_input_s4_internal(
> > +            position, pcm, X, nsamples, 2);
> > +    else
> > +        return sbc_encoder_process_input_s4_internal(
> > +            position, pcm, X, nsamples, 1);
> > +}
> >
> 
> That's just silly, do
> return sbc_encoder_process_input_s4_internal(position, pcm, X, nsamples, 1
> + (nchannels > 1));

The point was to get the sbc_encoder_process_input_s4_internal inlined
in 2 different ways depending on its last parameter (constant), for
compiler optimization purpose.

> Or better yet remove the wrapper function.

That's what I did, without significant performance difference.
I guess compliers got better at optimizing this kind of code since
it was first written.

> > +
> > +static int sbc_enc_process_input_8s(int position, const uint8_t *pcm,
> > +                                    int16_t X[2][SBC_X_BUFFER_SIZE],
> > +                                    int nsamples, int nchannels)
> > +{
> > +    if (nchannels > 1)
> > +        return sbc_encoder_process_input_s8_internal(
> > +            position, pcm, X, nsamples, 2);
> > +    else
> > +        return sbc_encoder_process_input_s8_internal(
> > +            position, pcm, X, nsamples, 1);
> > +}
> > +
> >
> 
> Same here.

Wrapper removed.

> >
> > +++ b/libavcodec/sbcdsp_data.c
> > @@ -0,0 +1,335 @@
> > +/*
> > + * Bluetooth low-complexity, subband codec (SBC)
> > + *
> > + * Copyright (C) 2017  Aurelien Jacobs <au...@gnuage.org>
> > + * Copyright (C) 2008-2010  Nokia Corporation
> > + * Copyright (C) 2004-2010  Marcel Holtmann <mar...@holtmann.org>
> > + * Copyright (C) 2004-2005  Henryk Ploetz <hen...@ploetzli.ch>
> > + * Copyright (C) 2005-2006  Brad Midgley <bmidg...@xmission.com>
> > + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> > 02110-1301 USA
> > + */
> > +
> > +/**
> > + * @file
> > + * miscellaneous SBC tables
> > + */
> > +
> > +#include "sbcdsp_data.h"
> > +
> > +#define F_PROTO4(x) (int32_t) ((x * 2) * \
> > +    ((int32_t) 1 << (sizeof(int16_t) * CHAR_BIT - 1)) + 0.5)
> > +#define F_COS4(x) (int32_t) ((x) * \
> > +    ((int32_t) 1 << (sizeof(int16_t) * CHAR_BIT - 1)) + 0.5)
> > +#define F_PROTO8(x) (int32_t) ((x * 2) * \
> > +    ((int32_t) 1 << (sizeof(int16_t) * CHAR_BIT - 1)) + 0.5)
> > +#define F_COS8(x) (int32_t) ((x) * \
> > +    ((int32_t) 1 << (sizeof(int16_t) * CHAR_BIT - 1)) + 0.5)
> > +
> >
> 
> 
> We require 8 bit bytes, so s/CHAR_BIT/8/g throughout.

OK.

> +++ b/libavcodec/sbcdsp_data.h
> > @@ -0,0 +1,57 @@
> > +/*
> > + * Bluetooth low-complexity, subband codec (SBC)
> > + *
> > + * Copyright (C) 2017  Aurelien Jacobs <au...@gnuage.org>
> > + * Copyright (C) 2008-2010  Nokia Corporation
> > + * Copyright (C) 2004-2010  Marcel Holtmann <mar...@holtmann.org>
> > + * Copyright (C) 2004-2005  Henryk Ploetz <hen...@ploetzli.ch>
> > + * Copyright (C) 2005-2006  Brad Midgley <bmidg...@xmission.com>
> > + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> > 02110-1301 USA
> > + */
> > +
> > +/**
> > + * @file
> > + * miscellaneous SBC tables
> > + */
> > +
> > +#ifndef AVCODEC_SBCDSP_DATA_H
> > +#define AVCODEC_SBCDSP_DATA_H
> > +
> > +#include "sbc.h"
> > +
> > +#define SBC_PROTO_FIXED4_SCALE      ((sizeof(int16_t) * CHAR_BIT - 1) + 1)
> > +#define SBC_COS_TABLE_FIXED4_SCALE  ((sizeof(int16_t) * CHAR_BIT - 1)    )
> > +#define SBC_PROTO_FIXED8_SCALE      ((sizeof(int16_t) * CHAR_BIT - 1) + 1)
> > +#define SBC_COS_TABLE_FIXED8_SCALE  ((sizeof(int16_t) * CHAR_BIT - 1)    )
> > +
> >
> 
> Same

OK.

> >
> > +
> > +    /* align the last crc byte */
> > +    if (crc_pos % 8)
> > +        crc_header[crc_pos >> 3] <<= 8 - (crc_pos % 8);
> >
> >
> put_bits_align?

I gave it a try but it made the code more complex and less readable,
so I kept the code as it was.

> > +    avpkt->data[3] = ff_sbc_crc8(crc_header, crc_pos);
> > +
> > +    ff_sbc_calculate_bits(frame, bits);
> > +
> > +    for (ch = 0; ch < frame_channels; ch++) {
> > +        for (sb = 0; sb < frame_subbands; sb++) {
> > +            levels[ch][sb] = ((1 << bits[ch][sb]) - 1) <<
> > +                (32 - (frame->scale_factor[ch][sb] +
> > +                    SCALE_OUT_BITS + 2));
> > +            sb_sample_delta[ch][sb] = (uint32_t) 1 <<
> > +                (frame->scale_factor[ch][sb] +
> > +                    SCALE_OUT_BITS + 1);
> > +        }
> > +    }
> > +
> > +    for (blk = 0; blk < frame->blocks; blk++) {
> > +        for (ch = 0; ch < frame_channels; ch++) {
> > +            for (sb = 0; sb < frame_subbands; sb++) {
> > +
> > +                if (bits[ch][sb] == 0)
> > +                    continue;
> > +
> > +                audio_sample = ((uint64_t) levels[ch][sb] *
> > +                    (sb_sample_delta[ch][sb] +
> > +                    frame->sb_sample_f[blk][ch][sb])) >> 32;
> > +
> > +                put_bits(&pb, bits[ch][sb], audio_sample);
> > +            }
> > +        }
> > +    }
> > +
> > +    flush_put_bits(&pb);
> > +
> > +    return (put_bits_count(&pb) + 7) / 8;
> > +}
> > +
> > +static size_t sbc_pack_frame(AVPacket *avpkt, struct sbc_frame *frame,
> > int joint)
> > +{
> > +    int frame_subbands = 4;
> > +
> > +    avpkt->data[0] = SBC_SYNCWORD;
> > +
> > +    avpkt->data[1] = (frame->frequency & 0x03) << 6;
> > +    avpkt->data[1] |= (frame->block_mode & 0x03) << 4;
> > +    avpkt->data[1] |= (frame->mode & 0x03) << 2;
> > +    avpkt->data[1] |= (frame->allocation & 0x01) << 1;
> > +
> >
> 
> Use put_bits?

For just writting flags in one byte ? This seems overkill !

> > +
> > +    if (frame->subbands == 4) {
> > +        if (frame->channels == 1)
> > +            return sbc_pack_frame_internal(avpkt, frame, 4, 1, joint);
> >
> 
> return sbc_pack_frame_internal(avpkt, frame, 4, 1 + (frame->channels == 1),
> joint);
> 
> 
> > +            return sbc_pack_frame_internal(avpkt, frame, 8, 1, joint);
> > +        else
> > +            return sbc_pack_frame_internal(avpkt, frame, 8, 2, joint);
> >
> 
> return sbc_pack_frame_internal(avpkt, frame, 8, 1 + (frame->channels == 1),
> joint);

OK, improved with a single call to sbc_pack_frame_internal.

> > +    }
> > +}
> > +
> > +static size_t msbc_pack_frame(AVPacket *avpkt, struct sbc_frame *frame,
> > int joint)
> > +{
> > +    avpkt->data[0] = MSBC_SYNCWORD;
> > +    avpkt->data[1] = 0;
> > +    avpkt->data[2] = 0;
> > +
> > +    return sbc_pack_frame_internal(avpkt, frame, 8, 1, joint);
> > +}
> > +
> > +static void sbc_encoder_init(bool msbc, SBCDSPContext *s,
> > +                             const struct sbc_frame *frame)
> > +{
> > +    memset(&s->X, 0, sizeof(s->X));
> > +    s->position = (SBC_X_BUFFER_SIZE - frame->subbands * 9) & ~7;
> > +    if (msbc)
> > +        s->increment = 1;
> > +    else
> > +        s->increment = 4;
> 
> +
> >
> 
> Save a line, use a ternary.

OK.

> > +
> > +    sbc->pack_frame = sbc_pack_frame;
> > +
> > +    sbc->frequency = SBC_FREQ_44100;
> >
> 
> 
> Yet in the AVCodec structure the encoder specifies it supports 16khz, 32khz
> and 48khz.

Indeed, forcing to 44100 was a leftover. SBC actually support 16, 32,
44.1 and 48 kHz.

> You should remove the SBC_FREQ macros and use avctx->sample_rate directly.
> Also remove any unsupported samplerates.

Those macros correspond to the actual values that have to be written in
the SBC bitstream.

> > +    sbc->mode = SBC_MODE_STEREO;
> > +    if (sbc->joint_stereo)
> > +        sbc->mode = SBC_MODE_JOINT_STEREO;
> > +    else if (sbc->dual_channel)
> > +        sbc->mode = SBC_MODE_DUAL_CHANNEL;
> > +    sbc->subbands >>= 3;
> > +    sbc->blocks = (sbc->blocks >> 2) - 1;
> > +
> > +    if (!avctx->frame_size)
> > +        avctx->frame_size = 4*(sbc->subbands + 1) * 4*(sbc->blocks + 1);
> > +
> > +    for (int i = 0; avctx->codec->supported_samplerates[i]; i++)
> > +        if (avctx->sample_rate == avctx->codec->supported_samplerates[i])
> > +            sbc->frequency = i;
> > +
> > +    if (avctx->channels == 1)
> > +        sbc->mode = SBC_MODE_MONO;
> > +
> > +    return 0;
> > +}
> > +
> > +static int msbc_encode_init(AVCodecContext *avctx)
> > +{
> > +    SBCEncContext *sbc = avctx->priv_data;
> > +
> > +    sbc->msbc = true;
> > +    sbc->pack_frame = msbc_pack_frame;
> > +
> > +    sbc->frequency = SBC_FREQ_16000;
> > +    sbc->blocks = MSBC_BLOCKS;
> > +    sbc->subbands = SBC_SB_8;
> > +    sbc->mode = SBC_MODE_MONO;
> > +    sbc->allocation = SBC_AM_LOUDNESS;
> > +    sbc->bitpool = 26;
> > +
> > +    if (!avctx->frame_size)
> > +        avctx->frame_size = 8 * MSBC_BLOCKS;
> > +
> >
> 
> Does the encoder actually accept arbitrary custom frame sizes?

Indeed no. Fixed.

> >
> > +
> > +#if CONFIG_SBC_ENCODER
> > +AVCodec ff_sbc_encoder = {
> > +    .name                  = "sbc",
> > +    .long_name             = NULL_IF_CONFIG_SMALL("SBC (low-complexity
> > subband codec)"),
> > +    .type                  = AVMEDIA_TYPE_AUDIO,
> > +    .id                    = AV_CODEC_ID_SBC,
> > +    .priv_data_size        = sizeof(SBCEncContext),
> > +    .init                  = sbc_encode_init,
> > +    .encode2               = sbc_encode_frame,
> > +    .channel_layouts       = (const uint64_t[]) { AV_CH_LAYOUT_MONO,
> > +                                                  AV_CH_LAYOUT_STEREO, 0},
> > +    .sample_fmts           = (const enum AVSampleFormat[]) {
> > AV_SAMPLE_FMT_S16,
> > +
> >  AV_SAMPLE_FMT_NONE },
> >
> 
> Planar?

Not quite. The whole MMX / arm code is written for interlaced input and
I don't plane to rewrite it.

> > +    .supported_samplerates = (const int[]) { 16000, 32000, 44100, 48000,
> > 0 },
> >
> 
> Remove the samplerates the encoder doesn't support.

The encoder actually support all those samplerates.

> Also add the internal codec cap about threadsafe init since the encoder
> doesn't init any global tables to both this and the aptX encoders.

Done.

> > +
> > +;*******************************************************************
> > +;void ff_sbc_analyze_4(const int16_t *in, int32_t *out, const int16_t
> > *consts);
> > +;*******************************************************************
> > +INIT_MMX mmx
> > +cglobal sbc_analyze_4, 3, 3, 4, in, out, consts
> > +    movq          m0, [inq]
> > +    movq          m1, [inq+8]
> > +    pmaddwd       m0, [constsq]
> > +    pmaddwd       m1, [constsq+8]
> > +    paddd         m0, [scale_mask]
> > +    paddd         m1, [scale_mask]
> > +
> > +    movq          m2, [inq+16]
> > +    movq          m3, [inq+24]
> > +    pmaddwd       m2, [constsq+16]
> > +    pmaddwd       m3, [constsq+24]
> > +    paddd         m0, m2
> > +    paddd         m1, m3
> > +
> > +    movq          m2, [inq+32]
> > +    movq          m3, [inq+40]
> > +    pmaddwd       m2, [constsq+32]
> > +    pmaddwd       m3, [constsq+40]
> > +    paddd         m0, m2
> > +    paddd         m1, m3
> > +
> > +    movq          m2, [inq+48]
> > +    movq          m3, [inq+56]
> > +    pmaddwd       m2, [constsq+48]
> > +    pmaddwd       m3, [constsq+56]
> > +    paddd         m0, m2
> > +    paddd         m1, m3
> > +
> > +    movq          m2, [inq+64]
> > +    movq          m3, [inq+72]
> > +    pmaddwd       m2, [constsq+64]
> > +    pmaddwd       m3, [constsq+72]
> > +    paddd         m0, m2
> > +    paddd         m1, m3
> > +
> >
> 
> Loops?
> 
> 
> > +    psrad         m0, 16    ; SBC_PROTO_FIXED4_SCALE
> > +    psrad         m1, 16    ; SBC_PROTO_FIXED4_SCALE
> > +    packssdw      m0, m0
> > +    packssdw      m1, m1
> > +
> > +    movq          m2, m0
> > +    pmaddwd       m0, [constsq+80]
> > +    pmaddwd       m2, [constsq+88]
> > +
> > +    movq          m3, m1
> > +    pmaddwd       m1, [constsq+96]
> > +    pmaddwd       m3, [constsq+104]
> > +    paddd         m0, m1
> > +    paddd         m2, m3
> > +
> > +    movq          [outq  ], m0
> > +    movq          [outq+8], m2
> > +
> > +    RET
> > +
> > +
> > +
> > +;*******************************************************************
> > +;void ff_sbc_analyze_8(const int16_t *in, int32_t *out, const int16_t
> > *consts);
> > +;*******************************************************************
> > +INIT_MMX mmx
> > +cglobal sbc_analyze_8, 3, 3, 4, in, out, consts
> > +    movq          m0, [inq]
> > +    movq          m1, [inq+8]
> > +    movq          m2, [inq+16]
> > +    movq          m3, [inq+24]
> > +    pmaddwd       m0, [constsq]
> > +    pmaddwd       m1, [constsq+8]
> > +    pmaddwd       m2, [constsq+16]
> > +    pmaddwd       m3, [constsq+24]
> > +    paddd         m0, [scale_mask]
> > +    paddd         m1, [scale_mask]
> > +    paddd         m2, [scale_mask]
> > +    paddd         m3, [scale_mask]
> > +
> > +    movq          m4, [inq+32]
> > +    movq          m5, [inq+40]
> > +    movq          m6, [inq+48]
> > +    movq          m7, [inq+56]
> > +    pmaddwd       m4, [constsq+32]
> > +    pmaddwd       m5, [constsq+40]
> > +    pmaddwd       m6, [constsq+48]
> > +    pmaddwd       m7, [constsq+56]
> > +    paddd         m0, m4
> > +    paddd         m1, m5
> > +    paddd         m2, m6
> > +    paddd         m3, m7
> > +
> > +    movq          m4, [inq+64]
> > +    movq          m5, [inq+72]
> > +    movq          m6, [inq+80]
> > +    movq          m7, [inq+88]
> > +    pmaddwd       m4, [constsq+64]
> > +    pmaddwd       m5, [constsq+72]
> > +    pmaddwd       m6, [constsq+80]
> > +    pmaddwd       m7, [constsq+88]
> > +    paddd         m0, m4
> > +    paddd         m1, m5
> > +    paddd         m2, m6
> > +    paddd         m3, m7
> > +
> > +    movq          m4, [inq+96]
> > +    movq          m5, [inq+104]
> > +    movq          m6, [inq+112]
> > +    movq          m7, [inq+120]
> > +    pmaddwd       m4, [constsq+96]
> > +    pmaddwd       m5, [constsq+104]
> > +    pmaddwd       m6, [constsq+112]
> > +    pmaddwd       m7, [constsq+120]
> > +    paddd         m0, m4
> > +    paddd         m1, m5
> > +    paddd         m2, m6
> > +    paddd         m3, m7
> > +
> > +    movq          m4, [inq+128]
> > +    movq          m5, [inq+136]
> > +    movq          m6, [inq+144]
> > +    movq          m7, [inq+152]
> > +    pmaddwd       m4, [constsq+128]
> > +    pmaddwd       m5, [constsq+136]
> > +    pmaddwd       m6, [constsq+144]
> > +    pmaddwd       m7, [constsq+152]
> > +    paddd         m0, m4
> > +    paddd         m1, m5
> > +    paddd         m2, m6
> > +    paddd         m3, m7
> > +
> > +    psrad         m0, 16    ; SBC_PROTO_FIXED8_SCALE
> > +    psrad         m1, 16    ; SBC_PROTO_FIXED8_SCALE
> > +    psrad         m2, 16    ; SBC_PROTO_FIXED8_SCALE
> > +    psrad         m3, 16    ; SBC_PROTO_FIXED8_SCALE
> > +
> > +    packssdw      m0, m0
> > +    packssdw      m1, m1
> > +    packssdw      m2, m2
> > +    packssdw      m3, m3
> > +
> > +    movq          m4, m0
> > +    movq          m5, m0
> > +    pmaddwd       m4, [constsq+160]
> > +    pmaddwd       m5, [constsq+168]
> > +
> > +    movq          m6, m1
> > +    movq          m7, m1
> > +    pmaddwd       m6, [constsq+192]
> > +    pmaddwd       m7, [constsq+200]
> > +    paddd         m4, m6
> > +    paddd         m5, m7
> > +
> > +    movq          m6, m2
> > +    movq          m7, m2
> > +    pmaddwd       m6, [constsq+224]
> > +    pmaddwd       m7, [constsq+232]
> > +    paddd         m4, m6
> > +    paddd         m5, m7
> > +
> > +    movq          m6, m3
> > +    movq          m7, m3
> > +    pmaddwd       m6, [constsq+256]
> > +    pmaddwd       m7, [constsq+264]
> > +    paddd         m4, m6
> > +    paddd         m5, m7
> > +
> > +    movq          [outq  ], m4
> > +    movq          [outq+8], m5
> > +
> > +    movq          m5, m0
> > +    pmaddwd       m0, [constsq+176]
> > +    pmaddwd       m5, [constsq+184]
> > +
> > +    movq          m7, m1
> > +    pmaddwd       m1, [constsq+208]
> > +    pmaddwd       m7, [constsq+216]
> > +    paddd         m0, m1
> > +    paddd         m5, m7
> > +
> > +    movq          m7, m2
> > +    pmaddwd       m2, [constsq+240]
> > +    pmaddwd       m7, [constsq+248]
> > +    paddd         m0, m2
> > +    paddd         m5, m7
> > +
> > +    movq          m7, m3
> > +    pmaddwd       m3, [constsq+272]
> > +    pmaddwd       m7, [constsq+280]
> > +    paddd         m0, m3
> > +    paddd         m5, m7
> > +
> 
> 
> Has the person writing the SIMD seriously not heard of loops?

I guess this person actually did loop unrolling on purpose.

> I see no reason for this to not work on larger registers if loops were used
> here.
> This seems trivial do to properly so if you can't be bothered to fix it
> leave it to me or jamrial to do after the core of the encoder has been
> merged.

I will leave it to you.

> > +    movq          [outq+16], m0
> > +    movq          [outq+24], m5
> > +
> > +    RET
> > +
> > +
> > +;*******************************************************************
> > +;void ff_sbc_calc_scalefactors(int32_t sb_sample_f[16][2][8],
> > +;                              uint32_t scale_factor[2][8],
> > +;                              int blocks, int channels, int subbands)
> > +;*******************************************************************
> > +INIT_MMX mmx
> > +cglobal sbc_calc_scalefactors, 5, 9, 3, sb_sample_f, scale_factor,
> > blocks, channels, subbands, ch, sb, sa, sf, blk
> > +    shl           channelsd, 5
> > +    mov           chq, 0
> > +.loop_1:
> > +    lea           saq, [sb_sample_fq + chq]
> > +    lea           sfq, [scale_factorq + chq]
> > +
> > +    mov           sbd, 0
> > +.loop_2:
> > +    ; blk = (blocks - 1) * 64;
> > +    lea           blkq, [blocksq - 1]
> > +    shl           blkd, 6
> > +
> > +    movq          m0, [scale_mask]
> > +.loop_3:
> > +    movq          m1, [saq+blkq]
> > +    pxor          m2, m2
> > +    pcmpgtd       m1, m2
> > +    paddd         m1, [saq+blkq]
> > +    pcmpgtd       m2, m1
> > +    pxor          m1, m2
> > +
> > +    por           m0, m1
> > +
> > +    sub           blkd, 64
> > +    jns           .loop_3
> > +
> > +    movd          blkd, m0
> > +    psrlq         m0,   32
> > +    bsr           blkd, blkd
> > +    sub           blkd, 15    ; SCALE_OUT_BITS
> > +    mov           [sfq], blkd
> > +
> > +    movd          blkd, m0
> > +    bsr           blkd, blkd
> > +    sub           blkd, 15    ; SCALE_OUT_BITS
> > +    mov           [sfq+4], blkd
> > +
> > +    add           saq, 8
> > +    add           sfq, 8
> > +
> > +    add           sbd, 2
> > +    cmp           sbd, subbandsd
> > +    jl            .loop_2
> > +
> > +    add           chd, 32
> > +    cmp           chd, channelsd
> > +    jl            .loop_1
> > +
> >
> 
> This function's hardly doing SIMD and I would like to see comparison to the
> C version before accepting it. I somehow doubt it'll be faster.

It is actually slightly faster.
Here is the best speed I get encoding one file with default settings.
C version:    speed= 723x
MMX version:  speed= 756x

> > +av_cold void ff_sbcdsp_init_x86(SBCDSPContext *s)
> > +{
> > +    int cpu_flags = av_get_cpu_flags();
> > +
> > +    if (EXTERNAL_MMX(cpu_flags)) {
> > +        s->sbc_analyze_4 = ff_sbc_analyze_4_mmx;
> > +        s->sbc_analyze_8 = ff_sbc_analyze_8_mmx;
> > +        s->sbc_calc_scalefactors = ff_sbc_calc_scalefactors_mmx;
> > +    }
> > +}
> >
> 
> 
> MMX? In this day and age?

Well, this code is not very recent...
But throwing some AVX in there would probably be nice if you feel
inclined.

> Anyway, its mostly not bad, will need some work before its cleaned of
> libsbc's NIH.

Should be better in the patchset that I will send soon.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Re: [FFmpeg-devel] [PATCH 1/3] sbc: implement SBC codec (low-complexity subband codec)

Reply via email to