Hi, I've had a closer look at the get_bits and friends now. To me, that
code looks both pretty complex and not very efficient.

My understanding is that the typical and absolutely most important use
case is calls to get_bits with a small constant bit size. Next most
important are calls to get_bits with a small non-constant bit count, and
calls to get_bits_long.

As far as I see, each get_bits call boils down to a separate call to
UPDATE_CACHE (and a bunch of other macros). UPDATE_CACHE does a read
from the input buffer. This read looks expensive; it's potentially
unaligned, and it may need byte swapping. And it's not much of a cache,
since it's typically *not* saved between calls. And get_bits_long is
even worse, with two calls to get_bits, and two of these reads every
time.

I'd suggest instead doing something like this. I'm assuming that within
bytes, bits are read with the most significant bit first, but the
opposite convention is also possible; I'm not sure which is more common.

#define LIKELY(cond) __builtin_expect ((cond) != 0, 1)

typedef struct GetBitContext {
    unsigned long bit_buffer;
    unsigned bits_left;       /* Number of bits left in bit_buffer, at low end */
    const unsigned long *buf; /* Aligned pointer to input data */
    size_t bytes_left;        /* A *byte* count; data left need not be a
                                 multiple of sizeof(long) */
} GetBitContext;

static inline unsigned get_bits (GetBitContext *gb, unsigned n)
{
    if (LIKELY (n <= gb->bits_left)) {
        gb->bits_left -= n;
        return (gb->bit_buffer >> gb->bits_left) & ((1UL<<n) - 1);
    }
    else return get_bits_slow (gb, n);
}

This should give compact inlined code with a comparison, a
well-predicted branch, a shift and a mask which often will be by a
constant. gb->bit_buffer acts as a kind of "cache". Everything hairy
should be done by non-inlined functions which are not as performance
critical. Sketch:

int init_get_bits(GetBitContext *s, const uint8_t *buffer, size_t buf_size)
{
    If buffer is unaligned, read initial bytes into gb->bit_buffer,
    updating gb->bits_left.

    Assign aligned pointer to gb->buf and set gb->bytes_left. 

    It's possible that we get to the end of the buffer before reaching
    alignment; in that case *all* input bytes are copied into
    gb->bit_buffer, gb->bytes_left = 0 and gb->buf = NULL
}
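
To make the above concrete, here's one possible way to write it (a
sketch under my assumed GetBitContext layout, not a drop-in
implementation; error handling is omitted):

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef struct GetBitContext {
    unsigned long bit_buffer;
    unsigned bits_left;       /* Number of bits left in bit_buffer, at low end */
    const unsigned long *buf; /* Aligned pointer to input data */
    size_t bytes_left;        /* Byte count; need not be a multiple of
                                 sizeof(long) */
} GetBitContext;

int init_get_bits(GetBitContext *gb, const uint8_t *buffer, size_t buf_size)
{
    gb->bit_buffer = 0;
    gb->bits_left = 0;

    /* Consume bytes one at a time until the pointer is aligned (or the
       input runs out), shifting them in most-significant-byte first.
       At most sizeof(unsigned long) - 1 bytes, so they always fit. */
    while (buf_size > 0 && ((uintptr_t) buffer % sizeof(unsigned long)) != 0) {
        gb->bit_buffer = (gb->bit_buffer << CHAR_BIT) | *buffer++;
        gb->bits_left += CHAR_BIT;
        buf_size--;
    }

    gb->buf = (buf_size > 0) ? (const unsigned long *) buffer : NULL;
    gb->bytes_left = buf_size;
    return 0;
}
```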

unsigned get_bits_slow (GetBitContext *gb, unsigned n)
{
    unsigned bits;

    assert (n > gb->bits_left);  /* Always true when called from get_bits */

    /* The field crosses a word boundary. Get first piece */
    bits = gb->bit_buffer & ((1UL << gb->bits_left) - 1);
    n -= gb->bits_left;

    if (LIKELY(sizeof(unsigned long) <= gb->bytes_left)) {
        gb->bit_buffer = byteswap_if_needed(*gb->buf++);
        gb->bytes_left -= sizeof(unsigned long);
        gb->bits_left = sizeof(unsigned long) * CHAR_BIT;
    }
    else {
        if (n > gb->bytes_left * CHAR_BIT) {
             reached end of data, return error in some way
        }

        read final bytes into bit_buffer
        gb->bits_left = gb->bytes_left * CHAR_BIT;
        gb->bytes_left = 0;
    }
    /* Get second piece */
    gb->bits_left -= n;
    return (bits << n) | (gb->bit_buffer >> gb->bits_left);
}

Some bonus points over the current implementation: all handling of end
of buffer is in an unlikely block of get_bits_slow, so there's no reason
to distinguish between safe and unsafe versions. And get_bits supports
sizes up to the size of an unsigned long (with no overhead besides
having to call get_bits_slow more often), so there's also no need for
special "long" variants.

skip_bits should be similarly split into a compact and fast inline
skip_bits, and a hairier non-inline skip_bits_slow.
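
Sketch of that split, under the same assumed GetBitContext layout. The
slow path here is only a placeholder: a real version would byteswap the
refilled word (so later get_bits calls see the right bits) and report
running off the end of the buffer:

```c
#include <limits.h>
#include <stddef.h>

#define LIKELY(cond) __builtin_expect ((cond) != 0, 1)

typedef struct GetBitContext {
    unsigned long bit_buffer;
    unsigned bits_left;
    const unsigned long *buf;
    size_t bytes_left;
} GetBitContext;

/* Non-inline slow path: refill whole aligned words until the skip is
   satisfied. Placeholder only; see caveats above. */
static void skip_bits_slow(GetBitContext *gb, unsigned n)
{
    const unsigned word_bits = sizeof(unsigned long) * CHAR_BIT;

    n -= gb->bits_left;
    gb->bits_left = 0;
    while (n > 0 && gb->bytes_left >= sizeof(unsigned long)) {
        gb->bit_buffer = *gb->buf++;   /* should be byteswap_if_needed(...) */
        gb->bytes_left -= sizeof(unsigned long);
        if (n >= word_bits)
            n -= word_bits;
        else {
            gb->bits_left = word_bits - n;
            n = 0;
        }
    }
}

static inline void skip_bits(GetBitContext *gb, unsigned n)
{
    if (LIKELY(n <= gb->bits_left))
        gb->bits_left -= n;   /* fast path: just drop cached bits */
    else
        skip_bits_slow(gb, n);
}
```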

When unsigned long is 64 bits, there will be a function call to
get_bits_slow only once per 64 bits parsed, and then, except when at the
end of the buffer, that's a well-predicted branch (in addition to the
prediction miss in get_bits), a single *aligned* read, possibly
byteswapping, and then the unavoidable shifting needed to put the pieces
from the two words together.

To do get_bits_count, one would need an additional context variable.
Since I'd expect get_bits_count to not be performance critical, and to
avoid extra book-keeping in the critical get_bits, it seems reasonable
to simply keep a pointer to the start of the buffer.
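
Equivalently, init_get_bits could record the total input size instead of
a start pointer; either way the hot path is untouched. A sketch,
assuming a hypothetical total_bytes field (name invented here):

```c
#include <limits.h>
#include <stddef.h>

typedef struct GetBitContext {
    unsigned long bit_buffer;
    unsigned bits_left;
    const unsigned long *buf;
    size_t bytes_left;
    size_t total_bytes;   /* added: total size passed to init_get_bits */
} GetBitContext;

/* Position = bits loaded into bit_buffer so far, minus the bits that
   are still cached there and not yet consumed. */
unsigned get_bits_count(const GetBitContext *gb)
{
    return (gb->total_bytes - gb->bytes_left) * CHAR_BIT - gb->bits_left;
}
```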

I'm sure there are further micro-optimizations, but I hope you get the
idea. If one likes to cache things in registers (and the compiler isn't
good enough at that), one could also do cached variants that copy
gb->bit_buffer and gb->bits_left (and nothing more!) into two local
variables. For that to make any sense, one should then keep those in
registers over multiple get_bits calls, and write them back only when
calling get_bits_slow and at return.
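
That caching pattern might look like this (a sketch; the names echo the
OPEN_READER/CLOSE_READER style already used in the existing bitstream
reader, and the fast-path macro assumes the caller has checked n <=
left, with the slow-path fallback omitted):

```c
#include <stddef.h>

typedef struct GetBitContext {
    unsigned long bit_buffer;
    unsigned bits_left;
    const unsigned long *buf;
    size_t bytes_left;
} GetBitContext;

/* Copy the two hot fields into locals the compiler can keep in registers. */
#define OPEN_READER(gb, buffer, left) \
    unsigned long buffer = (gb)->bit_buffer; \
    unsigned left = (gb)->bits_left

/* Write the locals back; needed only at return or before the slow path. */
#define CLOSE_READER(gb, buffer, left) \
    ((gb)->bit_buffer = (buffer), (gb)->bits_left = (left))

/* Fast-path read on the locals; caller must ensure n <= left. */
#define GET_BITS_FAST(buffer, left, n) \
    ((left) -= (n), ((buffer) >> (left)) & ((1UL << (n)) - 1))
```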

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel
