I'm sorry that I'm so at replying.

On 2013-03-29 John Reiser wrote:
> Below is a summary of encoding costs that I gleaned by inspecting the
> code. Notable:
>   A match of length 3 is no shorter than 3 literal bytes when
> 64K<=offset. A match of length 1 at rep0==offset is an important
> special case.

Your calculations look correct to me. They are useful as is when in fast
mode, but in normal mode it's not so simple. In the normal mode one
shouldn't make so simplified assumptions about the costs.

> LZMA encoding costs (in bits) before RangeEncoder (after
> RangeDecoder.) RangeEncoder often reduces the cost in bits, but it
> depends on history and is difficult to compute.  [On average is it a
> small constant factor?]

A key thing in the normal mode (compared to the fast mode) is that
the algorithm takes into account the costs after the range encoding (the
code uses the term "price"). Up to 4 KiB of uncompressed data is
analyzed at a time. The cheapest combination of LZMA symbols to
represent the analyzed range as a whole is chosen. To speed it up, some
things are cached in lookup tables that are updated only now and then.
This means that the calculation isn't always done with the exact prices
as the real prices drift away from the cached values.

The price of a symbol depends on the alignment in the uncompressed data
via pos_state (position bits (pb) setting). Alignment also affects
literal encoding via literal position bits (lp), but that is usually
zero to indicate one-byte alignment.

The price of a symbol also depends on the previous two or three
LZMA symbols(*) via the "state" variable. For example, if the situation
"the previous symbols were a normal match and a literal, and the current
position % pos_state == 3" has occurred several times earlier and in
most cases the next symbol has been for example a repeated match,
encoding a repeated match in such a situation has become a little bit
cheaper than it would be in the base state. This may make the encoder
choose, for example, a shorter repeated match over a longer normal match
in the same situation in the future. (Not a great example but you get
the idea, I hope.)

(*) lzma_common.h seems to talk about events, but later I've switched
    to using the term "LZMA symbol" or plain "symbol" to mean a
    literal, normal match, or repeated match. There are variables
    named "symbol" too, but I don't speak about those in this email.
    Some code cleanup would be good to do. :-|

If you want to read the normal mode code, see LZMAEncoderNormal.java
and other files in XZ for Java. Those are way more readable than the
equivalent code currently in XZ Utils even if you had never seen Java
code before.

Your original question was about what kind of "obviously bad" matches a
match finder could throw away. The results you calculated might be
useful for the fast mode, but using those for the normal mode may harm
the compression ratio. Maybe with some safety margins it would work
well, but that is purely a guess. You could try different values or you
could even analyze what kind of symbol combinations the encoder
currently creates.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

Reply via email to