I'm sorry that I'm so at replying. On 2013-03-29 John Reiser wrote: > Below is a summary of encoding costs that I gleaned by inspecting the > code. Notable: > A match of length 3 is no shorter than 3 literal bytes when > 64K<=offset. A match of length 1 at rep0==offset is an important > special case.
Your calculations look correct to me. They are useful as is when in fast mode, but in normal mode it's not so simple. In the normal mode one shouldn't make so simplified assumptions about the costs. > LZMA encoding costs (in bits) before RangeEncoder (after > RangeDecoder.) RangeEncoder often reduces the cost in bits, but it > depends on history and is difficult to compute. [On average is it a > small constant factor?] A key thing in the normal mode (compared to the fast mode) is that the algorithm takes into account the costs after the range encoding (the code uses the term "price"). Up to 4 KiB of uncompressed data is analyzed at a time. The cheapest combination of LZMA symbols to represent the analyzed range as a whole is chosen. To speed it up, some things are cached in lookup tables that are updated only now and then. This means that the calculation isn't always done with the exact prices as the real prices drift away from the cached values. The price of a symbol depends on the alignment in the uncompressed data via pos_state (position bits (pb) setting). Alignment also affects literal encoding via literal position bits (lp), but that is usually zero to indicate one-byte alignment. The price of a symbol also depends on the previous two or three LZMA symbols(*) via the "state" variable. For example, if the situation "the previous symbols were a normal match and a literal, and the current position % pos_state == 3" has occurred several times earlier and in most cases the next symbol has been for example a repeated match, encoding a repeated match in such a situation has become a little bit cheaper than it would be in the base state. This may make the encoder choose, for example, a shorter repeated match over a longer normal match in the same situation in the future. (Not a great example but you get the idea, I hope.) (*) lzma_common.h seems to talk about events, but later I've switched to using the term "LZMA symbol" or plain "symbol" to mean a literal, normal match, or repeated match. There are variables named "symbol" too, but I don't speak about those in this email. Some code cleanup would be good to do. :-| If you want to read the normal mode code, see LZMAEncoderNormal.java and other files in XZ for Java. Those are way more readable than the equivalent code currently in XZ Utils even if you had never seen Java code before. Your original question was about what kind of "obviously bad" matches a match finder could throw away. The results you calculated might be useful for the fast mode, but using those for the normal mode may harm the compression ratio. Maybe with some safety margins it would work well, but that is purely a guess. You could try different values or you could even analyze what kind of symbol combinations the encoder currently creates. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode