On 2013-03-25, Lasse Collin wrote: > On 2013-03-18 John Reiser wrote: >> 2) for each of the four match lengths 2,3,4, and 5: the maximum offset >> that yields a savings when encoded, but ignoring the possibility of >> special savings due to using one of the four most-recent offsets. >> For instance, gzip won't even consider any offset greater than 4096 >> ("TOO_FAR") for gzip's minimum match length of 3. > > I don't have a complete answer for that at the moment. In fast mode, > distances longer than 128 bytes are ignored when length is 2 bytes. In > other cases it's more complicated.
Below is a summary of encoding costs that I gleaned by inspecting the code. Notable: A match of length 3 is no shorter than 3 literal bytes when 64K<=offset. A match of length 1 at rep0==offset is an important special case. ----- LZMA encoding costs (in bits) before RangeEncoder (after RangeDecoder.) RangeEncoder often reduces the cost in bits, but it depends on history and is difficult to compute. [On average is it a small constant factor?] cost stream Decoding (liblzma/lzma/lzma_decoder.c) ==== ========= ===================== 9 0xxxxxxxx literal 8-bit byte 2+ 10 general match; Offset and Length follow 4 1100 rep0==offset; 1==len [yes, 1] 4+ 1101 rep0==offset; Length follows 4+ 1110 rep1==offset; Length follows 5+ 11110 rep2==offset; Length follows 5+ 11111 rep3==offset; Length follows cost stream binary decimal Offset ==== =============== ========= ======== 6 0000xy xy 0.. 3 [yes, 0] 7 00010c d 1cd 4.. 7 8 00011c de 1ced 8.. 15 9 00100c def 1cfed 16.. 31 10 00101c defg 1cgfed 32.. 63 11 00110c defgh 1chgfed 64..127 12 00111c xy defg 1cxygfed 128..255 13 01000c xyz defg 1cxyzgfed 256..511 ... 6+(-2+ bit_width(offset)) 4<=offset cost stream Length (liblzma/lzma/lzma_common.h) ==== ========== ============ 4 0xxx 2<=len<= 9 5 10xxx 10<=len<= 17 10 11xxxxxxxx 18<=len<=273 Two consecutive literal bytes costs 2*9 = 18 bits. A match of length 2 bytes at general offset costs: 12 = (2+ 6+ 4) if offset in {1,2,3} 17 = (2+11+ 4) if 64<=offset<=127 18 = (2+12+ 4) if 128<=offset<=255 19 = (2+13+ 4) if 256<=offset<=511 Therefore a match of length 2 at 128<=offset does not save any space, and a match of length 2 at 256<=offset costs more than two literals. (The side effect of setting rep0 might be worth it overall.) A match of length 3 bytes at general offset costs: 26 = (2+20+ 4) if (1<<15)<=offset < (1<<16) 27 = (2+21+ 4) if (1<<16)<=offset < (1<<17) 28 = (2+22 +4) if (1<<17)<=offset < (1<<18) 3 literal bytes cost no more than a match of length 3 with 64K<=offset. 4 literal bytes cost no more than a match of length 4 with 32M<=offset. A match of length 2 at rep0==offset also can be two matches of length 1. A match of length 18 at general offset is 1 bit shorter if encoded as a match of length 17 followed by a match of length 1 at rep0==offset. --