On 2013-03-25, Lasse Collin wrote:
> On 2013-03-18 John Reiser wrote:

>> 2) for each of the four match lengths 2,3,4, and 5: the maximum offset
>>    that yields a savings when encoded, but ignoring the possibility of
>>    special savings due to using one of the four most-recent offsets.
>>    For instance, gzip won't even consider any offset greater than 4096
>>    ("TOO_FAR") for gzip's minimum match length of 3.
> 
> I don't have a complete answer for that at the moment. In fast mode,
> distances longer than 128 bytes are ignored when length is 2 bytes. In
> other cases it's more complicated.

Below is a summary of encoding costs that I gleaned by inspecting the code.
Notable:
  A match of length 3 is no shorter than 3 literal bytes when 64K<=offset.
  A match of length 1 at rep0==offset is an important special case.

-----
LZMA encoding costs (in bits) before RangeEncoder (after RangeDecoder.)
RangeEncoder often reduces the cost in bits, but it depends on history
and is difficult to compute.  [On average is it a small constant factor?]

cost   stream       Decoding   (liblzma/lzma/lzma_decoder.c)
====  =========  =====================
 9    0xxxxxxxx  literal 8-bit byte
 2+   10         general match; Offset and Length follow
 4    1100       rep0==offset;  1==len    [yes, 1]
 4+   1101       rep0==offset;  Length follows
 4+   1110       rep1==offset;  Length follows
 5+   11110      rep2==offset;  Length follows
 5+   11111      rep3==offset;  Length follows


cost  stream             binary    decimal   Offset
====  ===============  =========  ========
  6   0000xy                  xy    0..  3   [yes, 0]
  7   00010c d               1cd    4..  7
  8   00011c de             1ced    8.. 15
  9   00100c def           1cfed   16.. 31
 10   00101c defg         1cgfed   32.. 63
 11   00110c defgh       1chgfed   64..127
 12   00111c xy defg    1cxygfed  128..255
 13   01000c xyz defg  1cxyzgfed  256..511
   ...
 6+(-2+ bit_width(offset))  4<=offset


cost    stream       Length   (liblzma/lzma/lzma_common.h)
====  ==========  ============
  4   0xxx         2<=len<=  9
  5   10xxx       10<=len<= 17
 10   11xxxxxxxx  18<=len<=273


Two consecutive literal bytes costs 2*9 = 18 bits.

A match of length 2 bytes at general offset costs:
 12 = (2+ 6+ 4)  if  offset in {1,2,3}
 17 = (2+11+ 4)  if   64<=offset<=127
 18 = (2+12+ 4)  if  128<=offset<=255
 19 = (2+13+ 4)  if  256<=offset<=511
Therefore a match of length 2 at 128<=offset does not save any space,
and a match of length 2 at 256<=offset costs more than two literals.
(The side effect of setting rep0 might be worth it overall.)

A match of length 3 bytes at general offset costs:
 26 = (2+20+ 4)  if  (1<<15)<=offset < (1<<16)
 27 = (2+21+ 4)  if  (1<<16)<=offset < (1<<17)
 28 = (2+22 +4)  if  (1<<17)<=offset < (1<<18)
3 literal bytes cost no more than a match of length 3 with 64K<=offset.

4 literal bytes cost no more than a match of length 4 with 32M<=offset.

A match of length 2 at rep0==offset also can be two matches of length 1.

A match of length 18 at general offset is 1 bit shorter if encoded
as a match of length 17 followed by a match of length 1 at rep0==offset.

-- 


Reply via email to