The LZ77 variation used in deflate is quoted at the bottom (copied from RFC 1951). I tried to translate it into J using a hash function that interprets XYZ as the lower 3 bytes of a 4-byte integer. My J implementation works, but its efficiency is rather poor. I think the inefficiency comes from the need, for every byte, to

1. build a vector of the pre-computed hashes of size equal to the sliding window (even though the hashes of all positions have been pre-computed once), and
2. find the last index of the hash of the current XYZ in that vector (since the left argument changes, special code of the form m&i: does not apply).
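In outline, the two steps amount to something like this (a simplified sketch, not my actual code; hash3 and dist are placeholder names):

   NB. hash of every 3-byte sequence XYZ, read as a base-256 integer
   hash3 =: 3 : '3 (256&#.)\ a. i. y'

   NB. x: the precomputed hash vector; y: the current position.
   NB. Result: distance back to the nearest prior position in the 32K
   NB. window whose hash equals the hash at y, or 0 if there is none.
   dist =: 4 : 0
    w =. (0 >. y - 32768) }. y {. x    NB. hashes inside the window
    i =. (|. w) i. y { x               NB. index in the reversal = last occurrence
    (i < # w) * i + 1
   )

Every call to dist builds and scans a window of up to 32K items, so the whole pass is O(32768 * n) in the worst case, which matches the slow timing I see.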
If the data are truly random, the chance of a match is very small, so the time spent should be roughly the sum of steps 1 and 2 above for every byte; even so, it took a very long time for an input of only about 500KB. Any ideas for a better J implementation?

NB. For deflate the minimum match length is 3, the maximum match length is 258 (285 is the largest length code, not a length), and the maximum distance is 32K. E.g. LZ77 compression of the input 500#'abc' should be (untested)

   a(258,1)(241,1)b(258,1)(241,1)c(258,1)(241,1)

where each bracketed pair is a (length,distance) pair applied to the history buffer. One literal per run suffices, because the hash of the first 'aaa' is already in the table when the second byte is examined, and an overlapping copy is allowed.

The text from RFC 1951:

The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.

The compressor searches the hash chains starting with the most recent strings, to favor small distances and thus take advantage of the Huffman encoding. The hash chains are singly linked. There are no deletions from the hash chains; the algorithm simply discards matches that are too old. To avoid a worst-case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a run-time parameter.
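As far as I can see, a direct translation of this scheme into J would look something like the untested sketch below. head holds the most recent position for each hash value, and prev links each position to the previous one with the same hash; together they form the singly linked chains. Because the hash is the full 24-bit value of XYZ there are no collisions, at the cost of a 2^24-entry table (128MB of integers). To keep the sketch short it tries only the head of each chain, i.e. the most recent occurrence, rather than walking the chain for the longest match:

   lz77 =: 3 : 0
    b =. a. i. y
    n =. # b
    h =. 3 (256&#.)\ b                  NB. collision-free 24-bit hashes
    head =. (2 ^ 24) # _1               NB. _1 marks an empty chain
    prev =. n # _1
    r =. ''                             NB. boxed literals and pairs
    p =. 0
    while. p < n do.
     q =. _1
     if. p < n - 2 do. q =. (p { h) { head end.
     if. (q >: 0) *. q >: p - 32768 do. NB. match inside the 32K window
      len =. 3                          NB. 3 bytes guaranteed by the hash
      while. (len < 258) *. (p + len) < n do.
       if. ((q + len) { b) ~: (p + len) { b do. break. end.
       len =. len + 1                   NB. overlapping copies extend too
      end.
      r =. r , < len , p - q            NB. emit a (length,distance) pair
     else.
      len =. 1
      r =. r , < p { y                  NB. emit a literal byte
     end.
     for_k. p + i. len do.              NB. enter every covered position
      if. k < n - 2 do.
       prev =. ((k { h) { head) k } prev
       head =. k (k { h) } head
      end.
     end.
     p =. p + len
    end.
    r
   )

Each byte now costs one table lookup and two amends instead of a 32K scan. The explicit while. loop is still interpreted, though, which is where I would hope a more array-oriented formulation exists.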

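Applied to the example above, the sketch should produce nine tokens:

   # lz77 500 # 'abc'
9
   NB. i.e. a (258,1) (241,1) b (258,1) (241,1) c (258,1) (241,1)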