The LZ77 variation used in deflate is quoted at the bottom (copied from RFC 1951). I tried to translate it into J using a hash function that interprets XYZ as the lower 3 bytes of a 4-byte integer. My J implementation works, but its efficiency is rather poor. I think the inefficiency comes from the need, for every byte, to

1. build a vector of the pre-computed hashes of size equal to the sliding window (even though the hashes of all positions have been pre-computed once), and
2. find the last index of the hash of the current XYZ in that vector (since the left argument changes, special code of the form m&i: does not apply).
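In outline, the two steps amount to something like this (a simplified sketch, not my actual code; hash3 and dist are placeholder names):

   NB. hash of every 3-byte sequence XYZ, read as a base-256 integer
   hash3 =: 3 : '3 (256&#.)\ a. i. y'

   NB. x: the precomputed hash vector; y: the current position.
   NB. Result: distance back to the nearest prior position in the 32K
   NB. window whose hash equals the hash at y, or 0 if there is none.
   dist =: 4 : 0
    w =. (0 >. y - 32768) }. y {. x    NB. hashes inside the window
    i =. (|. w) i. y { x               NB. index in the reversal = last occurrence
    (i < # w) * i + 1
   )

Every call to dist builds and scans a window of up to 32K items, so the whole pass is O(32768 * n) in the worst case, which matches the slow timing I see.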
If the data are truly random, the chance of a match is very small, so the time spent should be roughly the sum of steps 1 and 2 above for every byte; even so, it took a very long time for an input of only about 500KB. Any ideas for a better J implementation?

NB. For deflate the minimum match length is 3, the maximum match length is 258 (285 is the largest length code, not a length), and the maximum distance is 32K. E.g. LZ77 compression of the input 500#'abc' should be (untested)

   a(258,1)(241,1)b(258,1)(241,1)c(258,1)(241,1)

where each bracketed pair is a (length,distance) pair applied to the history buffer. One literal per run suffices, because the hash of the first 'aaa' is already in the table when the second byte is examined, and an overlapping copy is allowed.

The text from RFC 1951:

The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.

The compressor searches the hash chains starting with the most recent strings, to favor small distances and thus take advantage of the Huffman encoding. The hash chains are singly linked. There are no deletions from the hash chains; the algorithm simply discards matches that are too old. To avoid a worst-case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a run-time parameter.
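As far as I can see, a direct translation of this scheme into J would look something like the untested sketch below. head holds the most recent position for each hash value, and prev links each position to the previous one with the same hash; together they form the singly linked chains. Because the hash is the full 24-bit value of XYZ there are no collisions, at the cost of a 2^24-entry table (128MB of integers). To keep the sketch short it tries only the head of each chain, i.e. the most recent occurrence, rather than walking the chain for the longest match:

   lz77 =: 3 : 0
    b =. a. i. y
    n =. # b
    h =. 3 (256&#.)\ b                  NB. collision-free 24-bit hashes
    head =. (2 ^ 24) # _1               NB. _1 marks an empty chain
    prev =. n # _1
    r =. ''                             NB. boxed literals and pairs
    p =. 0
    while. p < n do.
     q =. _1
     if. p < n - 2 do. q =. (p { h) { head end.
     if. (q >: 0) *. q >: p - 32768 do. NB. match inside the 32K window
      len =. 3                          NB. 3 bytes guaranteed by the hash
      while. (len < 258) *. (p + len) < n do.
       if. ((q + len) { b) ~: (p + len) { b do. break. end.
       len =. len + 1                   NB. overlapping copies extend too
      end.
      r =. r , < len , p - q            NB. emit a (length,distance) pair
     else.
      len =. 1
      r =. r , < p { y                  NB. emit a literal byte
     end.
     for_k. p + i. len do.              NB. enter every covered position
      if. k < n - 2 do.
       prev =. ((k { h) { head) k } prev
       head =. k (k { h) } head
      end.
     end.
     p =. p + len
    end.
    r
   )

Each byte now costs one table lookup and two amends instead of a 32K scan. The explicit while. loop is still interpreted, though, which is where I would hope a more array-oriented formulation exists.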

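Applied to the example above, the sketch should produce nine tokens:

   # lz77 500 # 'abc'
9
   NB. i.e. a (258,1) (241,1) b (258,1) (241,1) c (258,1) (241,1)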