Re: [Jprogramming] LZ77

Raul Miller Thu, 18 Sep 2014 02:56:59 -0700

I can try taking a look at the code, one of these days, if you want me to.

(Not right this second though.)


Thanks,

-- 
Raul


On Thu, Sep 18, 2014 at 5:45 AM, bill lam <[email protected]> wrote:
> Thanks Raul,
>
> I followed you suggestion to use 2 generations of hash vector,
> Its speed is faster now although still far behind zlib.so.
>
> Ср, 17 сен 2014, Raul Miller написал(а):
>> I would ignore the stuff about "hashes" and "hash chains" and "singly
>> linked". Hashes are nice sometimes but other times they're just a
>> disaster. Sometimes that's better than the alternative, but their
>> marketing is sometimes significantly better than their actual
>> behavior.
>>
>> Anyways, what's really happening here is an exact match search and
>> some kind of constraint on the size of the table.
>>
>> I think the best way to approach that in J is to search in sorted
>> lists. The problem here is the incremental update. You don't want to
>> re-sort a large list every time you add something to the list.
>>
>> The way to approach that, I think, is to have several generations of
>> lists, with each larger than the last, and each slower to be update
>> than the last.
>>
>> So, for example, have on list which is unsorted. Use i. to look things
>> up in it, and merge it into the next list when it gets too long (maybe
>> when it has 30 items in it).
>>
>> The next list gets sorted every time it gets stuff added to it. Use I.
>> to look things up in it. Merge it into the next list when it gets too
>> long (maybe when it has 1000 items in it).
>>
>> Same for the next list, but this one, when it gets too long, you
>> arbitrarily discard items from it.
>>
>> Now the other thing that's happening, here, is that J doesn't really
>> do all that well with incremental update algorithms. You'll still be
>> slow if you do things "one group of three" at a time. But -
>> hypothetically at least - you don't have to. The ideal would be to
>> come up with an algorithm for a block of text which complies (at least
>> in terms of data structures) with the rfc and which behaves reasonably
>> close to the behavior of the rfc described behavior.
>>
>> But having a slow implementation that works is a really important
>> step. That (with some test cases - like the one you described here)
>> gives you a baseline to test against.
>>
>> Hopefully this helps?
>>
>> Thanks,
>>
>> --
>> Raul
>>
>>
>>
>>
>> On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote:
>> > The LZ77 variation used in deflate is at the bottom (copied from rfc
>> > 1951).  I tried to translate it into J using a hash function which
>> > interprets XYZ as the lower 3 bytes of a 4 byte integer.  My J
>> > implementation worked but efficient is rather poor.  I think the
>> > inefficiency of my implementation came from the need to
>> > 1. build a vector of pre-computed hash of size equal to sliding
>> > window. (although the hash of all bytes had been pre-computed)
>> > 2. find the last index of the hash of current XYZ in the vector.
>> > (since the left argument changes, special code of the form m&i: does not 
>> > apply)
>> >
>> > If data are truly random, the chance of matching would be very small,
>> > thus the time spent should roughly the sum of the above 1 & 2 for
>> > every byte, however it took a very long time even for data size of
>> > only about 500KB.  Any idea for a better J implementation.
>> >
>> > NB. for deflate, minimum length for match is 3, max length is 285,
>> > maximum distance is 32K.
>> >
>> > eg. LZ77 compression of the input 500#'abc'
>> > is
>> > aa(285,1)(213,1)b(285,1)(214,1)c(285,1)(214,1)
>> >
>> > inside each bracket is (length,distance) pair applied to the  history
>> > buffer. Untested but should be something like this.
>> >
>> > (text from rfc)
>> > The compressor uses a chained hash table to find duplicated strings,
>> >    using a hash function that operates on 3-byte sequences.  At any
>> >    given point during compression, let XYZ be the next 3 input bytes to
>> >    be examined (not necessarily all different, of course).  First, the
>> >    compressor examines the hash chain for XYZ.  If the chain is empty,
>> >    the compressor simply writes out X as a literal byte and advances one
>> >    byte in the input.  If the hash chain is not empty, indicating that
>> >    the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
>> >    same hash function value) has occurred recently, the compressor
>> >    compares all strings on the XYZ hash chain with the actual input data
>> >    sequence starting at the current point, and selects the longest
>> >    match.
>> >
>> >    The compressor searches the hash chains starting with the most recent
>> >    strings, to favor small distances and thus take advantage of the
>> >    Huffman encoding.  The hash chains are singly linked. There are no
>> >    deletions from the hash chains; the algorithm simply discards matches
>> >    that are too old.  To avoid a worst-case situation, very long hash
>> >    chains are arbitrarily truncated at a certain length, determined by a
>> >    run-time parameter.
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>
> --
> regards,
> ====================================================
> GPG key 1024D/4434BAB3 2008-08-24
> gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
> gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] LZ77

Reply via email to