I can try taking a look at the code, one of these days, if you want me to. (Not right this second though.)
Thanks, -- Raul On Thu, Sep 18, 2014 at 5:45 AM, bill lam <[email protected]> wrote: > Thanks Raul, > > I followed you suggestion to use 2 generations of hash vector, > Its speed is faster now although still far behind zlib.so. > > Ср, 17 сен 2014, Raul Miller написал(а): >> I would ignore the stuff about "hashes" and "hash chains" and "singly >> linked". Hashes are nice sometimes but other times they're just a >> disaster. Sometimes that's better than the alternative, but their >> marketing is sometimes significantly better than their actual >> behavior. >> >> Anyways, what's really happening here is an exact match search and >> some kind of constraint on the size of the table. >> >> I think the best way to approach that in J is to search in sorted >> lists. The problem here is the incremental update. You don't want to >> re-sort a large list every time you add something to the list. >> >> The way to approach that, I think, is to have several generations of >> lists, with each larger than the last, and each slower to be update >> than the last. >> >> So, for example, have on list which is unsorted. Use i. to look things >> up in it, and merge it into the next list when it gets too long (maybe >> when it has 30 items in it). >> >> The next list gets sorted every time it gets stuff added to it. Use I. >> to look things up in it. Merge it into the next list when it gets too >> long (maybe when it has 1000 items in it). >> >> Same for the next list, but this one, when it gets too long, you >> arbitrarily discard items from it. >> >> Now the other thing that's happening, here, is that J doesn't really >> do all that well with incremental update algorithms. You'll still be >> slow if you do things "one group of three" at a time. But - >> hypothetically at least - you don't have to. The ideal would be to >> come up with an algorithm for a block of text which complies (at least >> in terms of data structures) with the rfc and which behaves reasonably >> close to the behavior of the rfc described behavior. >> >> But having a slow implementation that works is a really important >> step. That (with some test cases - like the one you described here) >> gives you a baseline to test against. >> >> Hopefully this helps? >> >> Thanks, >> >> -- >> Raul >> >> >> >> >> On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote: >> > The LZ77 variation used in deflate is at the bottom (copied from rfc >> > 1951). I tried to translate it into J using a hash function which >> > interprets XYZ as the lower 3 bytes of a 4 byte integer. My J >> > implementation worked but efficient is rather poor. I think the >> > inefficiency of my implementation came from the need to >> > 1. build a vector of pre-computed hash of size equal to sliding >> > window. (although the hash of all bytes had been pre-computed) >> > 2. find the last index of the hash of current XYZ in the vector. >> > (since the left argument changes, special code of the form m&i: does not >> > apply) >> > >> > If data are truly random, the chance of matching would be very small, >> > thus the time spent should roughly the sum of the above 1 & 2 for >> > every byte, however it took a very long time even for data size of >> > only about 500KB. Any idea for a better J implementation. >> > >> > NB. for deflate, minimum length for match is 3, max length is 285, >> > maximum distance is 32K. >> > >> > eg. LZ77 compression of the input 500#'abc' >> > is >> > aa(285,1)(213,1)b(285,1)(214,1)c(285,1)(214,1) >> > >> > inside each bracket is (length,distance) pair applied to the history >> > buffer. Untested but should be something like this. >> > >> > (text from rfc) >> > The compressor uses a chained hash table to find duplicated strings, >> > using a hash function that operates on 3-byte sequences. At any >> > given point during compression, let XYZ be the next 3 input bytes to >> > be examined (not necessarily all different, of course). First, the >> > compressor examines the hash chain for XYZ. If the chain is empty, >> > the compressor simply writes out X as a literal byte and advances one >> > byte in the input. If the hash chain is not empty, indicating that >> > the sequence XYZ (or, if we are unlucky, some other 3 bytes with the >> > same hash function value) has occurred recently, the compressor >> > compares all strings on the XYZ hash chain with the actual input data >> > sequence starting at the current point, and selects the longest >> > match. >> > >> > The compressor searches the hash chains starting with the most recent >> > strings, to favor small distances and thus take advantage of the >> > Huffman encoding. The hash chains are singly linked. There are no >> > deletions from the hash chains; the algorithm simply discards matches >> > that are too old. To avoid a worst-case situation, very long hash >> > chains are arbitrarily truncated at a certain length, determined by a >> > run-time parameter. >> > ---------------------------------------------------------------------- >> > For information about J forums see http://www.jsoftware.com/forums.htm >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm > > -- > regards, > ==================================================== > GPG key 1024D/4434BAB3 2008-08-24 > gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3 > gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3 > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
