On Sat, 12 Aug 2023 at 09:30, kekronbekron <[email protected]> wrote: > > From the thread... > > "The algorithm is reasonably well documented, and the encapsulation is not > complex. And > as I said, the patents have expired."
That was me writing, 15 years ago. > Well-documented where? The *algorithm* is in the patent, but I'm not sure that IBM's terse/deterse/TRSMAIN/AMATERSE etc. implementation of LZW is fully publically documented. But there is enough out there to write your own by starting with the patent and then experimenting, as several people have done. Some more snippets of stuff I wrote on lists and private emails around the same time: The pointer http://marknelson.us/1989/10/01/lzw-data-compression/ is a good one. (This is now 404, but available on the Wayback Machine). The Wikipedia articles on LZW are also useful, and they have some tables that help to get the idea across. Who the "W" is in LZW is open to debate. Most people say it's Welch, but IBM would say it's Wegman. There is some discussion in the comp.compression newsgroup on this by Wegman himself. It seems both Ws invented roughly the same thing at roughly the same time, and somehow both patents managed to be issued (which says a lot about the US patent system). Regardless, both patents are now long expired. The IBM US patent (Wegman - not Welch) is 4814746, available at Google patents, or http://www.freepatentsonline.com/4814746.html or of course any other source for US patents. It includes a sample PL/I program that claims to implement the invention, but it is full of errors and omissions, has some dead code, and has been OCR'd from something in a manner I can only believe was intended to make it difficult to copy. Nonetheless I re-OCR'd and corrected it to the point of readibility, but not compileability. Another place to look for assembler source for LZW is the VMARC program. Google something like "vmarc" "lzw" to get started. There's a bunch of discussion on IBM-VM going back many years. [...] I have spent some time looking at the output created by these programs, and it is certainly LZW of a sort, although there are two quite different implementations (the PACK and SPACK options, where SPACK generally outperforms PACK). I have done the basic analysis of the headers, and some of the actual decompression algorithms (based just on data - *not* disassembling IBM code). [...] I was able to write a mostly complete decoder in OO REXX, but that was, again, around 15 years ago, and I've forgotten most of the details. I can try to dig up what I have, but it'll take a while. The layout of a tersed file is very simple - there is a 12-byte* header, followed by a stream of 12-bit (i.e. 1 1/2 bytes) "symbols", and terminated by a symbol of zero. Then there is typicallly a trailer, but it seems to contain nothing necessary or even useful in decoding the symbol stream. Notably neither the header nor trailer contains the length of the data stream. (This makes sense in the context of smart compressing modems, where the uncompressed data stream is effectively infinite and hence unknowable, and which is where I believe IBM and Unisys made their money on their respective patents.) (* For Flag1 = 01 below, the header is shorter. This is pretty much "raw" LZW[egman], with no concept of records, blocks, and similar MVS-like stuff.) For a tersed PDS[E] there is a directory structure that is part of the compressed data stream, and precedes the member data. The header is roughly like this (from my days of experimenting in an old email that I was able to dig up just now with a quick search): XL1 Flag1; 01 = unknown 02 = PACK, 05 = SPACK, 07 = PDS[E]/PACK, 09 = PDS[E]/SPACK XL1 Flag2; 00 = fixed?, 01 = variable? It's 01 for V and U. XL2 RecordLen; For VB it seems to be 4 less than expected (no RDW?) XL1 Flag3; 04 = F or V, 0C = FB or VB, 1C = FBS, 4C=FBA/FBM, 84=U, 8C for load module PDS XL1 Unknown1; usually 00, but 01 for a load module PDS XL2 BlockSize; For VB it does match the expected. XL4 Unknown2; Always seen as zero. Flag3 meaning? 80 RECFM U 40 Carriage control=A 20 Carriage control=M 10 RECFM S 08 RECFM B (but also found with U!) 04 always on 02 never on 01 never on As Charles just said, the algorithm is incredibly cool! But in practice these days there are much better performing algorithms in wide use. Cheers... Tony H. ---------------------------------------------------------------------- For IBM-MAIN subscribe / signoff / archive access instructions, send email to [email protected] with the message: INFO IBM-MAIN
