On Sat, 12 Aug 2023 at 09:30, kekronbekron
<[email protected]> wrote:
>
> From the thread...
>
> "The algorithm is reasonably well documented, and the encapsulation is not 
> complex. And
> as I said, the patents have expired."

That was me writing, 15 years ago.

> Well-documented where?

The *algorithm* is in the patent, but I'm not sure that IBM's
terse/deterse/TRSMAIN/AMATERSE etc. implementation of LZW is fully
publically documented. But there is enough out there to write your own
by starting with the patent and then experimenting, as several people
have done.

Some more snippets of stuff I wrote on lists and private emails around
the same time:

The pointer http://marknelson.us/1989/10/01/lzw-data-compression/
is a good one. (This is now 404, but available on the Wayback Machine).
The Wikipedia articles on LZW are also useful, and they have some
tables that help to get
the idea across. Who the "W" is in LZW is open to debate. Most people say it's
Welch, but IBM would say it's Wegman. There is some discussion in the
comp.compression newsgroup on this by Wegman himself. It seems both Ws invented
roughly the same thing at roughly the same time, and somehow both
patents managed to be issued (which says a lot about the US patent
system). Regardless, both patents are now
long expired.

The IBM US patent (Wegman - not Welch)  is 4814746, available at
Google patents, or
http://www.freepatentsonline.com/4814746.html or of course any other
source for US patents.

It includes a sample PL/I program that claims to implement the
invention, but it is full of errors and omissions, has some dead code,
and has been OCR'd from something in a manner I can only believe was
intended to make it difficult to copy. Nonetheless I re-OCR'd and
corrected it to the point of readibility, but not compileability.

Another place to look for assembler source for LZW is the VMARC
program. Google something like "vmarc" "lzw" to get started. There's a
bunch of discussion on IBM-VM going back many years.
[...]
I have spent some time looking at the output created by these programs,
and it is certainly LZW of a sort, although there are two quite
different implementations (the PACK and SPACK options, where SPACK
generally outperforms PACK). I have done
the basic analysis of the headers, and some of the actual
decompression algorithms (based just on data - *not* disassembling IBM code).
[...]
I was able to write a mostly complete decoder in OO REXX, but that was, again,
around 15 years ago, and I've forgotten most of the details.

I can try to dig up what I have, but it'll take a while.

The layout of a tersed file is very simple - there is a 12-byte*
header, followed by a stream of 12-bit (i.e. 1 1/2 bytes) "symbols",
and terminated by a symbol of zero. Then there is typicallly a
trailer, but it seems to contain nothing necessary or even useful in
decoding the symbol stream. Notably neither the header nor trailer
contains the length of the data stream. (This makes sense in the
context of smart compressing modems, where the uncompressed data
stream is effectively infinite and hence unknowable, and which is
where I believe IBM and Unisys made their money on their respective
patents.)

(* For Flag1 = 01 below, the header is shorter. This is pretty much
"raw" LZW[egman], with no concept of records, blocks, and similar
MVS-like stuff.)

For a tersed PDS[E] there is a directory structure that is part of the
compressed data stream, and precedes the member data.

The header is roughly like this (from my days of experimenting in an
old email that I was able to dig up just now with a quick search):

XL1    Flag1;           01 = unknown 02 = PACK, 05 = SPACK, 07 =
PDS[E]/PACK, 09 = PDS[E]/SPACK
XL1    Flag2;           00 = fixed?, 01 = variable? It's 01 for V and U.
XL2    RecordLen;      For VB it seems to be 4 less than expected (no RDW?)
XL1    Flag3;           04 = F or V, 0C = FB or VB, 1C = FBS,
4C=FBA/FBM, 84=U, 8C for load module PDS
XL1    Unknown1;        usually 00, but 01 for a load module PDS
XL2    BlockSize;       For VB it does match the expected.
XL4    Unknown2;        Always seen as zero.

Flag3 meaning?
80    RECFM U
40    Carriage control=A
20    Carriage control=M
10    RECFM S
08    RECFM B (but also found with U!)
04    always on
02    never on
01    never on

As Charles just said, the algorithm is incredibly cool! But in
practice these days there are much better performing algorithms in
wide use.

Cheers... Tony H.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN

Reply via email to