On Thu, Aug 24, 2023 at 10:35 AM James Bowery <[email protected]> wrote:
> On Wed, Aug 23, 2023 at 9:02 PM Matt Mahoney <[email protected]> wrote:
>>
>> More progress: Previously I compressed LaboratoryOfTheCounties by
>> converting the spreadsheet to 32 bit integers, transposing to column
>> order, and compressing using the previous column as context. I
>> developed a fast and slow model using zpaq.
>>
>> 23,369,801 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq (73s)
>> 23,918,910 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (27s)
>
>
> The details are a bit mysterious to me.  It might be clearer to use the words 
> "cases" and "variables" rather than "rows" and "columns".  Does "transposing 
> to column order" result in "variable major order" after you have sorted 
> variables based on their mean values to enhance compression as in "previous 
> variable as context"?

Yes. Rows are cases and columns are variables. I only transpose them
in the encoder output and decoder input.

> A technical question on zpaq's ability to achieve what I assume amounts to 
> delta-coding for compression:
>
> Is zpaq, in effect, learning to parse a bit string in groups of 32 bits and 
> then performing arithmetic operations on the sequence of those groups treated 
> as integers to produce the deltas?  That's pretty impressive if so.

No delta coding yet. I am compressing in 2 stages. The first stage, a
C++ program I am still writing, converts the input to a form that is
simpler to compress with a context model. The second stage uses zpaq
with the advanced compression options to describe a 2 dimensional
context model. I last updated zpaq in 2016.

The first stage converts all the text values to 32 bit integers (big
endian, MSB first), subtracts the counties from the state totals,
sorts the rows (cases) by 2001 population (column 42), sorts the
columns (variables) by information distance (log(|x-y|+1), i.e. the
bit length of the difference), and then writes the file by columns
instead of rows. The zpaq model uses combinations of the file position
mod 4 (2 bits), the last 4 bytes of the current and previous values,
and the last 2 bytes of the previous column but current row as
context. The context is hashed and used to look up a bit history,
which is mapped by a table to a probability that the next bit is 1.
These predictions are mixed together using ISSE chains and mixers
(single neurons), passed through an SSE (a table mapping an order 0
context and a prediction to a new prediction), and mixed once more.

Decompression is the reverse. zpaq makes the same predictions and uses
them to decode the file. The decoder then reads the decoded data back
in column order, sorts the rows and columns on the first column and
first row (which are sorted in the original file), adds the county
data back to the states, and converts back to a text file.
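
A toy illustration of why no permutation table needs to be stored
(made-up data; the same trick re-sorts columns on row 0):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // The original file's rows are already sorted by column 0 (the
  // state/county code), so the decoder undoes the encoder's
  // population sort by simply re-sorting on column 0.
  int main() {
    std::vector<std::vector<uint32_t>> t = {  // rows in encoder order
        {1002, 200, 7}, {1001, 500, 30}, {1003, 900, 55}};
    std::sort(t.begin(), t.end(),
              [](const std::vector<uint32_t>& a,
                 const std::vector<uint32_t>& b) { return a[0] < b[0]; });
    return 0;  // t is back in original (column 0) order
  }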

> I suppose converting to integer requires scaling data that has fractional 
> values, which means that to recover the original data requires retaining the 
> scaling factors for those data.  So you're retaining that factor for each 
> variable, right?

The encoder makes 2 passes: the first determines the maximum number of
decimal places (0, 1, or 2), and the second multiplies by 10 or 100 as
needed to convert to integers. Column 0, containing the 2 digit state
code and 3 digit county code, is stored as an int. The top row
variable names each have one of 51 3-letter codes followed by a 6
digit number. I convert the code to a number 1 to 51 followed by the 6
digits, which requires 26 bits. Since I have some bits to spare, I
store the format (2 bits) and the low 3 bits of row 1 (US totals)
there, because some of those numbers require up to 35 bits. The list
of 51 3-letter codes will add about 150 bytes to the compressed size
as a self extracting archive.
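
The bit layout might look something like this toy packing (the exact
field order is a guess):

  #include <cstdint>
  #include <cstdio>

  // Pack a column header into 32 bits: a code index 1..51 followed by
  // a 6 digit number fits in 26 bits (51,999,999 < 2^26), leaving
  // room for 2 format bits and the low 3 bits of the US totals value.
  uint32_t pack(uint32_t code, uint32_t digits, uint32_t fmt, uint32_t low3) {
    uint32_t name = code * 1000000 + digits;  // 1..51 followed by 6 digits
    return (name << 5) | (fmt << 3) | low3;   // 26 + 2 + 3 = 31 bits
  }

  int main() {
    printf("%08x\n", pack(17, 123456, 2, 5));  // made-up values
    return 0;
  }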

>> Sample every 16 rows, 4 bit log approx
>> 21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
>> 22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
>
> That's a compression ratio of  4.2 to 1.  Pretty impressive given the lack of 
> any explicit use of statistical analysis techniques normally used in deriving 
> relationships.

Ideally, all the prediction should be done in the first stage so the
compressor only needs an adaptive order 0 model. Subtracting adjacent
rows without sorting made compression worse, but I think a better
predictor using linear combinations of other variables will be the way
to go (a sketch follows the results below). Meanwhile I was playing
with the context model and got some improved results using the last 3
variables as context.

21,647,015 
x-ms7c0.4.255i2.3c0.4.13795.255.255.i1.1.3c0.4.26591.255.255i1.2.3c0.4.13795.255.26591.255i2.2amst.zpaq
21,766,492 
x-ms7c0.4.255i2.3c0.4.13795.255.255.i1.1.3c0.4.26591.255.255i1.2.3amst.zpaq
21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
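
To make the linear combination idea concrete, a sketch of residual
coding with a hypothetical fixed-weight predictor (in practice the
weights would have to be fit per column):

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // Predict variable x as a linear combination of variables y and z
  // and code only the residual, which should be small if the
  // prediction is good. Weights and data are made up.
  int main() {
    std::vector<int64_t> x = {100, 210, 330}, y = {50, 100, 160},
                         z = {10, 20, 35};
    const double a = 2.0, b = 0.5;  // hypothetical weights
    for (size_t i = 0; i < x.size(); ++i) {
      int64_t pred = (int64_t)(a * y[i] + b * z[i] + 0.5);
      printf("%lld\n", (long long)(x[i] - pred));  // residual to code
    }
    return 0;
  }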

The contexts for the best result are:
c0.4.255i2.3 = last 1, 3, and 6 bytes using an ICM-ISSE chain.
Contexts implicitly also include the byte position within the int (2
bits) and the previously coded bits of the partially coded current
byte in predicting the next bit.
c0.4.13795.255.255.i1.1.3 = The current and previous byte in the
previous column, mixed with the last 0, 1, 2, and 5 bytes in the
current column.
c0.4.26591.255.255i1.2.3 = The current and previous bytes 2 columns
back, mixed with the last 0, 1, 3, and 6 bytes.
c0.4.13795.255.26591.255i2.2 = The current bytes in the first and
third columns back, mixed with the last 0, 2, and 4 bytes.
amst = a match model, mixer, SSE (order 0 probability adjustment
table), and a 2 input mixer.
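
For background, the "single neuron" mixer update works roughly like
this generic PAQ/ZPAQ-style sketch (not the exact zpaq code):

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Logistic mixing: combine predictions in the stretched (logit)
  // domain, then nudge each weight in proportion to the prediction
  // error after the bit is seen.
  double stretch(double p) { return std::log(p / (1 - p)); }
  double squash(double x) { return 1 / (1 + std::exp(-x)); }

  int main() {
    std::vector<double> w = {0.3, 0.3, 0.3};  // mixer weights
    std::vector<double> p = {0.9, 0.6, 0.2};  // model predictions P(bit=1)
    double sum = 0;
    for (size_t i = 0; i < w.size(); ++i) sum += w[i] * stretch(p[i]);
    double pr = squash(sum);                  // mixed prediction
    int bit = 1;                              // the coded bit
    const double lr = 0.002;                  // learning rate
    for (size_t i = 0; i < w.size(); ++i)
      w[i] += lr * stretch(p[i]) * (bit - pr);
    printf("mixed p = %.4f\n", pr);
    return 0;
  }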

The advanced options are described on the man page at
http://mattmahoney.net/dc/zpaq.html and the archive format in the
specification.
zpaq encodes a description of the decompression algorithm in the
archive header. The description consists of a list of components like
ICM, ISSE, and mixers, a program to compute the context hashes, and a
postprocessor program to reverse the conversion of an external
encoder. Both programs would be written in ZPAQL, an assembler-like
language for a virtual machine. But it is a lot easier to write the
decoder in C++. I have written LZ77 and BWT decoders in ZPAQL, but
those were fairly simple. I will probably write a specialized
compressor and use libzpaq with a custom model to compress internally.
The advanced options automatically generate ZPAQL code, but I may have
to write my own to use better contexts. I would like to use just the
leading bits of neighboring values, but that will take custom ZPAQL
code.
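
For the internal route, the libzpaq glue is small. A minimal sketch,
assuming the zpaq 7.x libzpaq API (the method string "2" is only a
placeholder for a custom model string, and the application must
define libzpaq::error):

  #include <cstdio>
  #include <cstdlib>
  #include "libzpaq.h"

  // libzpaq calls this on any failure; the application must define it.
  void libzpaq::error(const char* msg) {
    fprintf(stderr, "libzpaq error: %s\n", msg);
    exit(1);
  }

  // Byte at a time adapters over stdio.
  struct In: libzpaq::Reader {
    FILE* f;
    int get() { return getc(f); }  // returns -1 at EOF
  };
  struct Out: libzpaq::Writer {
    FILE* f;
    void put(int c) { putc(c, f); }
  };

  int main() {
    In in;   in.f = fopen("x", "rb");
    Out out; out.f = fopen("x.zpaq", "wb");
    if (!in.f || !out.f) return 1;
    libzpaq::compress(&in, &out, "2");  // "2" stands in for a custom model
    fclose(in.f);
    fclose(out.f);
    return 0;
  }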

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T30092c5d8380b42f-Mc806d732d180d55900230b7e