On Thu, Aug 24, 2023 at 10:35 AM James Bowery <[email protected]> wrote:

> On Wed, Aug 23, 2023 at 9:02 PM Matt Mahoney <[email protected]> wrote:
>>
>> More progress: Previously I compressed LaboratoryOfTheCounties by
>> converting the spreadsheet to 32 bit integers, transposing to column
>> order, and compressing using the previous column as context. I
>> developed a fast and a slow model using zpaq.
>>
>> 23,369,801 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq (73s)
>> 23,918,910 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (27s)
>
> The details are a bit mysterious to me. It might be clearer to use the
> words "cases" and "variables" rather than "rows" and "columns". Does
> "transposing to column order" result in "variable major order" after
> you have sorted variables based on their mean values to enhance
> compression, as in "previous variable as context"?
Yes. Rows are cases and columns are variables. I only transpose them in
the encoder output and decoder input.

> A technical question on zpaq's ability to achieve what I assume
> amounts to delta-coding for compression:
>
> Is zpaq, in effect, learning to parse a bit string in groups of 32
> bits and then performing arithmetic operations on the sequence of
> those groups treated as integers to produce the deltas? That's pretty
> impressive if so.

No delta coding yet. I am compressing in 2 stages. The first stage
converts the input to a form that is simpler to compress with a context
model, using a C++ program I am still writing. The second stage uses
zpaq to compress, using the advanced compression options to describe a
2 dimensional context model. I last updated zpaq in 2016.

The first stage converts all the text values to 32 bit integers (big
endian, MSB first), subtracts the counties from the state totals, sorts
the rows (cases) by 2001 population (column 42), sorts the columns
(variables) by information distance (log2(|x-y|+1), i.e. the bit length
of the difference), then writes the file by columns instead of rows. A
sketch of the column ordering is below.

The zpaq model uses combinations of the file position mod 4 (2 bits),
the last 4 bytes of the current and previous values, and the last 2
bytes of the previous column but current row as context. Each context
is hashed and used to look up a bit history, which is mapped by a table
to a probability that the next bit is 1. These predictions are mixed
together using ISSE chains and mixers (single neurons), passed through
an SSE (a table mapping an order 0 context and a prediction to a new
prediction), and mixed once more. A sketch of the mixing idea is also
below.

Decompression is the reverse. zpaq makes the same predictions and uses
them to decode the file. Then the decoder reads back from the file in
column order, sorts the rows and columns on the first column and first
row (which are sorted in the original file), adds the county data back
to the states, and converts back to a text file.

> I suppose converting to integer requires scaling data that has
> fractional values, which means that to recover the original data
> requires retaining the scaling factors for those data. So you're
> retaining that factor for each variable, right?

The encoder makes 2 passes, first to determine the maximum number of
decimal places (0, 1, or 2), and then it multiplies by 10 or 100 to
convert to integers (sketched below). Column 0, containing the 2 digit
state code and 3 digit county code, is stored as an int. The top row
variable names each have one of 51 3 letter codes followed by a 6 digit
number. I convert the code to a number 1 to 51 followed by the 6
digits. This requires 26 bits. Since I have some bits to spare, I store
the format (2 bits) and the low 3 bits of row 1 (the US totals) there,
because some of those numbers require up to 35 bits (see the packing
sketch below). The list of 51 3 letter codes will add about 150 bytes
to the compressed size as a self extracting archive.

>> Sample every 16 rows, 4 bit log approx
>> 21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
>> 22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
>
> That's a compression ratio of 4.2 to 1. Pretty impressive given the
> lack of any explicit use of statistical analysis techniques normally
> used in deriving relationships.

Ideally, all the prediction should be done in the first stage so the
compressor only needs an adaptive order 0 model. Subtracting adjacent
rows without sorting made compression worse, but I think a better
predictor using linear combinations of other variables will be the way
to go.
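To make the column sort concrete, here is a rough C++ sketch. It is a
simplification, not the actual encoder: the distance is as described
above (sampled every 16 rows), but the greedy nearest neighbor chaining
is just one reasonable way to turn pairwise distances into an ordering,
and all the names are illustrative.

  #include <cmath>
  #include <cstdint>
  #include <vector>

  typedef std::vector<int32_t> Column;

  // Bit length of the difference, log2(|x-y|+1), sampled every 16 rows.
  double dist(const Column& x, const Column& y) {
    double d = 0;
    for (size_t i = 0; i < x.size() && i < y.size(); i += 16)
      d += std::log2(std::fabs(double(x[i]) - double(y[i])) + 1);
    return d;
  }

  // Chain each column after its nearest remaining neighbor so that the
  // previous column is a good context for the current one.
  std::vector<int> orderColumns(const std::vector<Column>& cols) {
    int n = cols.size();
    std::vector<int> order(1, 0);     // column 0 (state/county codes) first
    std::vector<bool> used(n, false);
    used[0] = true;
    for (int k = 1; k < n; ++k) {
      int best = -1;
      double bestd = 1e300;
      for (int j = 0; j < n; ++j) {
        if (used[j]) continue;
        double d = dist(cols[order.back()], cols[j]);
        if (d < bestd) { bestd = d; best = j; }
      }
      used[best] = true;
      order.push_back(best);
    }
    return order;
  }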
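For anyone unfamiliar with how zpaq combines predictions, the single
neuron mixer idea looks like this in floating point. zpaq itself uses
fixed point arithmetic, clamped stretched values, and weight sets
selected by context, so treat this only as an illustration of the
technique:

  #include <cmath>
  #include <vector>

  double squash(double x)  { return 1 / (1 + std::exp(-x)); } // logistic
  double stretch(double p) { return std::log(p / (1 - p)); }  // inverse

  struct Mixer {
    std::vector<double> w;   // one weight per input model, start at 0
    std::vector<double> st;  // stretched inputs, saved for the update
    explicit Mixer(int n) : w(n, 0), st(n) {}

    // p[i] = P(next bit = 1) from model i, assumed strictly in (0,1).
    double mix(const std::vector<double>& p) {
      double dot = 0;
      for (size_t i = 0; i < w.size(); ++i) {
        st[i] = stretch(p[i]);
        dot += w[i] * st[i];
      }
      return squash(dot);
    }

    // After coding one bit, move the weights along the gradient that
    // reduces the coding cost, -log(p(bit)).
    void update(double p, int bit, double rate = 0.002) {
      double err = bit - p;
      for (size_t i = 0; i < w.size(); ++i) w[i] += rate * err * st[i];
    }
  };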
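The 2 pass scaling itself is simple. A sketch (simplified; the real
encoder also has to treat the header row and column 0 specially):

  #include <algorithm>
  #include <cstdint>
  #include <string>
  #include <vector>

  // Pass 1: digits after the decimal point, capped at 2.
  int decimalPlaces(const std::string& s) {
    size_t dot = s.find('.');
    if (dot == std::string::npos) return 0;
    return (int)std::min<size_t>(2, s.size() - dot - 1);
  }

  // The per-column format (0, 1, or 2) goes in the 2 header bits.
  int columnFormat(const std::vector<std::string>& col) {
    int places = 0;
    for (size_t i = 0; i < col.size(); ++i)
      places = std::max(places, decimalPlaces(col[i]));
    return places;
  }

  // Pass 2: "123.4" -> 1234 when the column format is 1 decimal place.
  int32_t toScaledInt(const std::string& s, int places) {
    int64_t v = 0;
    int frac = -1;
    bool neg = false;
    for (size_t i = 0; i < s.size(); ++i) {
      char c = s[i];
      if (c == '-') neg = true;
      else if (c == '.') frac = 0;
      else if (c >= '0' && c <= '9') {
        v = v * 10 + (c - '0');
        if (frac >= 0) ++frac;
      }
    }
    if (frac < 0) frac = 0;
    while (frac++ < places) v *= 10;    // pad to the column's format
    return int32_t(neg ? -v : v);
  }

  // Results are written as 32 bit integers, big endian, MSB first.
  void put32(uint32_t x, std::vector<uint8_t>& out) {
    out.push_back(x >> 24); out.push_back(x >> 16);
    out.push_back(x >> 8);  out.push_back(x);
  }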
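And the header packing looks roughly like this. Only the field sizes
(26 + 2 + 3 bits) come from the description above; the exact bit layout
in this sketch is illustrative:

  #include <cstdint>
  #include <cstdlib>
  #include <string>
  #include <vector>

  // Pack one column header, e.g. "ABC123456", into 32 bits.
  // Bits 0-25:  code (1..51) * 1000000 + 6 digit number
  //             (max 51,999,999 < 2^26)
  // Bits 26-27: decimal format (0, 1, or 2)
  // Bits 28-30: low 3 bits of the US totals row
  uint32_t packHeader(const std::string& name,
                      const std::vector<std::string>& codes, // 51 codes
                      uint32_t format,
                      uint32_t usTotalLow3) {
    uint32_t code = 0;
    for (size_t i = 0; i < codes.size(); ++i)
      if (name.compare(0, 3, codes[i]) == 0) { code = i + 1; break; }
    uint32_t num = (uint32_t)atoi(name.c_str() + 3);  // the 6 digits
    return (code * 1000000 + num)
         | (format & 3) << 26
         | (usTotalLow3 & 7) << 28;
  }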
Meanwhile I was playing with the context model and got some improved
results using the last 3 variables as context.

21,647,015 x-ms7c0.4.255i2.3c0.4.13795.255.255.i1.1.3c0.4.26591.255.255i1.2.3c0.4.13795.255.26591.255i2.2amst.zpaq
21,766,492 x-ms7c0.4.255i2.3c0.4.13795.255.255.i1.1.3c0.4.26591.255.255i1.2.3amst.zpaq
21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

The contexts for the best result are:

c0.4.255i2.3 = the last 1, 3, and 6 bytes using an ICM-ISSE chain.
Contexts implicitly also include the byte position within the int (2
bits) and the previously coded bits of the partially coded current byte
in predicting the next bit.

c0.4.13795.255.255.i1.1.3 = the current and previous bytes in the
previous column, mixed with the last 0, 1, 2, and 5 bytes in the
current column.

c0.4.26591.255.255i1.2.3 = the current and previous bytes 2 columns
back, mixed with the last 0, 1, 3, and 6 bytes.

c0.4.13795.255.26591.255i2.2 = the current bytes in the first and third
columns back, mixed with the last 0, 2, and 4 bytes.

amst = a match model, a mixer, an SSE (order 0 probability adjustment
table), and a 2 input mixer.

The advanced options are described on the man page at
http://mattmahoney.net/dc/zpaq.html, and the archive format in the
specification. zpaq encodes a description of the decompression
algorithm in the archive header. The description consists of a list of
components like ICMs, ISSEs, and mixers, a program to compute the
context hashes, and a postprocessor program to reverse the conversion
of an external encoder. Both programs would be written in ZPAQL, a
virtual machine with an assembler-like language. But it is a lot easier
to write the decoder in C++. I have written LZ77 and BWT decoders in
ZPAQL, but those were fairly simple.

I will probably write a specialized compressor and use libzpaq with a
custom model to compress internally. The advanced options automatically
generate ZPAQL code, but I may have to write my own to use better
contexts. I would like to use just the leading bits of neighboring
values, but that will take custom ZPAQL code. A minimal libzpaq driver
is sketched below.
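This sketch assumes the libzpaq interface from the zpaq distribution (a
compress() call taking a method string, and an application-supplied
error handler); check libzpaq.h for the exact signatures. The method
string is the -m argument encoded in the file names above:

  #include <cstdio>
  #include <cstdlib>
  #include "libzpaq.h"

  // libzpaq requires the application to supply the error handler.
  void libzpaq::error(const char* msg) {
    fprintf(stderr, "zpaq error: %s\n", msg);
    exit(1);
  }

  // Minimal file-based Reader/Writer for libzpaq.
  struct File : public libzpaq::Reader, public libzpaq::Writer {
    FILE* f;
    File(FILE* f_) : f(f_) {}
    int get() { return getc(f); }     // -1 at EOF, as libzpaq expects
    void put(int c) { putc(c, f); }
  };

  int main(int argc, char** argv) {
    if (argc != 3) return 1;
    File in(fopen(argv[1], "rb"));
    File out(fopen(argv[2], "wb"));
    // Streaming format with the 2 dimensional context model from the
    // best result above. Custom ZPAQL would replace this string.
    libzpaq::compress(&in, &out,
      "s7c0.4.255i2.3c0.4.13795.255.255.i1.1.3"
      "c0.4.26591.255.255i1.2.3c0.4.13795.255.26591.255i2.2amst");
    return 0;
  }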
--
Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T30092c5d8380b42f-Mc806d732d180d55900230b7e
Delivery options: https://agi.topicbox.com/groups/agi/subscription