Previously I ran some compression tests on
LaboratoryOfTheCountiesUncompressed.csv from
https://github.com/jabowery/HumesGuillotine/tree/master/LaboratoryOfTheCounties
to establish a baseline. This is a spreadsheet with 3199 rows
representing US counties and 6624 columns representing demographic,
economic, and crime statistics, stored as a text file of decimal
numbers separated by tabs and newlines.

The best result, posted previously, was 27.0 MB using paq8o after
about an hour. zpaq -m57 took about 3 minutes to reach 28.3 MB. I also
tested ppmonstr, bcm, ppmd, 7zip, bzip2, rar and zip, all of which ran
in a couple of minutes or less.

Original file
27,077,896 LaboratoryOfTheCountiesUncompressed.csv.paq8o
28,322,300 LaboratoryOfTheCountiesUncompressed-m57.zpaq
28,449,825 LaboratoryOfTheCountiesUncompressed-m5.zpaq
28,741,625 LaboratoryOfTheCountiesUncompressed.pmm
29,521,520 LaboratoryOfTheCountiesUncompressed-b100m.bcm
30,380,751 LaboratoryOfTheCountiesUncompressed-m4.zpaq
30,380,751 LaboratoryOfTheCountiesUncompressed-m3.zpaq
33,305,581 LaboratoryOfTheCountiesUncompressed-m256-o16-r1.pmd
33,311,991 LaboratoryOfTheCountiesUncompressed.csv.7z
34,559,264 LaboratoryOfTheCountiesUncompressed.csv.bz2
36,253,433 LaboratoryOfTheCountiesUncompressed-m5.rar
38,504,064 LaboratoryOfTheCountiesUncompressed-9.zip
40,647,091 LaboratoryOfTheCountiesUncompressed-m2.zpaq
43,370,210 LaboratoryOfTheCountiesUncompressed-m1.zpaq
91,360,518 LaboratoryOfTheCountiesUncompressed.csv

paq8o and zpaq (both of which I wrote) with max compression (-m57) are
context mixing compressors that predict one bit at a time and
arithmetic code the bits. Most of the compression comes from ICM-ISSE
chains with increasing context lengths. An ICM (indirect context
model) maps a context hash to a bit history, an 8 bit state
representing bounded bit counts and the last bit. The state is mapped
by a table to a probability that the next bit is 1 and updated when
the bit is known. An ISSE (indirect secondary symbol estimator) maps a
longer context to a bit history, then uses that state to select the 2
weights for a mixer that averages the prediction with a constant 1,
then updates the weights to improve the prediction. It is a small
neural network with one 2-input neuron which performs a weighted sum
in the logistic domain, x = ln(p/(1-p)), then squashes the output back
to (0,1) as p = 1/(1 + exp(-x)).
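
As a rough illustration, here is what one ISSE step looks like in the
logistic domain. This is my own sketch in Python, not the zpaq code;
the initial weights and learning rate are assumptions.

import math

def stretch(p):
    # logistic domain: x = ln(p/(1-p))
    return math.log(p / (1.0 - p))

def squash(x):
    # back to a probability in (0,1): p = 1/(1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

class ISSE:
    # Sketch of one ISSE node. The bit history selected by the longer
    # context picks a weight pair for a 2-input mixer over the
    # stretched input prediction and a constant 1.
    def __init__(self, rate=0.02):   # learning rate is my assumption
        self.weights = {}            # bit history -> [w0, w1]
        self.rate = rate

    def predict(self, history, p_in):
        w = self.weights.setdefault(history, [1.0, 0.0])
        return squash(w[0] * stretch(p_in) + w[1])

    def update(self, history, p_in, bit):
        # nudge the selected weights toward the bit that actually occurred
        w = self.weights.setdefault(history, [1.0, 0.0])
        x = stretch(p_in)
        err = bit - squash(w[0] * x + w[1])
        w[0] += self.rate * err * x
        w[1] += self.rate * err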

paq8o auto detects periodic 2-D structure, such as in images and
databases, and combines 2-D contexts to the left and above using more
ICM-ISSE chains, then mixes everything together with a larger neural
network. But this does not work well with the variable length cells
in a spreadsheet like this file. So I wrote a program to convert all
the cells to 4 byte integers, LSB first (a sketch of the conversion
follows the table below). For columns with a decimal format, I
multiplied by 10 or 100 to obtain integers. Testing with zpaq,
ppmonstr, bcm, 7zip, bzip2, and zip:

As 32 bit ints, LSB first (including encoded headers)
26,485,274 lbin-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
27,051,444 lbin.pmm
27,107,298 lbin-m57.zpaq
27,446,322 lbin-b100m.bcm
30,579,168 lbin.7z
32,538,726 lbin.bz2
41,043,250 lbin-9.zip
84,760,704 lbin
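
The conversion described above looks roughly like this. It is a
sketch, not the actual converter: the real program decides the scale
per column rather than per cell, and also has to handle the header row
and column, the few scientific notation entries, and the handful of
values that need more than 32 bits.

import struct

def pack_cells(rows, byteorder="<"):
    # rows: the spreadsheet as a list of rows of numeric strings.
    # Each cell becomes a 4 byte signed int; "<" = LSB first, ">" = MSB first.
    out = bytearray()
    for row in rows:
        for cell in row:
            if "." in cell:
                # scale cells with 1 or 2 decimal places up to integers
                decimals = len(cell.split(".")[1])
                value = round(float(cell) * 10 ** decimals)
            else:
                value = int(cell)
            out += struct.pack(byteorder + "i", value)
    return bytes(out)

# pack_cells(rows) gives the LSB-first file; pack_cells(rows, ">") the MSB-first one.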

The best result was zpaq with a custom model given by the long option
shown in the file name. The option describes the components of a
context mixing model:

-m compression model.
- s streaming mode (no dedupe or incremental update).
- 7 use 2^7 MB blocks so the whole file fits in one block. Otherwise
the file is split into smaller blocks that are compressed
independently in different threads.
- c0 select an ICM
- .4 the context is the position in the file mod 4, the byte position
within an integer.
- i2.2.4 an ISSE chain taking as context the hashes of the last 2,
last 4, and last 8 bytes. The first number is the starting context
length and each later number is how many bytes to add to the previous
length.
- c0.4.27495.255 Another ICM using context position mod 4 and the byte
26496 bytes back, which is the same byte position in the previous row
(6624 columns x 4 bytes = 26496 bytes per row; see the sketch after
this list). It means skip 27495-1000 = 26495 bytes, then AND the next
byte with 255 (0xFF).
- i1.3 extends the ICM with an ISSE chain using hashes of the last 1
and 4 bytes as context.
- a is a match model. It searches for long context matches and
predicts whatever bit followed in the match.
- m Mix all the predictions of the previous components using a neural network.
- s Secondary symbol estimator. The quantized prediction and a
bytewise order 0 context (the bits of the current byte already coded)
index a 2-D table of predictions with interpolation. The closest entry
is updated to improve predictions.
- t Mix the last 2 components. This is the final output, which is
arithmetic coded.
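
For reference, the 27495 in the option is just the row stride minus
one plus the 1000 offset that marks a skip. A quick check
(row_context_arg is my own helper, not part of zpaq):

def row_context_arg(cells_per_row, cell_bytes=4):
    # zpaq skip syntax: 1000 + bytes to skip; the context byte is the
    # one just past the skip, i.e. cells_per_row * cell_bytes back.
    return 1000 + cells_per_row * cell_bytes - 1

print(row_context_arg(6624))   # 27495, one row of 6624 4-byte cells back
print(row_context_arg(3199))   # 13795, used later for the transposed file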

The result was a 0.6 MB improvement in 80 seconds. Next I tried saving
in big endian format (MSB first), which improved the result by about
another 0.26 MB.

MSB first
26,227,245 lbig-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
27,012,457 lbig.pmm
29,906,616 lbig.7z
41,563,166 lbig-9.zip
84,760,704 lbig
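
The only change here is the byte order within each 4 byte cell, e.g.:

import struct
print(struct.pack("<i", 300000).hex())   # e0930400  LSB first
print(struct.pack(">i", 300000).hex())   # 000493e0  MSB first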

I experimented with saving the low byte in base 250 instead of 256.
Since 1000 = 4 x 250, any value that is a multiple of 1000 then has
its low 10 bits all zero: the low byte is zero and the quotient is
divisible by 4. But there was not much gain, so I decided not to use
this approach.

MSB first, low byte in base 250
26,221,176 l250-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
27,112,414 l250.pmm
29,923,226 l250.7z
41,559,128 l250-9.zip
84,760,704 l250
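
A sketch of this encoding (ignoring negative values; the exact byte
layout is my assumption):

def encode_base250(n):
    # upper 3 bytes hold n // 250, the low byte holds n % 250
    q, r = divmod(n, 250)
    return q.to_bytes(3, "big") + bytes([r])

print(encode_base250(57000).hex())   # 0000e400: low byte 0, quotient 228 = 4*57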

Then I transposed the rows and columns and hit the jackpot with a 2.7
MB improvement.

MSB first, transpose rows and columns
23,584,783 lt-ms7c0.4i2.2.4c0.4.13795.255i1.3amst.zpaq
28,982,658 lt-m57.zpaq
29,946,408 lt.pmm
84,760,704 lt
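
Transposing is a small change to the converter: flip the cell matrix
before packing, so that cells from the same original column become
adjacent in the file. Something like this, reusing the hypothetical
pack_cells from the earlier sketch:

def transpose(rows):
    # 3199 rows x 6624 columns -> 6624 rows x 3199 columns
    return [list(col) for col in zip(*rows)]

# transposed, MSB-first binary file:
# data = pack_cells(transpose(rows), byteorder=">")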

The option c0.4.13795.255 skips back to the previous column of the
original spreadsheet instead of the previous row (3199 rows x 4 bytes
= 12796 bytes back), but is otherwise the same. At that point I hadn't
written the decoder yet, so there still might have been a bug causing
the good compression. There are a lot of annoying details to make the decoder
work. The first row and column are strings (lexicographically sorted,
which means I can reorder them and have the decoder sort them). Some
of the numbers in row 1 need 35 bits to store, 3 more than a 4 byte
cell holds, but I had 5 bits to spare in row 0 (headers), so I used
that space to save the 3 low bits plus 2 bits to encode the format
(0-2 decimal places). The decoder has
to trim off trailing decimal zeros, and there are a few entries in
scientific notation when it is shorter, like 3e+05 instead of 300000.
When I got it all working, the results were almost the same.
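
Most of the fiddly decoder work is turning each integer and its 2 bit
format code back into the exact original text, for example along these
lines (my own sketch, not the actual decoder):

def format_cell(value, decimals):
    # decimals: 0, 1 or 2 decimal places from the 2 bit format code
    if decimals:
        text = "%.*f" % (decimals, value / 10 ** decimals)
        return text.rstrip("0").rstrip(".")   # trim trailing decimal zeros
    text = str(value)
    sci = "%.0e" % value                      # e.g. 300000 -> 3e+05
    if float(sci) == value and len(sci) < len(text):
        return sci                            # scientific notation when shorter
    return text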

Debugged decoder
23,585,311 x-ms7c0.4i2.2.4c0.4.13795.255i1.3amst.zpaq
23,617,589 x-ms7c0.4i2.2.4c0.4.13795.255i1.3am.zpaq
23,862,238 x-ms7c0.4i2.2.4c0.4.13795.255i1.3m.zpaq
24,008,898 x-ms7c0.4i2.2.4c0.4.13795.255i1m.zpaq
24,536,086 x-ms7c0.4i2.2.4c0.4.13795.255m.zpaq
28,374,590 x.7z
29,564,353 x-ms7c0.4i2.2.4m.zpaq
29,628,740 x-ms7c0.4i2.2.4.zpaq
29,937,953 x.pmm
29,938,573 x-ms7c0.4i2.2.zpaq
30,500,770 x-ms7c0.4i3.zpaq
30,585,036 x-ms7c0.4i2.zpaq
30,984,462 x-b100m.bsc
31,358,926 x-ms7c0.4.zpaq
34,618,512 x.bz2
40,197,235 x-ms7c0.zpaq
41,443,814 x-9.zip
84,760,704 x

In this test I tried removing the zpaq components one at a time. Most
of the compression comes from using the previous column as context.
Similar columns are grouped, like 1990 population followed by 2000
population, which is what makes these contexts useful. Obviously there
are other things to try, like sorting rows and columns by mutual
information, or predicting cells from previously coded cells and
coding the difference. Stay tuned.

-- 
-- Matt Mahoney, [email protected]
