Previously I ran some compression tests on LaboratoryOfTheCountiesUncompressed.csv from https://github.com/jabowery/HumesGuillotine/tree/master/LaboratoryOfTheCounties to establish a baseline. It is a spreadsheet with 3199 rows representing US counties and 6624 columns of demographic, economic, and crime statistics, stored as a text file of decimal numbers separated by tabs and newlines.
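As a quick sanity check of that layout, a sketch like the following (Python, using nothing beyond the file name and format described above) counts the rows and tab-separated fields:

  # Count rows and tab-separated fields per row in the file described above.
  rows = 0
  widths = set()
  with open("LaboratoryOfTheCountiesUncompressed.csv") as f:
      for line in f:
          rows += 1
          widths.add(len(line.rstrip("\n").split("\t")))
  print(rows, "rows; distinct field counts per row:", widths)
  # Expect roughly 3199 county rows (plus any header row) and 6624 fields per row.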
The best result, posted previously, was 27.0 MB using paq8o after about an hour. zpaq -m57 took about 3 minutes to reach 28.3 MB. I also tested ppmonstr, bcm, ppmd, 7zip, bzip2, rar, and zip, all of which ran in a couple of minutes or less. Sizes are below, with the original file last:

27,077,896 LaboratoryOfTheCountiesUncompressed.csv.paq8o
28,322,300 LaboratoryOfTheCountiesUncompressed-m57.zpaq
28,449,825 LaboratoryOfTheCountiesUncompressed-m5.zpaq
28,741,625 LaboratoryOfTheCountiesUncompressed.pmm
29,521,520 LaboratoryOfTheCountiesUncompressed-b100m.bcm
30,380,751 LaboratoryOfTheCountiesUncompressed-m4.zpaq
30,380,751 LaboratoryOfTheCountiesUncompressed-m3.zpaq
33,305,581 LaboratoryOfTheCountiesUncompressed-m256-o16-r1.pmd
33,311,991 LaboratoryOfTheCountiesUncompressed.csv.7z
34,559,264 LaboratoryOfTheCountiesUncompressed.csv.bz2
36,253,433 LaboratoryOfTheCountiesUncompressed-m5.rar
38,504,064 LaboratoryOfTheCountiesUncompressed-9.zip
40,647,091 LaboratoryOfTheCountiesUncompressed-m2.zpaq
43,370,210 LaboratoryOfTheCountiesUncompressed-m1.zpaq
91,360,518 LaboratoryOfTheCountiesUncompressed.csv

paq8o and zpaq (both of which I wrote) at max compression (-m57 for zpaq) are context mixing compressors that predict one bit at a time and arithmetically code the bits. Most of the compression comes from ICM-ISSE chains with increasing context lengths. An ICM (indirect context model) maps a context hash to a bit history, an 8-bit state representing bounded bit counts and the last bit. The state is mapped by a table to a probability that the next bit is a 1, and the entry is updated when the bit is known. An ISSE (indirect secondary symbol estimator) maps a longer context to a bit history, then uses that state to select the two weights of a mixer that averages the input prediction with a constant 1, and updates the weights to improve the prediction. It is a small neural network with one 2-input neuron that computes a weighted sum in the logistic domain, x = ln(p/(1-p)), then squashes the output back to (0,1) as p = 1/(1 + exp(-x)).
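Here is a rough floating point sketch of that 2-input logistic mixer. The real ICM/ISSE code in zpaq and paq8 uses fixed point arithmetic, clamped tables, and different scaling, so the learning rate and initial weights below are illustrative only:

  import math

  def stretch(p):                 # to the logistic domain: x = ln(p/(1-p))
      return math.log(p / (1.0 - p))

  def squash(x):                  # back to a probability in (0,1)
      return 1.0 / (1.0 + math.exp(-x))

  # One 2-input neuron: mix the stretched input prediction with a constant 1,
  # then nudge the two weights toward the observed bit.  Initial weights pass
  # the input prediction through unchanged.  The rate 0.02 is arbitrary.
  w = [1.0, 0.0]

  def mix_and_update(p_in, bit, rate=0.02):
      inputs = [stretch(p_in), 1.0]
      p = squash(w[0]*inputs[0] + w[1]*inputs[1])
      err = bit - p               # prediction error; the update reduces coding cost
      w[0] += rate * err * inputs[0]
      w[1] += rate * err * inputs[1]
      return p

In the real ISSE the weight pair is selected by the bit history of the longer context, so each history learns its own adjustment to the shorter context's prediction.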
paq8o auto-detects periodic 2-D structure, such as in images and databases, combines 2-D contexts to the left and up using more ICM-ISSE chains, and then mixes everything together with a larger neural network. But this does not work well with the variable length cells in a spreadsheet like this file. So I wrote a program to convert all the cells to 4-byte integers, LSB first. For columns with a decimal format, I multiplied by 10 or 100 to obtain integers. Testing with zpaq, ppmonstr, bcm, 7zip, bzip2, and zip:

As 32-bit ints, LSB first (including encoded headers)
26,485,274 lbin-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
27,051,444 lbin.pmm
27,107,298 lbin-m57.zpaq
27,446,322 lbin-b100m.bcm
30,579,168 lbin.7z
32,538,726 lbin.bz2
41,043,250 lbin-9.zip
84,760,704 lbin

The best result was zpaq with a custom model given by the long option shown in the file name. It describes the components of a context mixing model:

- m  the compression model follows.
- s  streaming mode (no dedupe or incremental update).
- 7  use 2^7 MB blocks so the whole file fits in one block. Otherwise the file is split into smaller blocks that are compressed independently in different threads.
- c0  select an ICM.
- .4  the context is the position in the file mod 4, i.e. the byte position within an integer.
- i2.2.4  an ISSE chain taking as context the hashes of the last 2, 4, and 8 bytes. The first number is the starting context length and each following number is how much to extend it.
- c0.4.27495.255  another ICM using the position mod 4 and the byte 26496 bytes back (6624 columns x 4 bytes), which is the byte being predicted in the previous row. The argument means skip 27495-1000 bytes and AND the next byte with 255 (0xFF).
- i1.3  extends that ICM with an ISSE chain using hashes of the last 1 and 4 bytes as context.
- a  a match model. It searches for long context matches and predicts whatever bit followed the match.
- m  mix all the predictions of the previous components using a neural network.
- s  secondary symbol estimator. The quantized prediction and a bytewise order 0 context (the bits of the current byte already coded) index a 2-D table of predictions with interpolation. The closest entry is updated to improve future predictions.
- t  mix the last 2 components. This is the final output, which is arithmetic coded.

The result was a 0.6 MB improvement in 80 seconds. Next I tried saving in big endian format (MSB first), which improved compression by another 0.2 MB.

MSB first
26,227,245 lbig-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
27,012,457 lbig.pmm
29,906,616 lbig.7z
41,563,166 lbig-9.zip
84,760,704 lbig

I experimented with saving the low byte in base 250 instead of 256. That way, any value that is a multiple of 1000 has its low 10 bits all zero. But there was not much gain, so I decided not to use this approach.

MSB first, low byte in base 250
26,221,176 l250-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
27,112,414 l250.pmm
29,923,226 l250.7z
41,559,128 l250-9.zip
84,760,704 l250

Then I transposed the rows and columns and hit the jackpot with a 2.7 MB improvement.

MSB first, transposed rows and columns
23,584,783 lt-ms7c0.4i2.2.4c0.4.13795.255i1.3amst.zpaq
28,982,658 lt-m57.zpaq
29,946,408 lt.pmm
84,760,704 lt

The option c0.4.13795.255 skips back to the previous column instead of the previous row (3199 rows x 4 bytes = 12796 bytes back) but is otherwise the same. I hadn't written the decoder yet, though, so the good compression could still have been caused by a bug.

There are a lot of annoying details to make the decoder work. The first row and column are strings (lexicographically sorted, which means I can reorder them and have the decoder sort them). Some of the numbers in row 1 need 35 bits to store, but I had 5 bits to spare in row 0 (the headers), so I used that space to save the 3 low bits plus 2 bits to encode the format (0-2 decimal places). The decoder has to trim off trailing decimal zeros, and there are a few entries in scientific notation where that is shorter, like 3e+05 instead of 300000. When I got it all working, the results were almost the same.

Debugged decoder
23,585,311 x-ms7c0.4i2.2.4c0.4.13795.255i1.3amst.zpaq
23,617,589 x-ms7c0.4i2.2.4c0.4.13795.255i1.3am.zpaq
23,862,238 x-ms7c0.4i2.2.4c0.4.13795.255i1.3m.zpaq
24,008,898 x-ms7c0.4i2.2.4c0.4.13795.255i1m.zpaq
24,536,086 x-ms7c0.4i2.2.4c0.4.13795.255m.zpaq
28,374,590 x.7z
29,564,353 x-ms7c0.4i2.2.4m.zpaq
29,628,740 x-ms7c0.4i2.2.4.zpaq
29,937,953 x.pmm
29,938,573 x-ms7c0.4i2.2.zpaq
30,500,770 x-ms7c0.4i3.zpaq
30,585,036 x-ms7c0.4i2.zpaq
30,984,462 x-b100m.bsc
31,358,926 x-ms7c0.4.zpaq
34,618,512 x.bz2
40,197,235 x-ms7c0.zpaq
41,443,814 x-9.zip
84,760,704 x

In this test I tried removing the zpaq components one at a time. Most of the compression comes from using the previous column as context. Similar columns are grouped, like 1990 population followed by 2000 population, which is what makes these contexts useful. Obviously there are other things to try, like sorting rows and columns by mutual information, or predicting cells from previously coded cells and coding the difference. Stay tuned.
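To make that last idea concrete, here is a minimal sketch of column-delta preprocessing on the transposed 4-byte layout. This is untested and purely illustrative: the row and column counts come from the file described above, the string headers and the 35-bit cells are ignored, and the wraparound subtraction is just one arbitrary way to keep the transform reversible.

  import struct

  ROWS, COLS = 3199, 6624   # counts from the file; header encoding is ignored here

  def delta_encode(raw):
      # raw: the transposed cell data as big-endian 32-bit ints, stored
      # column-major (COLS groups of ROWS values, 84,760,704 bytes total).
      # Replace each cell with its difference from the same row in the
      # previous column, mod 2^32, so a decoder can add it back.
      vals = list(struct.unpack(">%dI" % (ROWS * COLS), raw))
      out = vals[:]
      for c in range(1, COLS):
          for r in range(ROWS):
              i = c * ROWS + r
              out[i] = (vals[i] - vals[i - ROWS]) & 0xFFFFFFFF
      return struct.pack(">%dI" % (ROWS * COLS), *out)

The decoder would run the same loops in order, adding the previous column back in. Whether this beats letting the context model learn the correlation on its own is an open question; it only helps if the differences are cheaper to model than the raw values.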
--
Matt Mahoney, [email protected]
