I had to look up Hume's guillotine (I wonder how that name caught on, since he wasn't French). "The is–ought problem, as articulated by the Scottish philosopher and historian David Hume, arises when one makes claims about what ought to be that are based solely on statements about what is." This is fascinating; I hadn't known of it. At a cursory glance, though, I don't see the connection to compression. Can anyone explain?
On 8/19/23, Matt Mahoney <[email protected]> wrote:
> Previously I ran some compression tests on
> LaboratoryOfTheCountiesUncompressed.csv from
> https://github.com/jabowery/HumesGuillotine/tree/master/LaboratoryOfTheCounties
> to establish a baseline. This is a spreadsheet with 3199 rows
> representing US counties and 6624 columns representing demographic,
> economic, and crime statistics, stored as a text file of decimal
> integers separated by tabs and newlines.
>
> The best result, posted previously, was 27.0 MB using paq8o after
> about an hour. zpaq -m57 took about 3 minutes to reach 28.3 MB. I
> also tested ppmonstr, bcm, ppmd, 7zip, bzip2, rar, and zip, all of
> which ran in a couple of minutes or less.
>
> Original file
> 27,077,896  LaboratoryOfTheCountiesUncompressed.csv.paq8o
> 28,322,300  LaboratoryOfTheCountiesUncompressed-m57.zpaq
> 28,449,825  LaboratoryOfTheCountiesUncompressed-m5.zpaq
> 28,741,625  LaboratoryOfTheCountiesUncompressed.pmm
> 29,521,520  LaboratoryOfTheCountiesUncompressed-b100m.bcm
> 30,380,751  LaboratoryOfTheCountiesUncompressed-m4.zpaq
> 30,380,751  LaboratoryOfTheCountiesUncompressed-m3.zpaq
> 33,305,581  LaboratoryOfTheCountiesUncompressed-m256-o16-r1.pmd
> 33,311,991  LaboratoryOfTheCountiesUncompressed.csv.7z
> 34,559,264  LaboratoryOfTheCountiesUncompressed.csv.bz2
> 36,253,433  LaboratoryOfTheCountiesUncompressed-m5.rar
> 38,504,064  LaboratoryOfTheCountiesUncompressed-9.zip
> 40,647,091  LaboratoryOfTheCountiesUncompressed-m2.zpaq
> 43,370,210  LaboratoryOfTheCountiesUncompressed-m1.zpaq
> 91,360,518  LaboratoryOfTheCountiesUncompressed.csv
>
> paq8o and zpaq (both of which I wrote) with max compression (-m57)
> are context mixing compressors that predict one bit at a time and
> arithmetic code the bits. Most of the compression comes from
> ICM-ISSE chains with increasing context lengths. An ICM (indirect
> context model) maps a context hash to a bit history, an 8-bit state
> representing bounded bit counts and the last bit. A table maps the
> state to a probability that the next bit is 1, and the table entry
> is updated once the bit is known. An ISSE (indirect secondary
> symbol estimator) maps a longer context to a bit history, then uses
> that state to select the 2 weights for a mixer that averages the
> incoming prediction with a constant 1, then updates the weights to
> improve the prediction. It is a small neural network with one
> 2-input neuron which computes a weighted sum in the logistic
> domain, x = ln(p/(1-p)), then squashes the output back to (0,1) as
> p = 1/(1 + exp(-x)).
>
> paq8o auto-detects periodic 2-D structure, such as in images and
> databases, combines 2-D contexts to the left and up using more
> ICM-ISSE chains, and then mixes everything together with a larger
> neural network. But this does not work well with variable-length
> cells in a spreadsheet like this file. So I wrote a program to
> convert all the cells to 4-byte integers, LSB first. For columns
> with a decimal format, I multiplied by 10 or 100 to obtain
> integers. Testing with zpaq, ppmonstr, bcm, 7zip, bzip2, and zip:
>
> As 32-bit ints, LSB first (including encoded headers)
> 26,485,274  lbin-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
> 27,051,444  lbin.pmm
> 27,107,298  lbin-m57.zpaq
> 27,446,322  lbin-b100m.bcm
> 30,579,168  lbin.7z
> 32,538,726  lbin.bz2
> 41,043,250  lbin-9.zip
> 84,760,704  lbin
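
If I understand the cell conversion right, it amounts to something like this minimal Python sketch (my reconstruction, not Matt's actual program; per-column format detection and the oversized row-1 values he mentions later are not handled):

    import struct

    def encode_cell(text, decimals):
        # Scale a decimal-format cell by 10**decimals (0, 1, or 2 places)
        # and pack it as a 4-byte little-endian (LSB first) integer.
        value = round(float(text) * 10 ** decimals)
        return struct.pack('<i', value)   # assumes value fits in 32 bits

    def decode_cell(data, decimals):
        value = struct.unpack('<i', data)[0]
        return value / 10 ** decimals if decimals else value

The point of the fixed 4-byte cells is that byte position mod 4 becomes a meaningful context, which the zpaq model below exploits.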
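
Going back to the ICM-ISSE description: as I read it, the 2-input mixing step is essentially online logistic regression, something like this sketch (the learning rate and initial weights are my guesses, not zpaq's actual constants, and a real ISSE selects the weight pair by bit history):

    import math

    def stretch(p):
        return math.log(p / (1 - p))      # x = ln(p/(1-p))

    def squash(x):
        return 1 / (1 + math.exp(-x))     # p = 1/(1+exp(-x))

    class Mixer2:
        # One 2-input neuron: mixes an input prediction with a constant 1.
        def __init__(self):
            self.w = [1.0, 0.0]           # initial weights (assumed)

        def predict(self, p_in):
            self.inputs = [stretch(p_in), 1.0]
            return squash(sum(w * x for w, x in zip(self.w, self.inputs)))

        def update(self, p_out, bit, rate=0.02):   # rate is a guess
            err = bit - p_out             # gradient of the coding cost
            for i in range(2):
                self.w[i] += rate * err * self.inputs[i]

Chained, each ISSE stage refines the prediction of the stage before it.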
> The best result was zpaq with a custom model, using the long option
> shown. It describes the components in a context mixing model:
>
> -m  compression model.
> - s  streaming mode (no dedupe or incremental update).
> - 7  use 2^7 MB blocks so the whole file fits in one block.
>   Otherwise the file is split into smaller blocks that are
>   compressed independently in different threads.
> - c0  select an ICM.
> - .4  the context is the position in the file mod 4, i.e. the byte
>   position within an integer.
> - i2.2.4  an ISSE chain taking as context the hashes of the last 2,
>   last 4, and last 8 bytes. Each number says how much to extend the
>   hashed context length.
> - c0.4.27495.255  another ICM using the context position mod 4 and
>   the byte 26496 bytes back, which corresponds to the byte being
>   predicted in the previous row. The 27495 means skip 27495 - 1000 =
>   26495 bytes and AND the next byte with 255 (0xFF).
> - i1.3  extends that ICM with an ISSE chain using hashes of the
>   last 1 and last 4 bytes as context.
> - a  a match model. It searches for long context matches and
>   predicts whatever bit followed in the match.
> - m  mix all the predictions of the previous components using a
>   neural network.
> - s  a secondary symbol estimator. The quantized prediction and a
>   bytewise order-0 context (the bits of the current byte already
>   coded) index a 2-D table of predictions with interpolation. The
>   closest entry is updated to improve predictions.
> - t  mix the last 2 components. This is the final output, which is
>   arithmetic coded.
>
> The result was a 0.6 MB improvement in 80 seconds. Next I tried
> saving in big-endian format (MSB first), which improved compression
> by another 0.2 MB.
>
> MSB first
> 26,227,245  lbig-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
> 27,012,457  lbig.pmm
> 29,906,616  lbig.7z
> 41,563,166  lbig-9.zip
> 84,760,704  lbig
>
> I experimented with saving the low byte in base 250 instead of 256.
> This way, any value that is a multiple of 1000 has its low 10 bits
> all zero. But there was not much gain, so I decided not to use this
> approach.
>
> MSB first, low byte in base 250
> 26,221,176  l250-ms7c0.4i2.2.4c0.4.27495.255i1.3amst.zpaq
> 27,112,414  l250.pmm
> 29,923,226  l250.7z
> 41,559,128  l250-9.zip
> 84,760,704  l250
>
> Then I transposed the rows and columns and hit the jackpot with a
> 2.7 MB improvement.
>
> MSB first, transposed rows and columns
> 23,584,783  lt-ms7c0.4i2.2.4c0.4.13795.255i1.3amst.zpaq
> 28,982,658  lt-m57.zpaq
> 29,946,408  lt.pmm
> 84,760,704  lt
>
> The option c0.4.13795.255 skips back to the previous column instead
> of the previous row, but is otherwise the same. I hadn't written
> the decoder yet at that point, though, so there might still have
> been a bug causing the good compression. There are a lot of
> annoying details to make the decoder work. The first row and column
> are strings (lexicographically sorted, which means I can reorder
> them and have the decoder sort them). Some of the numbers in row 1
> need 35 bits to store, but I had 5 bits to spare in row 0 (the
> headers), so I used that space to save the 3 low bits plus 2 bits
> encoding the format (0-2 decimal places). The decoder has to trim
> trailing decimal zeros, and there are a few entries in scientific
> notation where that is shorter, like 3e+05 instead of 300000. When
> I got it all working, the results were almost the same.
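
A side note on the base-250 trick above, as I understand it: since 1000 = 4 * 250, a multiple of 1000 gets a zero low byte and two zero bits just above it. A minimal sketch, assuming non-negative values that fit in the 4 bytes:

    def encode_base250(value):
        # High bytes in base 256 (MSB first), low byte in base 250.
        q, low = divmod(value, 250)    # low is in [0, 250)
        return bytes([(q >> 16) & 255, (q >> 8) & 255, q & 255, low])

    def decode_base250(b):
        q = (b[0] << 16) | (b[1] << 8) | b[2]
        return q * 250 + b[3]

For value % 1000 == 0, low is 0 and q is divisible by 4, so the low 10 bits of the encoding are all zero, as Matt says.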
>
> Debugged decoder
> 23,585,311  x-ms7c0.4i2.2.4c0.4.13795.255i1.3amst.zpaq
> 23,617,589  x-ms7c0.4i2.2.4c0.4.13795.255i1.3am.zpaq
> 23,862,238  x-ms7c0.4i2.2.4c0.4.13795.255i1.3m.zpaq
> 24,008,898  x-ms7c0.4i2.2.4c0.4.13795.255i1m.zpaq
> 24,536,086  x-ms7c0.4i2.2.4c0.4.13795.255m.zpaq
> 28,374,590  x.7z
> 29,564,353  x-ms7c0.4i2.2.4m.zpaq
> 29,628,740  x-ms7c0.4i2.2.4.zpaq
> 29,937,953  x.pmm
> 29,938,573  x-ms7c0.4i2.2.zpaq
> 30,500,770  x-ms7c0.4i3.zpaq
> 30,585,036  x-ms7c0.4i2.zpaq
> 30,984,462  x-b100m.bsc
> 31,358,926  x-ms7c0.4.zpaq
> 34,618,512  x.bz2
> 40,197,235  x-ms7c0.zpaq
> 41,443,814  x-9.zip
> 84,760,704  x
>
> In this test I tried removing the zpaq components one at a time.
> Most of the compression comes from using the previous column as
> context. Similar columns are grouped, like 1990 population followed
> by 2000 population, which is what makes these contexts useful.
> Obviously there are other things to try, like sorting rows and
> columns by mutual information, or predicting cells from previously
> coded cells and coding the difference. Stay tuned.
>
> -- Matt Mahoney, [email protected]
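
The last idea, predicting cells from previously coded cells and coding the difference, seems easy to prototype. A sketch under my assumptions (columns already ordered so that similar ones are adjacent, values as plain Python ints; whether it helps depends on how correlated adjacent columns really are):

    def delta_encode(columns):
        # columns: list of equal-length integer lists. Each column after
        # the first is replaced by its elementwise difference from the
        # previous column, so correlated neighbors yield small values.
        out = [list(columns[0])]
        for prev, cur in zip(columns, columns[1:]):
            out.append([c - p for p, c in zip(prev, cur)])
        return out

    def delta_decode(encoded):
        # Invert delta_encode by cumulatively summing the differences.
        out = [list(encoded[0])]
        for diffs in encoded[1:]:
            out.append([p + d for p, d in zip(out[-1], diffs)])
        return out

Curious to see what the mutual-information sorting does.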
