More progress: previously I compressed LaboratoryOfTheCounties by converting the spreadsheet to 32-bit integers, transposing to column order, and compressing with the previous column as context. I developed fast and slow models using zpaq.
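For concreteness, here is a rough Python sketch of that transform. It assumes the spreadsheet has already been exported as an all-numeric CSV (the name counties.csv and the rounding rule are placeholders, not the exact preprocessing I used):

# Sketch: read a numeric CSV, convert to 32-bit integers, and write the
# values in column-major (transposed) order for compression.
# "counties.csv" and the simple rounding are placeholders; the real data
# needs more careful field handling.
import csv
import numpy as np

rows = []
with open("counties.csv") as f:
    for record in csv.reader(f):
        rows.append([int(round(float(v))) for v in record])

a = np.array(rows, dtype=np.int32)   # shape: (num_rows, num_cols)
a.T.tofile("x")                      # column-major stream of 32-bit ints

# Inverse: np.fromfile("x", dtype=np.int32).reshape(num_cols, num_rows).T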
23,369,801 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq (73s)
23,918,910 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (27s)

Row 1 is statistics for the US. Subsequent rows are states, each followed by the counties in that state. An obvious improvement is to predict the state statistics by adding up the county statistics and subtracting the sum. Some of the statistics are averages rather than sums, but it is easy to tell which by comparing the US statistics in row 1 with the first county of the first state in row 3. If the US value is less than 8 times the county value, I assume the statistic is a state average and don't change it. I could also subtract the states from the US, but didn't bother because it is only 1 row and I need that information to undo the change for decompression.

22,924,112 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
23,474,401 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

Next I sorted the rows by population, because many of the other statistics are proportional to it. After some experiments, I got the best result sorting by 2001 population in ascending order, which is in the middle of the range from 1980 to 2010. The reverse transform sorts on column 0, which consists of 2-digit state codes and 3-digit county codes.

Rows sorted by 2001 resident population (col 42)
22,343,190 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,835,523 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

I tried encoding each column by subtracting the previous column, since most adjacent columns are highly correlated, for example 2001 population followed by 2002 population. But this actually made compression worse with the 2-D context models used above, while improving compression with standard compressors.

Subtract previous column
25,437,311 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
25,859,716 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
26,877,561 x-ms7c0.4i2.2.4m.zpaq
27,913,916 x.pmm
29,635,981 x.7z
30,174,140 x-b100m.bsc
31,549,227 x.bz2
38,420,807 x-9.zip
84,760,704 x

Subtract previous column in rows 2+ if increasing in row 1
25,075,543 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
28,604,577 x.7z

Subtract if close in row 1
24,498,885 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/16)
24,532,441 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/8)
24,563,025 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/4)
24,638,619 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/2)

None of these beat doing nothing, so I abandoned that approach and instead sorted the columns by mutual information, approximated by the bit length of the difference. This is the best result I have found so far. Looking at the sorted order, about 90% of the time the best match was already the adjacent column reading forward or backward.

Rows sorted by 2001 population and columns sorted by log(|x-y|+1)
21,880,993 rcsort-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,404,289 rcsort-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

By sorting, I mean: pick as the next column the one that takes the fewest bits to represent the difference from the current column. It's a greedy approximation to the traveling salesman problem, where you always go to the closest city you haven't visited yet. But even using a fast integer approximation of the log function, it takes about 30 minutes to compare every pair of columns. I can speed that up to about 2 minutes by sampling every 16 rows.
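In code, the greedy column ordering looks roughly like this. It is a Python sketch under some assumptions: the data is already an integer matrix in memory, np.log2 on |x-y|+1 stands in for the fast integer log approximation I actually used, and starting from column 0 is an arbitrary choice.

# Greedy nearest-neighbor ordering of columns: repeatedly append the
# unvisited column whose difference from the current column costs the
# fewest bits, with cost approximated by sum of log2(|x - y| + 1)
# over every 16th row.
import numpy as np

def column_order(a, step=16):
    s = a[::step].astype(np.int64)        # sample rows to speed up the O(cols^2) scan
    ncols = s.shape[1]

    def cost(i, j):                       # approximate bits to code column j from column i
        return float(np.sum(np.log2(np.abs(s[:, i] - s[:, j]) + 1.0)))

    order = [0]
    unused = set(range(1, ncols))
    while unused:                         # always visit the closest unvisited "city"
        cur = order[-1]
        nxt = min(unused, key=lambda j: cost(cur, j))
        order.append(nxt)
        unused.remove(nxt)
    return order

# a2 = a[:, column_order(a)]   # presumably the permutation is also stored so
#                              # decompression can restore the original column order

The pairwise cost over all column pairs is what dominates the running time, which is why sampling rows makes such a large difference.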
Sample every 16 rows, 4-bit log approximation
21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

I did not attempt to normalize the data, but that may be the next thing I try.

On Tue, Aug 22, 2023 at 2:13 PM James Bowery <[email protected]> wrote:
> It's interesting that Wikipedia doesn't have a specific "History" section
> for factor analysis describing its origin. Instead they have a bunch of
> sections on various applications of factor analysis that discuss the history
> of its application to that specialty. Spearman committed the crime of
> discovering the g-factor of intelligence, the denial of which causing
> catastrophic damage to humanity because such denial is a globally imposed
> religious belief that racial group differences in outcome are not
> significantly influenced by heritable aspects of intelligence.

I don't think you can determine that from the data. The correlations between race, education, income, and crime are well known and probably present in this data, but that says nothing about whether the cause is genetics or environment. To show that random variable X affects Y, you need P(X) and P(Y). But the data only gives you the joint probability P(X, Y).
