More progress: Previously I compressed LaboratoryOfTheCounties by
converting the spreadsheet to 32-bit integers, transposing it to column
order, and compressing with the previous column as context. I developed
a fast and a slow model using zpaq.

23,369,801 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq (73s)
23,918,910 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (27s)
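
The column-order transform itself is just a transpose of the 32-bit
table. A minimal sketch of that step (the dimensions, input file name,
and lack of error handling are placeholders, not the actual code):

#include <cstdint>
#include <cstdio>
#include <vector>
using namespace std;

int main() {
  const size_t rows = 3200, cols = 6600;  // placeholder dimensions
  vector<int32_t> t(rows * cols);

  // Read the table in row-major order, one spreadsheet row after another.
  FILE* in = fopen("counties.int32", "rb");  // placeholder input name
  if (!in || fread(&t[0], sizeof(int32_t), t.size(), in) != t.size())
    return 1;
  fclose(in);

  // Write it back in column-major order.  Each value then sits exactly
  // rows*4 bytes after the corresponding value in the previous column,
  // a fixed offset that a context model can key on.
  FILE* out = fopen("x", "wb");
  for (size_t c = 0; c < cols; ++c)
    for (size_t r = 0; r < rows; ++r)
      fwrite(&t[r * cols + c], sizeof(int32_t), 1, out);
  fclose(out);
  return 0;
}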

Row 1 is statistics for the US. Subsequent rows are states, each
followed by the counties in that state. An obvious improvement is to
predict each state row as the sum of its county rows and store only the
difference. Some of the statistics are averages rather than sums, but
it is easy to tell which by comparing the US statistic in row 1 with
the first county of the first state in row 3. If the US value is less
than 8 times the county value, I assume the statistic is a state
average and don't change it. I could also subtract the state sums from
the US row, but didn't bother because it is only 1 row and the
decompressor needs the original row 1 to repeat the average-versus-sum
test when undoing the transform.

22,924,112 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
23,474,401 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
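
The transform is roughly the following sketch. How state rows are
detected from the column 0 codes is left as a parameter here, and the
8x test is applied per column as described above, so treat it as an
outline rather than the actual code.

#include <cstdint>
#include <cstdlib>
#include <vector>
using namespace std;

// table: the spreadsheet as 32-bit integers, row 0 = US, each state row
// followed by its county rows.  isState marks the state rows (in the
// real data this would come from the column 0 codes).
void predictStates(vector<vector<int32_t>>& table,
                   const vector<bool>& isState) {
  const int rows = int(table.size()), cols = int(table[0].size());
  for (int r = 1; r < rows; ++r) {
    if (!isState[r]) continue;

    // Sum the county rows that follow this state row.
    vector<int64_t> sum(cols, 0);
    for (int i = r + 1; i < rows && !isState[i]; ++i)
      for (int c = 0; c < cols; ++c)
        sum[c] += table[i][c];

    // Store the residual instead of the state value, but only for
    // columns that look like sums: if the US value (row 0) is less than
    // 8 times the first county of the first state (row 2), treat the
    // column as an average and leave it alone.
    for (int c = 0; c < cols; ++c)
      if (llabs(table[0][c]) >= 8 * llabs(table[2][c]))
        table[r][c] -= int32_t(sum[c]);
  }
}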

Next I sorted the rows by population, because many of the other
statistics are roughly proportional to it. After some experiments, I
got the best result sorting by 2001 population in ascending order,
which is in the middle of the range from 1980 to 2010. The reverse
transform sorts on column 0, which consists of 2-digit state codes and
3-digit county codes.

Rows sorted by 2001 resident population (col 42)
22,343,190 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,835,523 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
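
The sort itself is trivial. Roughly, assuming the table is in memory
with column 42 holding 2001 population and column 0 the combined
state/county code (a sketch, not the actual code):

#include <algorithm>
#include <cstdint>
#include <vector>
using namespace std;

// Forward: sort rows by 2001 resident population (column 42), ascending.
void sortRowsByPopulation(vector<vector<int32_t>>& table) {
  stable_sort(table.begin(), table.end(),
              [](const vector<int32_t>& a, const vector<int32_t>& b) {
                return a[42] < b[42];
              });
}

// Reverse: column 0 (2-digit state code + 3-digit county code) is unique
// and increasing in the original file order, so sorting on it restores
// the original row order.
void restoreRowOrder(vector<vector<int32_t>>& table) {
  stable_sort(table.begin(), table.end(),
              [](const vector<int32_t>& a, const vector<int32_t>& b) {
                return a[0] < b[0];
              });
}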

I tried encoding each column by subtracting the previous column, since
most of the adjacent columns are highly correlated, for example, 2001
population followed by 2002 population. But this actually made
compression worse with the 2-D context models used above, while
improving compression with standard compressors.

Subtract previous column
25,437,311 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
25,859,716 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
26,877,561 x-ms7c0.4i2.2.4m.zpaq
27,913,916 x.pmm
29,635,981 x.7z
30,174,140 x-b100m.bsc
31,549,227 x.bz2
38,420,807 x-9.zip
84,760,704 x
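
For reference, the plain column-delta transform above is just the
following sketch. Working right to left in place keeps it trivially
reversible with a running sum.

#include <cstdint>
#include <vector>
using namespace std;

// Forward: replace each column (except the first) with its difference
// from the previous column.
void deltaColumns(vector<vector<int32_t>>& table) {
  const int cols = int(table[0].size());
  for (auto& row : table)
    for (int c = cols - 1; c > 0; --c)
      row[c] -= row[c - 1];
}

// Reverse: a running sum across each row undoes the transform.
void undeltaColumns(vector<vector<int32_t>>& table) {
  const int cols = int(table[0].size());
  for (auto& row : table)
    for (int c = 1; c < cols; ++c)
      row[c] += row[c - 1];
}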

Subtract previous column in rows 2+ if increasing in row 1
25,075,543 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
28,604,577 x.7z

Subtract if close in row 1
24,498,885 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/16)
24,532,441 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/8)
24,563,025 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/4)
24,638,619 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (within 1/2)

None of these beat doing nothing, so I abandoned that approach and
instead sorted the columns by mutual information, roughly the bit
length of the difference between columns. This is the best result I
have found so far. Looking at the sorted order, about 90% of the time
the best match was already the adjacent column, reading forward or
backward.

Rows sorted by 2001 population and columns sorted by log(|x-y|+1)
21,880,993 rcsort-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,404,289 rcsort-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

By sorting, I mean picking as the next column the one that takes the
fewest bits to represent the difference from the current column. It's
a greedy approximation to the traveling salesman problem, where you
always go to the closest city you haven't visited yet. But even using
a fast integer approximation of the log function, it takes about 30
minutes to compare every pair of columns. I can speed that up to about
2 minutes by sampling every 16 rows.

Sample every 16 rows, 4 bit log approx
21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
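
Roughly, the greedy ordering with sampling looks like the sketch below.
I use a plain bit length as a stand-in for the 4-bit log approximation,
and starting from column 0 is arbitrary, so treat it as an outline
rather than the actual code.

#include <cstdint>
#include <cstdlib>
#include <vector>
using namespace std;

// Integer approximation of log2(|d|+1): the bit length of the difference.
static int bitLength(int64_t d) {
  d = llabs(d);
  int n = 0;
  while (d) { ++n; d >>= 1; }
  return n;
}

// Cost of putting column j right after column i, estimated from every
// 16th row.
static int64_t cost(const vector<vector<int32_t>>& t, int i, int j) {
  int64_t sum = 0;
  for (size_t r = 0; r < t.size(); r += 16)
    sum += bitLength(int64_t(t[r][j]) - t[r][i]);
  return sum;
}

// Greedy nearest-neighbor ordering: start at column 0 and repeatedly
// append the unvisited column whose difference from the current column
// takes the fewest bits, like always visiting the closest unvisited city.
vector<int> orderColumns(const vector<vector<int32_t>>& t) {
  const int cols = int(t[0].size());
  vector<bool> used(cols, false);
  vector<int> order(1, 0);
  used[0] = true;
  for (int step = 1; step < cols; ++step) {
    int best = -1;
    int64_t bestCost = 0;
    for (int j = 0; j < cols; ++j) {
      if (used[j]) continue;
      int64_t c = cost(t, order.back(), j);
      if (best < 0 || c < bestCost) { best = j; bestCost = c; }
    }
    used[best] = true;
    order.push_back(best);
  }
  return order;  // column permutation to apply before compression
}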

I did not attempt to normalize the data, but that may be the next thing I try.

On Tue, Aug 22, 2023 at 2:13 PM James Bowery <[email protected]> wrote:

> It's interesting that Wikipedia doesn't have a specific "History" section 
> for factor analysis describing its origin.  Instead they have a bunch of 
> sections on various applications of factor analysis that discuss the history 
> of its application to that specialty.  Spearman committed the crime of 
> discovering the g-factor of intelligence, the denial of which causing 
> catastrophic damage to humanity because such denial is a globally imposed 
> religious belief that racial group differences in outcome are not 
> significantly influenced by heritable aspects of intelligence.

I don't think you can determine that from the data. The correlations
between race, education, income, and crime are well known and probably
in this data, but that says nothing about whether the cause is
genetics or environment. To show that random variable X causes Y, you
need to know how Y behaves when X is changed independently. But the
data only gives you the observed joint probability P(X, Y).
