On Wed, Aug 23, 2023 at 9:02 PM Matt Mahoney <[email protected]> wrote:
> More progress: Previously I compressed LaboratoryOfTheCounties by
> converting the spreadsheet to 32-bit integers, transposing to column
> order, and compressing using the previous column as context. I
> developed a fast and slow model using zpaq.
>
> 23,369,801 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq (73s)
> 23,918,910 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (27s)

The details are a bit mysterious to me. It might be clearer to use the
words "cases" and "variables" rather than "rows" and "columns". Does
"transposing to column order" result in variable-major order after you
have sorted variables based on their mean values to enhance compression,
as in "previous variable as context"?

A technical question on zpaq's ability to achieve what I assume amounts
to delta coding for compression: is zpaq, in effect, learning to parse a
bit string into groups of 32 bits and then performing arithmetic
operations on the sequence of those groups, treated as integers, to
produce the deltas? That's pretty impressive if so.

I suppose converting to integers requires scaling any data that has
fractional values, which means that recovering the original data
requires retaining the scaling factors. So you're retaining that factor
for each variable, right? (I sketch the preprocessing and the delta
arithmetic I have in mind at the end of this message.)

> Sample every 16 rows, 4 bit log approx
>
> 21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
> 22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq

That's a compression ratio of 4.2 to 1. Pretty impressive given the lack
of any explicit use of the statistical analysis techniques normally used
in deriving relationships. (My guess at what that sampled 4-bit log
context might be doing is also sketched at the end.)

> I did not attempt to normalize the data, but that may be the next
> thing I try.
>
> On Tue, Aug 22, 2023 at 2:13 PM James Bowery <[email protected]> wrote:
>
> > ** It's interesting that Wikipedia doesn't have a specific "History"
> > section for factor analysis describing its origin. Instead they have
> > a bunch of sections on various applications of factor analysis that
> > discuss the history of its application to that specialty. Spearman
> > committed the crime of discovering the g-factor of intelligence, the
> > denial of which has caused catastrophic damage to humanity, because
> > such denial is a globally imposed religious belief that racial group
> > differences in outcome are not significantly influenced by heritable
> > aspects of intelligence.
>
> I don't think you can determine that from the data. The correlations
> between race, education, income, and crime are well known and probably
> in this data, but that says nothing about whether the cause is
> genetics or environment. To show that random variable X affects Y, you
> need P(X) and P(Y). But the data only gives you the joint probability
> P(X, Y).

With "the data" referring to "this data", you're *probably* right (the
toy example at the end of this message shows why the joint alone can't
settle it), but even when dealing with ecological variables it is not
inconceivable that a *latent* variable model may, by *assuming*
reproductive dynamics in the population, produce better compression. In
any event, to be clear, the damage I'm talking about results from
denying data that goes way beyond this dataset.
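
To make the scaling question concrete, here is a minimal Python sketch
of the preprocessing as I understand Matt's description. The function
name, the power-of-ten scaling rule, and the exact byte layout are my
guesses, not his actual code:

import struct

def to_int32_variable_major(table):
    """table: list of cases, each a list of floats (one per variable).
    Returns (blob, scales): the table as raw little-endian int32 bytes
    in variable-major (column-major) order, plus the per-variable scale
    factors needed to recover the original fractional values."""
    nvars = len(table[0])
    scales, columns = [], []
    for j in range(nvars):
        col = [row[j] for row in table]
        # Guessed rule: smallest power of 10 that makes every value in
        # the variable integral, capped so values still fit in 32 bits.
        scale = 1
        while scale < 10**6 and any(v * scale != round(v * scale) for v in col):
            scale *= 10
        scales.append(scale)
        columns.append([int(round(v * scale)) for v in col])
    # Variable-major layout: all of variable 0, then all of variable 1,
    # and so on, so each value's natural context -- the same case's
    # value in the previous variable -- sits 4*len(table) bytes back.
    blob = b"".join(struct.pack("<i", v) for col in columns for v in col)
    return blob, scales

Decompression would unpack the integers and divide each variable by its
stored scale factor, i.e., one retained factor per variable, which is
what I am asking Matt to confirm.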
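
And here is the delta-coding arithmetic I am asking whether zpaq
effectively learns. zpaq's context-mixing models do not literally
subtract; they predict bits under a context (here, bytes of the
previous variable), which captures much of the same redundancy
statistically. This is my sketch of the explicit version over the
layout above, not of zpaq internals:

import struct

def delta_encode(blob, ncases):
    """blob: little-endian int32s in variable-major order, ncases
    values per variable, so vals[i - ncases] is the same case's value
    in the previous variable."""
    n = len(blob) // 4
    vals = struct.unpack("<%di" % n, blob)
    # The first variable is stored as-is; later ones as differences
    # from the previous variable, wrapped to unsigned 32 bits (a
    # decoder would add them back modulo 2**32).
    deltas = [(v - (vals[i - ncases] if i >= ncases else 0)) & 0xFFFFFFFF
              for i, v in enumerate(vals)]
    return struct.pack("<%dI" % n, *deltas)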
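
"Sample every 16 rows, 4 bit log approx" is named but not specified, so
this is only my guess at the idea: since the same two models are listed
and the archives shrink by only about 6%, I read it as coarsening the
*context* rather than the data -- reducing the context value to a 4-bit
logarithmic bucket, sampled from every 16th case:

def log_approx_4bit(v):
    """Collapse a 32-bit magnitude to a 4-bit logarithmic bucket
    (0..15) so that nearby magnitudes share a context and statistics
    pool faster. The bucket width of two bit-positions is my guess."""
    return min(15, (abs(v).bit_length() + 1) // 2)

def sampled_context(vals, i, ncases):
    """Guessed context for vals[i]: the log bucket of the same case's
    value in the previous variable, snapped to every 16th case."""
    j = i - ncases
    return 0 if j < 0 else log_approx_4bit(vals[j - (j % 16)])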
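
Finally, the toy example promised above, illustrating Matt's
identifiability point numerically: two different causal stories can
induce exactly the same joint distribution P(X, Y), so observational
data alone cannot say whether X affects Y. The parameters are mine,
chosen only so that the two joints coincide:

from itertools import product

def joint_direct(flip=0.1):
    """X -> Y: X is a fair coin; Y copies X, flipping with prob flip."""
    return {(x, y): 0.5 * (flip if x != y else 1 - flip)
            for x, y in product((0, 1), repeat=2)}

def joint_confounded(a=(1 - 0.8 ** 0.5) / 2):
    """Z -> X and Z -> Y: a hidden fair coin Z; X and Y each copy Z but
    flip independently with probability a. No arrow between X and Y.
    The default a makes P(X = Y) = 0.9, matching joint_direct."""
    return {(x, y): sum(0.5 * (a if x != z else 1 - a)
                            * (a if y != z else 1 - a) for z in (0, 1))
            for x, y in product((0, 1), repeat=2)}

d, c = joint_direct(), joint_confounded()
assert all(abs(d[k] - c[k]) < 1e-9 for k in d)  # identical joints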

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T30092c5d8380b42f-M38796645fb843725e8d8c9c5
Delivery options: https://agi.topicbox.com/groups/agi/subscription