On Wed, Aug 23, 2023 at 9:02 PM Matt Mahoney <[email protected]> wrote:

> More progress: Previously I compressed LaboratoryOfTheCounties by
> converting the spreadsheet to 32 bit integers, transposing to column
> order, and compressing using the previous column as context. I
> developed a fast and slow model using zpaq.
>
> 23,369,801 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq (73s)
> 23,918,910 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq (27s)
>

The details are a bit mysterious to me. It might be clearer to use the
words "cases" and "variables" rather than "rows" and "columns". Does
"transposing to column order" result in variable-major order after you
have sorted the variables by their mean values to enhance compression, as
in "previous variable as context"?

A technical question on zpaq's ability to achieve what I assume amounts to
delta-coding for compression:

Is zpaq, in effect, learning to parse a bit string in groups of 32 bits and
then performing arithmetic operations on the sequence of those groups
treated as integers to produce the deltas?  That's pretty impressive if so.
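
In other words, I imagine something operationally equivalent to this toy
illustration of 32-bit delta coding (my mental model, not a claim about
zpaq's internals):

    import numpy as np

    # Toy 32-bit delta coding: reinterpret the byte stream as 32-bit
    # integers and code each value as the difference from its predecessor;
    # the small residuals are what compress well. "x" is hypothetical.
    raw = np.fromfile("x", dtype=np.int32)
    deltas = np.diff(raw, prepend=np.int32(0))
    restored = np.cumsum(deltas, dtype=np.int32)  # decoding inverts exactly
    assert np.array_equal(restored, raw)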

I suppose converting to integers requires scaling any data that has
fractional values, which means that recovering the original data requires
retaining the scaling factors. So you're retaining that factor for each
variable, right?
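
Something like this, I'd imagine (an assumption about the pipeline, not a
description of your code):

    import numpy as np

    # Assumed per-variable scaling: multiply each variable by a factor that
    # makes its values integral, and keep that factor so the originals can
    # be recovered exactly on decompression.
    def to_ints(column, factor):
        return np.round(column * factor).astype(np.int32)

    def from_ints(ints, factor):
        return ints.astype(np.float64) / factor

    col = np.array([1.25, 1.50, 1.75])   # made-up fractional variable
    factor = 100                         # stored per variable with the archive
    assert np.allclose(from_ints(to_ints(col, factor), factor), col)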

> Sample every 16 rows, 4 bit log approx
> 21,906,733 x-ms7c0.4.255i2.4c0.4.13795.255.255i1.3amst.zpaq
> 22,426,957 x-ms7c0.4.255.255c0.4.13795.255.255am.zpaq
>

That's a compression ratio of 4.2 to 1. Pretty impressive given the
absence of any explicit use of the statistical analysis techniques
normally used in deriving such relationships.
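
For concreteness, my guess at what "4 bit log approx" means (quite
possibly not your actual scheme) is a mapping along these lines:

    import math

    # Guessed 4-bit logarithmic code: map each sampled magnitude to a code
    # that grows with log2 of the value and is clamped to fit in 4 bits.
    def log_code(x):
        return min(15, int(math.log2(1 + abs(x))))

    print([log_code(v) for v in (0, 1, 10, 1000, 10**6)])  # [0, 1, 3, 9, 15]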


> I did not attempt to normalize the data, but that may be the next thing I
> try.
>
> On Tue, Aug 22, 2023 at 2:13 PM James Bowery <[email protected]> wrote:
>
> > It's interesting that Wikipedia doesn't have a specific "History"
> > section for factor analysis describing its origin. Instead, it has a
> > bunch of sections on various applications of factor analysis that
> > discuss the history of its application to that specialty. Spearman
> > committed the crime of discovering the g-factor of intelligence, the
> > denial of which has caused catastrophic damage to humanity, because
> > such denial is a globally imposed religious belief that racial group
> > differences in outcome are not significantly influenced by heritable
> > aspects of intelligence.
>
> I don't think you can determine that from the data. The correlations
> between race, education, income, and crime are well known and probably
> present in this data, but that says nothing about whether the cause is
> genetics or environment. To show that a random variable X causally
> affects Y, you need interventional information such as P(Y | do(X)).
> But observational data only gives you the joint probability P(X, Y).


"the data" referring to "this data", you're *probably* right but even when
dealing with ecological variables, it is not inconceivable that a
*latent* variable
model may, by *assuming* reproductive dynamics in the population, produce
better compression.
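
As a toy illustration of that intuition (illustrative numbers only): when
two observed variables share a hidden factor, residuals against even a
crude one-factor fit have far less variance, and hence cost fewer bits,
than the variables coded independently.

    import numpy as np

    # Two observables sharing a hidden factor g; residuals against a crude
    # factor estimate have much smaller variance than the raw variable.
    rng = np.random.default_rng(0)
    g = rng.normal(size=100_000)                   # hidden factor
    x = 0.8 * g + 0.2 * rng.normal(size=g.size)    # observed variable 1
    y = 0.8 * g + 0.2 * rng.normal(size=g.size)    # observed variable 2
    g_hat = (x + y) / 2                            # crude factor estimate
    print(np.var(y), np.var(y - 0.8 * g_hat))      # ~0.68 vs ~0.05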

In any event, to be clear, the damage I'm talking about results from
denying data that goes way beyond this dataset.
