# Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

I realize it is tempting to use lossy text compression as a test for AI because that is what the human brain does when we read text and recall it in paraphrased fashion.  We remember the ideas and discard details about the _expression_ of those ideas.  A lossy text compressor that did the same thing would certainly demonstrate AI.

But there are two problems with using lossy compression as a test of AI:
1. The test is subjective.
2. Lossy compression does not imply AI.

Lets assume we solve the subjectivity problem by having human judges evaluate whether the decompressed output is "close enough" to the input.  We already do this with lossy image, audio and video compression (without much consensus).

The second problem remains: ideal lossy compression does not imply passing the Turing test.  For lossless compression, it can be proven that it does.  Let p(s) be the (unknown) probability that s will be the prefix of a text dialog.  Then a machine that can compute p(s) exactly is able to generate response A to question Q with the distribution p(QA)/p(Q) which is indistinguishable from human.  The same model minimizes the compressed size, E[log 1/p(s)].

This proof does not hold for lossy compression because different lossless models map to identical lossy models.  The desired property of a lossless compressor C is that if and only if s1 and s2 have the same meaning (to most people), then the encodings C(s1) = C(s2).  This code will ideally have length log 1/(p(s1)+p(s2)).  But this does not imply that the decompressor knows p(s1) or p(s2).  Thus, the decompressor may decompress to s1 or s2 or choose randomly between them.  In general, the output distribution will be different than the true distrubution p(s1), p(s2), so it will be distinguishable from human even if the compression ratio is ideal.

-- Matt Mahoney, [EMAIL PROTECTED]

----- Original Message ----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 9:28:26 AM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

>> I don't see any point in this debate over lossless vs. lossy compression

Lets see if I can simplify it.
1. The stated goal is compressing human knowledge.
2. The exact, same knowledge can always be expressed in a *VERY* large number of different bit strings
3. Not being able to reproduce the exact bit string is lossy compression when viewed from the bit viewpoint but can be lossless from the knowledge viewpoint
4. Therefore, reproducing the bit string is an additional requirement above and beyond the stated goal
5. I strongly believe that this additional requirement will necessitate a *VERY* large amount of additional work not necessary for the stated goal
6. In addition, by information theory, reproducing the exact bit string will require additional information beyond the knowledge contained in it (since numerous different strings can encode the same knowledge)
7. Assuming optimal compression, also by by information theory, additional information will add to the compressed size (i.e. lead to a less optimal result).
So the question is "Given that bit-level reproduction is harder, not necessary for knowledge compression/intelligence, and doesn't allow for the same degree of compression.  Why make life tougher when it isn't necessary for your stated purposes and makes your results (i.e. compression) worse?"

----- Original Message -----
Sent: Tuesday, August 15, 2006 12:55 AM
Subject: Re: Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

Where will the knowledge to compress text come from?  There are 3 possibilities.

1. externally supplied, like the lexical models (dictionaries) for paq8h and WinRK.
2. learned from the input in a separate pass, like xml-wrt|ppmonstr.
3. learned online in one pass, like paq8f and slim.

These all have the same effect on compressed size.  In the first case, you increase the size of the decompressor.  In the second, you have to append the model you learned from the first pass to the compressed file so it is available to the decompressor.  In the third case, compression is poor at the beginning.  From the viewpoint of information theory, there is no difference in these three approaches.  The penalty is the same.

To improve compression further, you will need to model semantics and/or syntax.  No compressor currently does this.  I think the reason is that it is not worthwhile unless you have hundreds of megabytes of natural language text.  In fact, only the top few compressors even have lexical models.  All the rest are byte oriented n-gram models.

A semantic model would know what words are related, like "star" and "moon".  It would learn this by their tendency to appear together.  You can build a dictionary of such knowledge from the data set itself or you can build it some other way (such as Wordnet) and include it in the decompressor.  If you learn it from the input, you could do it in a separate pass (like LSA) or you could do it in one pass (maybe an equivalent neural network) so that you build the model as you compress.

To learn syntax, you can cluster words by similarity of their immediate context.  These clusters correspond to part of speech.  For instance, "the X is" tells you that X is a noun.  You can model simple grammars as n-grams over their classifications, such as (Art Noun Verb).  Again, you can use any of 3 approaches.

Learning semantics and syntax is a hard problem, but I think you can see it can be done with statistical modeling.  The training data you need is in the input itself.

I don't see any point in this debate over lossless vs. lossy compression.  You have to solve the language learning problem in either case to improve compression.  I think it will be more productive to discuss how this can be done.

-- Matt Mahoney, [EMAIL PROTECTED]