>> 1. The test is subjective.

I disagree. If you have an automated test with clear criteria like the following, it will be completely objective (a sketch of such a harness follows the criteria below):
a) the compressing program must be able to output all inconsistencies in the corpus (in their original string form), AND
b) the decompressing program must be able to do the following when presented with a standard list of test ideas/pieces of knowledge:

FOR EACH IDEA/PIECE OF KNOWLEDGE IN THE TEST WHICH IS NOT IN THE LIST OF INCONSISTENCIES:
- if the knowledge is in the corpus, recognize that it is in the corpus.
- if the negation of the knowledge is in the corpus, recognize that the test knowledge is false according to the corpus.
- if an incorrect substitution has been made to create the test item from an item in the corpus (e.g. red for yellow, ten for twenty, etc.), recognize that the test knowledge is false according to the corpus.
- if a possibly correct (hierarchical) substitution has been made to create the test item from an item in the corpus, recognize either a) that the corpus contains the broader concept (e.g. testing red against corpus lavender, testing dozens against corpus thirty-seven, etc.) or b) that there is related information in the corpus which the test idea further refines, for narrower substitutions.
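To make criterion (b) concrete, here is a minimal Python sketch of what such an automated harness might look like. Everything in it is hypothetical: the decompressor interface, the judge() method, the verdict labels, and the test-item format are my assumptions, not part of any existing compressor.

# Hypothetical harness for the objectivity test sketched above.
# The decompressor interface and the Verdict labels are assumptions.
from enum import Enum

class Verdict(Enum):
    IN_CORPUS = 1            # knowledge is in the corpus
    FALSE_NEGATION = 2       # negation is in the corpus
    FALSE_SUBSTITUTION = 3   # incorrect substitution detected
    BROADER_IN_CORPUS = 4    # corpus holds the broader concept
    REFINES_CORPUS = 5       # test idea refines related corpus info

def run_test(decompressor, test_items, inconsistencies):
    """Score a decompressor on a standard list of (item, expected) pairs,
    where `expected` is the Verdict the test's authors assigned."""
    scored = []
    for item, expected in test_items:
        if item in inconsistencies:
            continue  # criterion (b) skips listed inconsistencies
        scored.append(decompressor.judge(item) == expected)
    return sum(scored) / len(scored)  # objective score in [0, 1]

Because the score is computed mechanically from fixed criteria, two runs on the same decompressor always agree, which is the sense in which the test is objective.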
>> 2. Lossy compression does not imply AI.

and two sentences before:

>> A lossy text compressor that did the same thing (recall it in paraphrased fashion) would certainly demonstrate AI.

Require that the decompressing program be able to output all of the compressed file's knowledge in ordinary English. This is a pretty trivial task compared to everything else.
Mark
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:27 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
I realize it is tempting to use lossy text compression as a test for AI because that is what the human brain does when we read text and recall it in paraphrased fashion. We remember the ideas and discard details about the expression of those ideas. A lossy text compressor that did the same thing would certainly demonstrate AI.

But there are two problems with using lossy compression as a test of AI: 1. The test is subjective. 2. Lossy compression does not imply AI.
Let's assume we solve the subjectivity problem by having human judges evaluate whether the decompressed output is "close enough" to the input. We already do this with lossy image, audio and video compression (without much consensus).
The second problem remains: ideal lossy compression does not imply passing the Turing test. For lossless compression, it can be proven that it does. Let p(s) be the (unknown) probability that s will be the prefix of a text dialog. Then a machine that can compute p(s) exactly is able to generate response A to question Q with the distribution p(QA)/p(Q), which is indistinguishable from human. The same model minimizes the compressed size, E[log 1/p(s)].
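To see the duality concretely, here is a toy Python sketch; the bigram character model and the sample corpus are stand-ins I invented for any real model of p(s). The same conditional probabilities drive both response generation (sampling A with probability p(QA)/p(Q)) and the ideal lossless code length log 1/p(s) that an arithmetic coder would approach.

# Toy illustration: one probability model serves both prediction and
# lossless compression. The corpus and bigram model are invented.
import math, random
from collections import Counter, defaultdict

corpus = "the star and the moon appear together at night. the moon is bright."
ALPHABET = sorted(set(corpus))

counts = defaultdict(Counter)        # bigram counts, add-one smoothed below
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def p_next(ctx, c):                  # p(next char | previous char)
    return (counts[ctx][c] + 1) / (sum(counts[ctx].values()) + len(ALPHABET))

def code_length_bits(s):
    """Ideal lossless code length: log2 1/p(s), summed per symbol."""
    return sum(math.log2(1 / p_next(a, b)) for a, b in zip(s, s[1:]))

def respond(question, n=20):
    """Sample a continuation A with probability p(QA)/p(Q) by chaining
    the same conditionals the compressor uses."""
    s = question
    for _ in range(n):
        r, acc = random.random(), 0.0
        for c in ALPHABET:
            acc += p_next(s[-1], c)
            if acc >= r:
                s += c
                break
        else:
            s += ALPHABET[-1]        # guard against rounding at acc ~ 1.0
    return s[len(question):]

print(code_length_bits(corpus))      # bits an arithmetic coder would need
print(respond("the moon "))          # a continuation drawn from p(QA)/p(Q)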
This proof does not hold for lossy compression because different lossless models map to identical lossy models. The desired property of a lossy compressor C is that the encodings C(s1) = C(s2) if and only if s1 and s2 have the same meaning (to most people). This code will ideally have length log 1/(p(s1)+p(s2)). But this does not imply that the decompressor knows p(s1) or p(s2). Thus, the decompressor may decompress to s1 or s2 or choose randomly between them. In general, the output distribution will be different from the true distribution p(s1), p(s2), so it will be distinguishable from human even if the compression ratio is ideal.
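A tiny numerical illustration of this (the probabilities are invented): let s1 and s2 be two paraphrases of the same idea with true probabilities 0.03 and 0.01.

# Ideal lossy code length vs. the output distribution the decoder loses.
# The probabilities are invented for the example.
import math

p_s1, p_s2 = 0.03, 0.01
ideal_bits = math.log2(1 / (p_s1 + p_s2))  # ideal length of the shared code

# The decoder sees one codeword for both strings. Not knowing p(s1) and
# p(s2), it might emit each with probability 1/2, instead of the true
# conditional split below -- detectable, despite the ideal ratio.
true_split = p_s1 / (p_s1 + p_s2)          # P(s1 | meaning) = 0.75
print(f"{ideal_bits:.2f} bits; true split {true_split:.2f} vs. decoder 0.50")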
-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, August 15, 2006 9:28:26 AM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> I don't see any point in this debate over lossless vs. lossy compression

Let's see if I can simplify it.
- The stated goal is compressing human knowledge.
- The exact same knowledge can always be expressed in a *VERY* large number of different bit strings.
- Not being able to reproduce the exact bit string is lossy compression when viewed from the bit viewpoint, but it can be lossless from the knowledge viewpoint.
- Therefore, reproducing the bit string is an additional requirement above and beyond the stated goal.
- I strongly believe that this additional requirement will necessitate a *VERY* large amount of additional work not necessary for the stated goal.
- In addition, by information theory, reproducing the exact bit string will require additional information beyond the knowledge contained in it (since numerous different strings can encode the same knowledge).
- Assuming optimal compression, also by information theory, this additional information will add to the compressed size (i.e. lead to a less optimal result); a back-of-the-envelope sketch follows below.
So the question is: given that bit-level reproduction is harder, is not necessary for knowledge compression/intelligence, and doesn't allow the same degree of compression, why make life tougher when it isn't necessary for your stated purposes and makes your results (i.e. compression) worse?
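A back-of-the-envelope sketch of that information-theoretic point (all numbers invented): if k distinct strings can express one piece of knowledge, lossless reproduction must spend about log2(k) extra bits per piece just to record which rendering the source happened to use.

# Overhead a lossless scheme pays to pin down one of k equivalent
# renderings per idea. All numbers are invented for illustration.
import math

k = 1_000_000                 # hypothetical renderings per idea
ideas = 100_000               # hypothetical ideas in the corpus
extra_bits = math.log2(k)     # ~19.9 bits per idea
print(f"{ideas * extra_bits / 8 / 1e6:.2f} MB beyond the knowledge itself")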
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:55 AM
Subject: Re: Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Where will the knowledge to compress text come from? There are 3 possibilities:
1. externally supplied, like the lexical models (dictionaries) for paq8h and WinRK.
2. learned from the input in a separate pass, like xml-wrt|ppmonstr.
3. learned online in one pass, like paq8f and slim.

These all have the same effect on compressed size. In the first case, you increase the size of the decompressor. In the second, you have to append the model you learned from the first pass to the compressed file so it is available to the decompressor. In the third case, compression is poor at the beginning. From the viewpoint of information theory, there is no difference among these three approaches. The penalty is the same.
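Case 1 can be seen in miniature with Python's zlib, which really does support preset dictionaries; the sample text and dictionary below are invented. The payload typically shrinks, but honest accounting charges the dictionary's size to the decompressor, so the penalty moves rather than disappears.

# Externally supplied dictionary (case 1) via zlib's preset-dictionary
# feature. Sample text and dictionary are invented.
import zlib

dictionary = b"the star and the moon appear together at night"
text = b"the moon and the star appear together at night, bright at night"

plain = zlib.compress(text)

co = zlib.compressobj(zdict=dictionary)
with_dict = co.compress(text) + co.flush()

do = zlib.decompressobj(zdict=dictionary)
assert do.decompress(with_dict) == text   # decoder needs the dictionary too

# Compare payload alone vs. payload plus the dictionary it depends on.
print(len(plain), len(with_dict), len(with_dict) + len(dictionary))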
To improve compression further, you will need to model semantics and/or syntax. No compressor currently does this. I think the reason is that it is not worthwhile unless you have hundreds of megabytes of natural language text. In fact, only the top few compressors even have lexical models. All the rest are byte-oriented n-gram models. A semantic model would know which words are related, like "star" and "moon". It would learn this from their tendency to appear together. You can build a dictionary of such knowledge from the data set itself, or you can build it some other way (such as WordNet) and include it in the decompressor. If you learn it from the input, you could do it in a separate pass (like LSA) or in one pass (maybe an equivalent neural network) so that you build the model as you compress.
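Here is a minimal sketch of the co-occurrence idea, using raw window counts rather than LSA; the two corpus lines are invented.

# Learn that "star" and "moon" are related from their tendency to
# appear together within a small window. The corpus is invented.
from collections import Counter

corpus = ("the star and the moon appear together at night "
          "the moon and the star fade at dawn").split()

WINDOW = 4
cooc = Counter()
for i in range(len(corpus)):
    for j in range(i + 1, min(i + WINDOW, len(corpus))):
        cooc[frozenset((corpus[i], corpus[j]))] += 1

# Frequent pairs are candidates for a shared semantic class, which a
# compressor could exploit when predicting the next word.
print(cooc[frozenset(("star", "moon"))])   # 2 in this toy corpus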
To learn syntax, you can cluster words by the similarity of their immediate contexts. These clusters correspond to parts of speech. For instance, "the X is" tells you that X is a noun. You can model simple grammars as n-grams over these classifications, such as (Art Noun Verb). Again, you can use any of the 3 approaches.
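And a correspondingly small sketch of context clustering; the sentences are invented, and a real system would use statistical similarity of context distributions rather than exact set equality.

# Words that share immediate contexts ("the X is") fall into
# part-of-speech-like classes. The sentences are invented.
from collections import defaultdict

sentences = ["the cat is here", "the dog is here",
             "a cat was here", "a dog was here"]

contexts = defaultdict(set)
for s in sentences:
    w = s.split()
    for i in range(1, len(w) - 1):
        contexts[w[i]].add((w[i - 1], w[i + 1]))   # (left, right) context

# "cat" and "dog" get identical context sets -> same class (nouns).
print(contexts["cat"] == contexts["dog"])          # True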
Learning semantics and syntax is a hard problem, but I think you can see that it can be done with statistical modeling. The training data you need is in the input itself. I don't see any point in this debate over lossless vs. lossy compression. You have to solve the language learning problem in either case to improve compression. I think it will be more productive to discuss how this can be done.
-- Matt Mahoney, [EMAIL PROTECTED]