>> You
could use Keogh's compression dissimilarity measure to test for
inconsistency.
I don't think so. Take the following strings:
"I only used red and yellow paint in the painting", "I painted the rose in my
favorite color", "My favorite color is pink", "Orange is created by mixing red
and yellow", "Pink is created by mixing red and white". How is Keogh's
measure going to help you with that?
The problem is that Keogh's measure is intended for data mining, where you have separate instances, not one big entwined Gordian knot.
>> Now if
only we had some test to tell which compressors have the best language
models...
Huh? By definition, the compressor with the best
language model is the one with the highest compression ratio.
----- Original Message -----
Sent: Tuesday, August 15, 2006 3:54 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
You could use Keogh's compression dissimilarity measure to test for inconsistency. http://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
CDM(x,y) = C(xy)/(C(x)+C(y)), where x and y are strings and C(x) means the compressed size of x (lossless). The measure ranges from about 0.5 if x = y to about 1.0 if x and y do not share any information.
Then CDM("it is hot", "it is very warm") < CDM("it is hot", "it is cold"), assuming your compressor uses a good language model. Now if only we had some test to tell which compressors have the best language models...
-- Matt Mahoney, [EMAIL PROTECTED]
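For concreteness, a minimal sketch of the CDM calculation above, using Python's zlib as a stand-in lossless compressor (the helper names are just illustrative, and any compressor could be plugged in):

import zlib

def compressed_size(s):
    # Compressed size in bytes, using zlib as a stand-in lossless compressor.
    return len(zlib.compress(s.encode("utf-8"), 9))

def cdm(x, y):
    # Keogh's compression dissimilarity measure: C(xy) / (C(x) + C(y)).
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))

print(cdm("it is hot", "it is very warm"))
print(cdm("it is hot", "it is cold"))

With a byte-oriented compressor like zlib the two values will not separate in the way the example suggests; that ordering would only emerge from a compressor whose model actually captures meaning, which is exactly the point in dispute.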
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 3:22:10 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> Could you please write a test program to objectively test for lossy
text compression using your algorithm?
Writing the test program for the decompressing program is relatively easy. Since the requirement was that the decompressing program be able to recognize when a piece of knowledge is in the corpus, when its negation is in the corpus, when an incorrect substitution has been made, and when a correct substitution has been made -- all you/I would need to do is invent (or obtain -- see two paragraphs down) a reasonably sized set of knowledge pieces to test, put them in a file, feed them to the decompressing program, and automatically grade its answers as to which category each falls into. A reasonably small number of test cases should suffice as long as you don't advertise exactly which test cases are in the final test; once you have competitors generate each other's tests, you can go hog-wild with the number.
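A rough sketch of what that grading harness might look like (the tab-separated test-file format, the category names, and the way the decompressor is queried are all assumptions for illustration, not part of any actual contest):

# Hypothetical grading harness: feed each test statement to a contestant's
# decompressing program and score its classification against an answer key.
# The test-file format ("statement<TAB>expected_category") and the category
# names are assumptions made for illustration only.
import subprocess
import sys

CATEGORIES = {
    "IN_CORPUS",           # the statement appears in the corpus
    "NEGATED",             # its negation appears in the corpus
    "FALSE_SUBSTITUTION",  # an incorrect substitution was made
    "VALID_SUBSTITUTION",  # a broader/narrower (hierarchical) substitution
}

def ask_decompressor(program, statement):
    # Assumes the contestant's program reads one statement on stdin and
    # prints one of the category names above.
    result = subprocess.run([program, "--classify"], input=statement,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def grade(program, test_file):
    correct = total = 0
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            statement, expected = line.rstrip("\n").split("\t")
            answer = ask_decompressor(program, statement)
            correct += (answer == expected and expected in CATEGORIES)
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print("score: {:.1%}".format(grade(sys.argv[1], sys.argv[2])))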
Writing the test program for the compressing
program is also easy but developing the master list of inconsistencies is
going to be a real difficulty -- unless you use the various contenders
themselves to generate various versions of the list. I strongly doubt
that most contenders will get false positives but strongly suspect that
finding all of the inconsistencies will be a major area for improvement as the
systems become more sophisticated.
Note also that minor modifications of any
decompressing program should also be able to create test cases for your
decompressor test. Simply ask it for a random sampling of knowledge, for
the negations of a random sampling of knowledge, for some incorrect
substitutions, and some hierarchical substitutions of each type.
Any *real* contenders should be able to easily
generate the tests for you.
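A sketch of how such test cases could be generated mechanically (the corpus statements, substitution tables, crude negation rule, and category labels are all invented for illustration; a serious generator would query a contender's knowledge base as described above):

# Hypothetical test-case generator: take a few corpus statements and emit
# labeled test items by negating them or substituting terms. Everything
# here (statements, substitution tables, labels) is invented for illustration.
import random

CORPUS = [
    "the rose is red",
    "oranges contain vitamin C",
]
INCORRECT_SUBS = {"red": "yellow", "vitamin C": "vitamin K"}        # falsifies a fact
BROADER_SUBS = {"red": "a warm color", "oranges": "citrus fruits"}  # hierarchical

def negate(s):
    # Crude negation for illustration only; a real generator would parse the sentence.
    return s.replace(" is ", " is not ") if " is " in s else "it is not true that " + s

def substitute(s, table):
    for old, new in table.items():
        if old in s:
            return s.replace(old, new)
    return None

def generate():
    for s in CORPUS:
        yield s, "IN_CORPUS"
        yield negate(s), "NEGATED"
        wrong = substitute(s, INCORRECT_SUBS)
        if wrong is not None:
            yield wrong, "FALSE_SUBSTITUTION"
        broader = substitute(s, BROADER_SUBS)
        if broader is not None:
            yield broader, "VALID_SUBSTITUTION"

items = list(generate())
random.shuffle(items)
for statement, category in items:
    print(statement + "\t" + category)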
>> You
can start by listing all of the inconsistencies in
Wikipedia.
See paragraph 2 above.
>> To
make the test objective, you will either need a function to test whether two
strings are inconsistent or not, or else you need to show that people will
never disagree on this matter.
It is impossible to show that people will never
disagree on a matter.
On the other hand, a knowledge compressor is
going to have to recognize when two pieces of knowledge conflict (i.e. when
two strings parse into knowledge statements that cannot coexist). You
can always have a contender evaluate whether a competitor's
"inconsistencies" are incorrect and then do some examination by hand on a
representative sample where the contender says it can't tell (since,
again, I suspect you'll find few misidentified inconsistencies -- but that finding all of the inconsistencies will always remain subject to improvement).
>> >> Lossy
compression does not imply AI.
>> >> A lossy
text compressor that did the same thing (recall it in paraphrased
fashion) would certainly demonstrate AI.
>> I disagree
that these are inconsistent. Demonstrating and implying are different
things.
I didn't say that they were inconsistent. What I meant to say was:
1. A decompressing program that is able to output all of the compressed file's knowledge in ordinary English would, in your words, "certainly demonstrate AI".
2. Given statement 1, it's not a problem that "lossy compression does not imply AI", since the decompressing program would still "certainly demonstrate AI".
----- Original Message -----
Sent: Tuesday, August 15, 2006 2:23 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Mark,
Could you please write a test program to objectively test for lossy text compression using your algorithm? You can start by listing all of the inconsistencies in Wikipedia. To make the test objective, you will either need a function to test whether two strings are inconsistent or not, or else you need to show that people will never disagree on this matter.
>> Lossy compression does not imply AI.
>> A lossy text compressor that did the same thing (recall it in paraphrased fashion) would certainly demonstrate AI.
I disagree that these are inconsistent. Demonstrating and implying are different things.
-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 12:55:24 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> 1.
The test is subjective.
I disagree. If you have an automated test
with clear criteria like the following, it will be completely
objective:
a) the compressing program must be able to output all inconsistencies in the corpus (in their original string form), AND
b) the decompressing program must be able to do the following when presented with a standard list of test ideas/pieces of knowledge.
FOR EACH IDEA/PIECE OF KNOWLEDGE IN THE TEST WHICH IS NOT IN THE LIST OF INCONSISTENCIES:
- if the knowledge is in the corpus, recognize that it is in the corpus.
- if the negation of the knowledge is in the corpus, recognize that the test knowledge is false according to the corpus.
- if an incorrect substitution has been made to create the test item from an item in the corpus (e.g., red for yellow, ten for twenty, etc.), recognize that the test knowledge is false according to the corpus.
- if a possibly correct (hierarchical) substitution has been made to create the test item from an item in the corpus, recognize either a) that the substitution is covered by the corpus for broader concepts (e.g., testing red for corpus lavender, testing dozens for corpus thirty-seven, etc.) or b) that there is related information in the corpus which the test idea further refines, for narrower substitutions.
>> 2. Lossy compression does not imply
AI.
and two sentences before
>> A lossy text compressor that did the
same thing (recall it in paraphrased fashion) would certainly
demonstrate AI.
Require that the decompressing
program be able to output all of the compressed file's knowledge
in ordinary English. This is a pretty trivial task compared to
everything else.
Mark
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:27 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
I realize it is tempting to use lossy text compression as a test for
AI because that is what the human brain does when we read text and recall
it in paraphrased fashion. We remember the ideas and discard details
about the expression of those ideas. A lossy text compressor that
did the same thing would certainly demonstrate AI.
But there are two problems with using lossy compression as a test of AI:
1. The test is subjective.
2. Lossy compression does not imply AI.
Let's
assume we solve the subjectivity problem by having human judges evaluate
whether the decompressed output is "close enough" to the input. We
already do this with lossy image, audio and video compression (without
much consensus).
The second problem remains: ideal lossy
compression does not imply passing the Turing test. For lossless
compression, it can be proven that it does. Let p(s) be the
(unknown) probability that s will be the prefix of a text dialog.
Then a machine that can compute p(s) exactly is able to generate response A to question Q with the distribution p(QA)/p(Q), which is indistinguishable from a human. The same model minimizes the
compressed size, E[log 1/p(s)].
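A toy numeric sketch of the two quantities in this argument (the strings and probabilities below are invented purely to show the arithmetic; a real p(s) would range over all text prefixes):

# Toy illustration: given an exact model p(s) over whole strings, a response A
# to question Q can be sampled with probability p(QA)/p(Q), and the same model
# gives the expected code length E[log2 1/p(s)]. The strings and probabilities
# are invented purely to show the arithmetic.
import math
import random

p = {
    "Q: what color is the sky? A: blue":  0.6,
    "Q: what color is the sky? A: grey":  0.3,
    "Q: what color is the sky? A: green": 0.1,
}

def respond(question):
    # Sample a completion s = QA with probability p(QA) / p(Q).
    completions = {s: q for s, q in p.items() if s.startswith(question)}
    p_q = sum(completions.values())
    r = random.random() * p_q
    for s, q in completions.items():
        r -= q
        if r <= 0:
            return s[len(question):]
    return s[len(question):]  # guard against floating-point rounding

expected_bits = sum(q * math.log2(1 / q) for q in p.values())  # E[log2 1/p(s)]

print(respond("Q: what color is the sky?"))
print("E[log2 1/p(s)] = %.3f bits" % expected_bits)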
This proof does not hold for lossy compression because different lossless models map to identical lossy models. The desired property of a lossy compressor C is that the encodings C(s1) = C(s2) if and only if s1 and s2 have the same meaning (to most people). This code will ideally have length log 1/(p(s1)+p(s2)). But this does not imply that the decompressor knows p(s1) or p(s2). Thus, the decompressor may decompress to s1 or s2 or choose randomly between them. In general, the output distribution will be different from the true distribution p(s1), p(s2), so it will be distinguishable from a human even if the compression ratio is ideal.
-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 9:28:26 AM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> I don't see any point in this debate over lossless vs. lossy
compression
Let's see if I can simplify it.
- The stated goal is compressing human
knowledge.
- The exact, same knowledge can always be
expressed in a *VERY* large number of different bit strings
- Not being able to reproduce the exact bit
string is lossy compression when viewed from the bit viewpoint but
can be lossless from the knowledge viewpoint
- Therefore, reproducing the bit string
is an additional requirement above and beyond the stated
goal
- I strongly believe that this additional
requirement will necessitate a *VERY* large amount of additional work
not necessary for the stated goal
- In addition, by information theory, reproducing the exact bit string will require additional information beyond the knowledge contained in it (since numerous different strings can encode the same knowledge)
- Assuming optimal compression, also by information theory, that additional information will add to the compressed size (i.e. lead to a less optimal result); a quick numeric sketch follows below.
So the question is: "Given that bit-level reproduction is harder, not necessary for knowledge compression/intelligence, and doesn't allow for the same degree of compression, why make life tougher when it isn't necessary for your stated purposes and makes your results (i.e. compression) worse?"
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:55 AM
Subject: Re: Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Where will the knowledge to compress text come from? There are 3 possibilities:
1. externally supplied, like the lexical models (dictionaries) for paq8h and WinRK.
2. learned from the input in a separate pass, like xml-wrt|ppmonstr.
3. learned online in one pass, like paq8f and slim.
These all have the same effect on compressed size. In the first case, you increase the size of the decompressor. In the second, you have to append the model you learned from the first pass to the compressed file so it is available to the decompressor. In the third case, compression is poor at the beginning. From the viewpoint of information theory, there is no difference in these three approaches. The penalty is the same.
To improve compression further, you will need to model semantics and/or syntax. No compressor currently does this. I think the reason is that it is not worthwhile unless you have hundreds of megabytes of natural language text. In fact, only the top few compressors even have lexical models. All the rest are byte-oriented n-gram models.
A semantic model would know what words are related, like "star" and "moon". It would learn this by their tendency to appear together. You can build a dictionary of such knowledge from the data set itself or you can build it some other way (such as WordNet) and include it in the decompressor. If you learn it from the input, you could do it in a separate pass (like LSA) or you could do it in one pass (maybe an equivalent neural network) so that you build the model as you compress.
To learn syntax, you can cluster words by similarity of their immediate context. These clusters correspond to parts of speech. For instance, "the X is" tells you that X is a noun. You can model simple grammars as n-grams over their classifications, such as (Art Noun Verb). Again, you can use any of the 3 approaches.
Learning semantics and syntax is a hard problem, but I think you can see it can be done with statistical modeling. The training data you need is in the input itself.
I don't see any point in this debate over lossless vs. lossy compression. You have to solve the language learning problem in either case to improve compression. I think it will be more productive to discuss how this can be done.
-- Matt Mahoney, [EMAIL PROTECTED]
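As a minimal sketch of the co-occurrence idea described above (the toy corpus, window size, and similarity score are arbitrary choices; a usable model would of course be trained on far more text, and clustering words by context for syntax could reuse the same counts):

# Minimal sketch of learning word relatedness from co-occurrence: count which
# words appear near each other within a window, then compare words by the
# cosine similarity of their co-occurrence vectors. The tiny corpus and the
# window size are arbitrary choices for illustration.
import math
import re
from collections import Counter, defaultdict

corpus = ("the star shines at night . the moon shines at night . "
          "the sun rises at dawn . a bright star and a full moon")
words = re.findall(r"[a-z]+", corpus.lower())

WINDOW = 3
cooc = defaultdict(Counter)
for i, w in enumerate(words):
    for j in range(max(0, i - WINDOW), min(len(words), i + WINDOW + 1)):
        if j != i:
            cooc[w][words[j]] += 1

def cosine(a, b):
    va, vb = cooc[a], cooc[b]
    dot = sum(va[k] * vb[k] for k in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print("star~moon :", cosine("star", "moon"))
print("star~dawn :", cosine("star", "dawn"))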