If dumb models kill smart ones in text compression, then how do you know they are dumb?  What is your objective test of "smart"?  The fact is that in speech recognition research, language models with a lower perplexity also have lower word error rates.
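A rough sketch of that link (with made-up word probabilities, not from any real model): the bits a model needs to code a word are -log2 of the probability it assigns, so perplexity and ideal compressed size are two views of the same cross-entropy.

import math

# Hypothetical unigram probabilities; a real model would condition on context.
model = {"the": 0.07, "cat": 0.001, "sat": 0.0005, "on": 0.02, "mat": 0.0002}
words = ["the", "cat", "sat", "on", "the", "mat"]

bits = sum(-math.log2(model[w]) for w in words)  # ideal code length in bits
cross_entropy = bits / len(words)                # bits per word
perplexity = 2 ** cross_entropy                  # lower perplexity <=> fewer bits

print("%.1f bits total, %.2f bits/word, perplexity %.0f"
      % (bits, cross_entropy, perplexity))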

We have "smart" statistical parsers that are 60% accurate when trained and tested on manually labeled text.  So why haven't we solved the AI problem?  Meanwhile, a "dumb" model like matching query words to document words enables Google to answer natural language queries, while our smart parsers choke when you misspell a word.  Who is smart and who is dumb? 

-- Matt Mahoney, [EMAIL PROTECTED]


----- Original Message ----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Wednesday, August 16, 2006 9:17:52 AM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

>> You group the strings into a fixed set and a variable set and concatenate them.  The variable set could be just "I only used red and yellow paint in the painting", and you compare the CDM replacing "yellow" with "white".   Of course your compressor must be capable of abstract reasoning and have a world model.
 
Very nice example of "homunculus"/"turtles-all-the-way-down" reasoning.
 
>> The problem is that many people do not believe that text compression is related to AI (even though speech recognition researchers have been evaluating models by perplexity since the early 1990's).
I believe that it's related to AI . . . . but that the dumbest models kill intelligent models every time . . . . which then makes AI useless for text compression
 
And bit-level text storage and reproduction are unnecessary for AI (and add a lot of needless complexity) . . . .
 
So why are we combining the two?

----- Original Message -----
Sent: Tuesday, August 15, 2006 6:02 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

Mark wrote:
>Huh? By definition, the compressor with the best language model is the one with the highest compression ratio.

I'm glad we finally agree :-)

>> You could use Keogh's compression dissimilarity measure to test for inconsistency.
I don't think so.  Take the following strings: "I only used red and yellow paint in the painting", "I painted the rose in my favorite color", "My favorite color is pink", "Orange is created by mixing red and yellow", "Pink is created by mixing red and white".  How is Keogh's measure going to help you with that?

You group the strings into a fixed set and a variable set and concatenate them.  The variable set could be just "I only used red and yellow paint in the painting", and you compare the CDM replacing "yellow" with "white".   Of course your compressor must be capable of abstract reasoning and have a world model.
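A minimal sketch of those mechanics, using zlib as a stand-in compressor (zlib has no world model, so it will not actually separate the two cases; it only shows where the substitution and the CDM comparison happen):

import zlib

def C(s):
    # Lossless compressed size of s in bytes
    return len(zlib.compress(s.encode()))

def cdm(x, y):
    # Keogh's compression dissimilarity measure
    return C(x + y) / (C(x) + C(y))

fixed = ("I painted the rose in my favorite color. "
         "My favorite color is pink. "
         "Orange is created by mixing red and yellow. "
         "Pink is created by mixing red and white.")

variable_original = "I only used red and yellow paint in the painting"
variable_swapped  = "I only used red and white paint in the painting"

# An ideal compressor with a world model would find the swapped ("white")
# version more consistent with the fixed set and give it the lower CDM.
print(cdm(fixed, variable_original))
print(cdm(fixed, variable_swapped))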

To answer Phil's post:

Text compression is only near the theoretical limits for small files.  For large files, there is progress to be made in integrating known syntactic and semantic modeling techniques into general-purpose compressors.  The theoretical limit is about 1 bpc, and we are not there yet.  See the graph at http://cs.fit.edu/~mmahoney/dissertation/

The proof that I gave that a language model implies passing the Turing test is for the ideal case where all people share identical models.  The ideal case is deterministic.  For the real case where models differ, passing the test is easier because a judge will attribute some machine errors to normal human variation.  I discuss this in more detail at http://cs.fit.edu/~mmahoney/compression/rationale.html (text compression is equivalent to AI).

It is really hard to get funding for text compression research (or AI).  I had to change my dissertation topic to network security in 1999 because my advisor had funding for that.  As a postdoc I applied for a $50K NSF grant for a text compression contest.  It was rejected, so I started one without funding (which we now have).  The problem is that many people do not believe that text compression is related to AI (even though speech recognition researchers have been evaluating models by perplexity since the early 1990's).
 
-- Matt Mahoney, [EMAIL PROTECTED]


----- Original Message ----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 5:00:47 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

>> You could use Keogh's compression dissimilarity measure to test for inconsistency.
I don't think so.  Take the following strings: "I only used red and yellow paint in the painting", "I painted the rose in my favorite color", "My favorite color is pink", "Orange is created by mixing red and yellow", "Pink is created by mixing red and white".  How is Keogh's measure going to help you with that?
 
The problem is that Keogh's measure is intended for data-mining where you have separate instances, not one big entwined Gordian knot.
 
>> Now if only we had some test to tell which compressors have the best language models...
Huh? By definition, the compressor with the best language model is the one with the highest compression ratio.
 
----- Original Message -----
Sent: Tuesday, August 15, 2006 3:54 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize

You could use Keogh's compression dissimilarity measure to test for inconsistency.
http://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf

  CDM(x,y) = C(xy)/(C(x)+C(y)).

where x and y are strings, and C(x) means the compressed size of x (lossless).  The measure ranges from about 0.5 if x = y to about 1.0 if x and y do not share any information.  Then,

  CDM("it is hot", "it is very warm") < CDM("it is hot", "it is cold").

assuming your compressor uses a good language model.
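For concreteness, a sketch of the computation with zlib standing in for the compressor (a byte-level compressor on strings this short mostly measures literal overlap, so the numbers illustrate the formula rather than the inequality above):

import zlib

def C(s):
    return len(zlib.compress(s.encode()))  # lossless compressed size in bytes

def cdm(x, y):
    return C(x + y) / (C(x) + C(y))

print(cdm("it is hot", "it is hot"))        # identical strings: lowest value
print(cdm("it is hot", "it is very warm"))
print(cdm("it is hot", "it is cold"))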

Now if only we had some test to tell which compressors have the best language models...

 
-- Matt Mahoney, [EMAIL PROTECTED]


