The argument for lossy over lossless compression as a test for AI seems to be motivated by the fact that humans store memories using lossy compression and cannot do lossless compression at all.  The reason is that lossless compression requires deterministic computation and lossy compression does not.  Since machines can compute deterministically, this distinction is not important for them.

The proof that an ideal language model implies passing the Turing test requires a lossless model.  A lossy model has only partial knowledge of the distribution of strings in natural language dialogs.  Without full knowledge, it cannot reproduce the distribution over equivalent expressions of the same idea, so its output can be recognized as not human even if the compression is ideal.  For example, a lossy compressor might compress all of the following to the same code: "it is hot", "it is quite warm", "it is 107 degrees", "the burning desert sun seared my skin", etc.  This distribution over equivalent (or nearly equivalent) expressions is not uniform.  Humans recognize that some expressions are more common than others, but an ideal lossy compressor cannot regenerate the same distribution.  (If it could, it would be a lossless model.)  For ideal lossy compression it only needs to know the sum of their probabilities, not the individual ones.
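
As a rough numerical sketch of the point (the probabilities below are made up purely for illustration): a lossless model must assign a probability, and therefore a code of about -log2(p) bits, to each wording separately, while a lossy coder that collapses them all into one "hot" code only needs the sum of their probabilities.

  import math

  # Made-up probabilities of equivalent wordings (for illustration only).
  forms = {
      "it is hot": 0.010,
      "it is quite warm": 0.004,
      "it is 107 degrees": 0.0005,
      "the burning desert sun seared my skin": 0.00001,
  }

  # Lossless coding: each wording gets its own code, about -log2(p) bits long.
  for text, p in forms.items():
      print(f"{text!r}: {-math.log2(p):.2f} bits")

  # Lossy coding: all wordings share one code for the idea, so only the
  # sum of their probabilities matters.
  p_idea = sum(forms.values())
  print(f"shared code: {-math.log2(p_idea):.2f} bits")

  # Knowing only p_idea, the lossy coder cannot regenerate the wordings
  # at their original frequencies.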

This example brings up another issue.  Who is to say if two expressions represent the same idea?  The problem itself requires AI.

The proper way to avoid coding equivalent representations in an objective way is to remove all noise (e.g. misspelled words, grammatical errors, arbitrary line breaks) from the data set and put it in a canonical form, so that there is only one way to represent the ideas within.  This would remove any distinction between lossy and lossless compression.  However, it would be a gargantuan task; it would take a lifetime to read 1 GB of text.  But by using Wikipedia, most of this work has already been done.  There are very few spelling or grammar errors thanks to extensive review, the style is fairly uniform, and line breaks occur only at paragraph boundaries.
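
As a very rough sketch of what such canonicalization might look like (these rules are hypothetical and far simpler than real text cleanup; they are not the actual Hutter prize preprocessing):

  import re

  def canonicalize(text):
      # Hypothetical normalization rules, for illustration only.  Spelling and
      # grammar repair, the truly hard part, is not attempted here.
      text = text.replace("\r\n", "\n")              # unify line endings
      text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # join lines broken inside a paragraph
      text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
      text = re.sub(r"\n{2,}", "\n\n", text)         # keep line breaks only between paragraphs
      return text.strip()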

Uncompressed video would be the absolute worst type of test data.  Uncompressed video runs about 10^8 to 10^9 bits per second, while the human brain has a long-term learning rate of around 10 bits per second.  So nearly all of it is noise.  How are you going to remove that prior to compression?
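
The arithmetic, assuming common standard- and high-definition frame sizes just for illustration:

  # Rough uncompressed video bit rates (frame sizes and rates are typical examples).
  sd = 640 * 480 * 24 * 30      # ~2.2e8 bits/s: 640x480, 24 bits/pixel, 30 frames/s
  hd = 1920 * 1080 * 24 * 30    # ~1.5e9 bits/s: 1080p at the same depth and rate
  learning_rate = 10            # rough long-term human learning rate, bits/s

  print(f"SD: {sd:.1e} bits/s, about {sd / learning_rate:.0e}x the learning rate")
  print(f"HD: {hd:.1e} bits/s, about {hd / learning_rate:.0e}x the learning rate")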

There is no objective function for comparing the quality of lossy decompression.  For images we have PSNR, which is computed from the RMS error of the pixel differences between the original and reconstructed images.  But this is a poor measure.  For example, if I increased the brightness of all pixels by 1%, you would not see any difference.  However, if I increased the brightness of just the top half of the image by 1%, the squared error would be cut in half and PSNR would rate it as the better reconstruction, yet there would be an obvious horizontal line across the image.  Any test of lossy quality has to be subjective.
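
A quick numerical check of that example on a synthetic image, using the standard PSNR formula (exact numbers depend on the image, but the ordering does not):

  import numpy as np

  def psnr(original, distorted, peak=255.0):
      mse = np.mean((original - distorted) ** 2)
      return 10 * np.log10(peak ** 2 / mse)

  rng = np.random.default_rng(0)
  img = rng.integers(0, 200, size=(256, 256)).astype(float)  # synthetic grayscale image

  whole = img * 1.01         # every pixel 1% brighter: visually indistinguishable
  half = img.copy()
  half[:128, :] *= 1.01      # only the top half 1% brighter: visible seam

  print(f"whole image +1%: {psnr(img, whole):.1f} dB")
  print(f"top half +1%:    {psnr(img, half):.1f} dB")  # higher ("better") score, worse-looking image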

This is not to say that investigating how humans do lossy compression isn't an important field of study.  I think it is essential to understanding how vision, hearing, and the other senses work and how that data is processed.  We currently do not have good models of how humans decide what to remember and what to discard.

But the Hutter prize is to motivate better language models, not vision or hearing or robotics.  For that task, I think lossless text compression is the right approach.

 -- Matt Mahoney, [EMAIL PROTECTED]


----- Original Message ----
From: boris <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Saturday, August 19, 2006 10:25:58 PM
Subject: [agi] Lossy *&* lossless compression

It's been said that we have to go after lossless compression because there's no way to objectively measure the quality of lossy compression. That makes sense only in the context of dumb indiscriminate transforms conventionally used for compression.

If compression is produced by pattern recognition, we can quantify lossless compression of individual patterns, which is a perfectly objective criterion for selectively *losing* insufficiently compressed patterns. To make Hutter's prize meaningful, it must be awarded for compression of the *best* patterns, rather than of the whole data set. And, of course, linguistic/semantic data is a lousy place to start; it has already been heavily compressed by "algorithms" unknown to any autonomous system. An uncompressed movie would be a far, far better data sample. Also, the real criterion of intelligence is prediction, which is a *projected* compression of future data. The difference is that current compression is time-symmetrical, while prediction obviously isn't.
