>> Could
you please write a test program to objectively test for lossy text compression
using your algorithm?
Writing the test program for the decompressing
program is relatively easy. Since the requirement was that the
decompressing program be able to recognize when a piece of knowledge is in the
corpus, when its negation is in the corpus, when an incorrect substitution has
been made, and when a correct substitution has been made -- all you/I would need
to do is invent (or obtain -- see two paragraphs down) a
reasonably sized set of knowledge pieces to test, put them in a file, feed them
to the decompressing program, and automatically grade its answers as to which
category each falls into. A reasonably small number of test
cases should suffice as long as you don't advertise exactly which test cases are
in the final test; once you have competitors generating each other's
tests, you can go hog-wild with the number.
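For concreteness, here is a minimal sketch of such a grading harness. Everything in it is an assumption for illustration: the tab-separated test-file format, the four label names, and the idea of a decompressor executable that reads statements on stdin and answers one label per line.

# Hypothetical grading harness for the decompressor test described above.
# Assumed test-file format: one "<statement>\t<expected_label>" per line, where
# the label is one of IN_CORPUS, NEGATED, BAD_SUBSTITUTION, GOOD_SUBSTITUTION.
# Assumed decompressor interface: an executable that reads statements on stdin
# and prints one label per statement on stdout. All of these are illustrative.
import subprocess
import sys

def grade(test_file, decompressor_cmd):
    statements, expected = [], []
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            stmt, label = line.rstrip("\n").split("\t")
            statements.append(stmt)
            expected.append(label)

    # Feed every test statement to the decompressor and collect its answers.
    result = subprocess.run(decompressor_cmd, input="\n".join(statements),
                            capture_output=True, text=True, check=True)
    answers = [a.strip() for a in result.stdout.splitlines()]

    correct = sum(got == want for got, want in zip(answers, expected))
    return correct, len(expected)

if __name__ == "__main__":
    right, total = grade(sys.argv[1], sys.argv[2:])
    print(f"{right}/{total} answers graded correct")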
Writing the test program for the compressing
program is also easy but developing the master list of inconsistencies is going
to be a real difficulty -- unless you use the various contenders themselves to
generate various versions of the list. I strongly doubt that most
contenders will get false positives but strongly suspect that finding all of the
inconsistencies will be a major area for improvement as the systems become more
sophisticated.
Note also that a minimally modified version of any
decompressing program should be able to create test cases for your
decompressor test. Simply ask it for a random sampling of knowledge, for
the negations of a random sampling of knowledge, for some incorrect
substitutions, and for some hierarchical substitutions of each type.
Any *real* contender should be able to generate the tests for you easily.
>> You
can start by listing all of the inconsistencies in
Wikipedia.
See paragraph 2 above.
>> To
make the test objective, you will either need a function to test whether two
strings are inconsistent or not, or else you need to show that people will never
disagree on this matter.
It is impossible to show that people will never
disagree on a matter.
On the other hand, a knowledge compressor is going
to have to recognize when two pieces of knowledge conflict (i.e. when two
strings parse into knowledge statements that cannot coexist). You can
always have a contender evaluate whether a competitor's
"inconsistencies" are incorrect and then do some examination by hand on a
representative sample where the contender says it can't tell (since,
again, I suspect you'll find few misidentified inconsistencies -- but that
finding all of the inconsistencies will always be subject to
improvement).
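As a toy illustration of what one such conflict test could look like, here is a sketch that assumes (purely for illustration) that statements have already been parsed into (subject, relation, value) triples and that the listed relations admit only one value per subject; a real contender would need far richer parsing and reasoning than this.

# Toy conflict check between two parsed knowledge statements.
# Assumes (subject, relation, value) triples and a hand-picked set of
# relations that can only take one value per subject. All names illustrative.
FUNCTIONAL_RELATIONS = {"capital", "birth_year", "atomic_number"}

def conflicts(fact_a, fact_b):
    subj_a, rel_a, val_a = fact_a
    subj_b, rel_b, val_b = fact_b
    return (subj_a == subj_b
            and rel_a == rel_b
            and rel_a in FUNCTIONAL_RELATIONS
            and val_a != val_b)

# Two statements that cannot coexist:
print(conflicts(("Australia", "capital", "Canberra"),
                ("Australia", "capital", "Sydney")))   # True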
>> >> Lossy compression does not imply
AI.
>> >> A lossy text compressor that did
the same thing (recall it in paraphrased fashion) would certainly
demonstrate AI.
>> I disagree
that these are inconsistent. Demonstrating and implying are different
things.
I didn't say that they were inconsistent. What I meant to say was:
1. A decompressing program that is able to output all
of the compressed file's knowledge in ordinary English would, in your
words, "certainly demonstrate AI".
2. Given statement 1, it's not a problem that "lossy compression does not
imply AI", since the decompressing program would still "certainly demonstrate
AI".
----- Original Message -----
Sent: Tuesday, August 15, 2006 2:23 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Mark,

Could you please write a test program to objectively test for lossy text compression
using your algorithm? You can start by listing all of the
inconsistencies in Wikipedia. To make the test objective, you will
either need a function to test whether two strings are inconsistent or not, or
else you need to show that people will never disagree on this matter.
>> Lossy compression does not imply AI.
>> A lossy text compressor that did the same thing (recall it in paraphrased fashion) would certainly demonstrate AI.
I disagree that
these are inconsistent. Demonstrating and implying are different
things.
-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 12:55:24 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> 1.
The test is subjective.
I disagree. If you have an automated test
with clear criteria like the following, it will be completely
objective:
a) the compressing program must be able to output all inconsistencies in the corpus (in their original string form), AND
b) the decompressing program must be able to do the following when presented with a standard list of test ideas/pieces of knowledge.

FOR EACH IDEA/PIECE OF KNOWLEDGE IN THE TEST WHICH IS NOT IN THE LIST OF INCONSISTENCIES (see the sketch after this list):
- if the knowledge is in the corpus, recognize that it is in the corpus.
- if the negation of the knowledge is in the corpus, recognize that the test knowledge is false according to the corpus.
- if an incorrect substitution has been made to create the test item from an item in the corpus (i.e. red for yellow, ten for twenty, etc.), recognize that the test knowledge is false according to the corpus.
- if a possibly correct (hierarchical) substitution has been made to create the test item from an item in the corpus, recognize either a) that the substitution is in the corpus for broader concepts (i.e. testing red against corpus lavender, testing dozens against corpus thirty-seven, etc.) or b) that there is related information in the corpus which the test idea further refines, for narrower substitutions.
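A minimal sketch of that per-item decision, with the corpus modelled naively as a set of normalized sentences plus a hand-built map of broader/narrower terms. Every predicate and label name here is a placeholder that a real contender would replace with its own knowledge representation.

# Sketch of the per-item decision described in the list above. The corpus is
# modelled naively as a set of normalized sentences; broader_of maps a narrower
# term to its broader term (e.g. "lavender" -> "red"); known_bad_swaps lists
# word pairs that change the meaning (e.g. ("red", "yellow")). All illustrative.
def classify(item, corpus_sentences, broader_of, known_bad_swaps):
    # 1. The knowledge itself is in the corpus.
    if item in corpus_sentences:
        return "IN_CORPUS"
    # 2. Its negation is in the corpus (naive string-level test).
    if item.replace(" is ", " is not ") in corpus_sentences:
        return "FALSE_PER_CORPUS"
    # 3/4. Exactly one word was substituted relative to some corpus sentence.
    item_words = item.split()
    for sentence in corpus_sentences:
        words = sentence.split()
        if len(words) != len(item_words):
            continue
        diffs = [(a, b) for a, b in zip(item_words, words) if a != b]
        if len(diffs) == 1:
            test_word, corpus_word = diffs[0]
            if (test_word, corpus_word) in known_bad_swaps:
                return "FALSE_PER_CORPUS"    # incorrect substitution
            if broader_of.get(corpus_word) == test_word:
                return "ENTAILED_BY_CORPUS"  # broader term (test "red" vs corpus "lavender")
            if broader_of.get(test_word) == corpus_word:
                return "REFINES_CORPUS"      # narrower term adds detail
    return "UNKNOWN"

# Example: corpus asserts "the dress is lavender"; the test item uses the broader "red".
print(classify("the dress is red", {"the dress is lavender"},
               {"lavender": "red"}, {("red", "yellow")}))   # ENTAILED_BY_CORPUS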
>> 2. Lossy compression does not imply
AI.
and, two sentences earlier:
>> A lossy text compressor that did the
same thing (recall it in paraphrased fashion) would certainly demonstrate
AI.
Require that the decompressing
program be able to output all of the compressed file's knowledge in
ordinary English. This is a pretty trivial task compared to everything
else.
Mark
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:27 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
I realize it is tempting to use lossy text compression as a test for AI
because that is what the human brain does when we read text and recall it in
paraphrased fashion. We remember the ideas and discard details about
the expression of those ideas. A lossy text compressor that did the
same thing would certainly demonstrate AI.

But there are two problems with using lossy compression as a test of AI:
1. The test is subjective.
2. Lossy compression does not imply AI.
Let's assume we solve the subjectivity problem by having human judges evaluate whether the
decompressed output is "close enough" to the input. We already do this
with lossy image, audio and video compression (without much consensus).

The second problem remains: ideal lossy compression does
not imply passing the Turing test. For lossless compression, it can be
proven that it does. Let p(s) be the (unknown) probability that s will
be the prefix of a text dialog. Then a machine that can compute p(s)
exactly is able to generate response A to question Q with the distribution
p(QA)/p(Q), which is indistinguishable from a human. The same model
minimizes the compressed size, E[log 1/p(s)].
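Restated in symbols (notation mine, not from the original message): if $p(s)$ is the probability that $s$ is a prefix of a dialog, the response distribution is
$$ p(A \mid Q) = \frac{p(QA)}{p(Q)} , $$
which matches the human distribution by assumption, and the same model $p$, fed to an ideal arithmetic coder, attains the minimum expected code length
$$ E\!\left[ \log_2 \frac{1}{p(s)} \right] . $$
So the model required for ideal lossless compression is the same model required to generate human-indistinguishable responses.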
This proof does not hold for lossy compression because different lossless models map to
identical lossy models. The desired property of a lossy compressor
C is that if and only if s1 and s2 have the same meaning (to most people),
then the encodings C(s1) = C(s2). This code will ideally have length
log 1/(p(s1)+p(s2)). But this does not imply that the decompressor
knows p(s1) or p(s2). Thus, the decompressor may decompress to s1 or
s2 or choose randomly between them. In general, the output
distribution will be different from the true distribution p(s1), p(s2), so
it will be distinguishable from human even if the compression ratio is
ideal.

-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 9:28:26 AM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> I
don't see any point in this debate over lossless vs. lossy
compression
Let's see if I can simplify it.
- The stated goal is compressing human knowledge.
- The exact same knowledge can always be expressed in a *VERY* large number of different bit strings.
- Not being able to reproduce the exact bit string is lossy compression when viewed from the bit viewpoint but can be lossless from the knowledge viewpoint.
- Therefore, reproducing the bit string is an additional requirement above and beyond the stated goal.
- I strongly believe that this additional requirement will necessitate a *VERY* large amount of additional work not necessary for the stated goal.
- In addition, by information theory, reproducing the exact bit string will require additional information beyond the knowledge contained in it, since numerous different strings can encode the same knowledge (see the rough worked example below).
- Assuming optimal compression, also by information theory, that additional information will add to the compressed size (i.e. lead to a less optimal result).
So the question is: "Given that bit-level reproduction is harder, not necessary for knowledge
compression/intelligence, and doesn't allow for the same degree of
compression, why make life tougher when it isn't necessary for
your stated purposes and makes your results (i.e. compression)
worse?"
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:55 AM
Subject: Re: Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Where will the knowledge to compress text come from? There are 3 possibilities:
1. externally supplied, like the lexical models (dictionaries) for paq8h and WinRK.
2. learned from the input in a separate pass, like xml-wrt|ppmonstr.
3. learned online in one pass, like paq8f and slim.

These all have the same effect on compressed size. In the first case, you increase the size of the
decompressor. In the second, you have to append the model you
learned from the first pass to the compressed file so it is available to
the decompressor. In the third case, compression is poor at the
beginning. From the viewpoint of information theory, there is no
difference in these three approaches. The penalty is the same.

To improve compression further, you will need to model
semantics and/or syntax. No compressor currently does this. I
think the reason is that it is not worthwhile unless you have hundreds of
megabytes of natural language text. In fact, only the top few
compressors even have lexical models. All the rest are byte oriented
n-gram models.

A semantic model would know what words are related,
like "star" and "moon". It would learn this by their tendency to
appear together. You can build a dictionary of such knowledge from
the data set itself or you can build it some other way (such as Wordnet)
and include it in the decompressor. If you learn it from the input,
you could do it in a separate pass (like LSA) or you could do it in one
pass (maybe an equivalent neural network) so that you build the model as
you compress.

To learn syntax, you can cluster words by similarity
of their immediate context. These clusters correspond to part of
speech. For instance, "the X is" tells you that X is a noun.
You can model simple grammars as n-grams over their classifications, such
as (Art Noun Verb). Again, you can use any of 3 approaches.

Learning semantics and syntax is a hard problem, but I
think you can see it can be done with statistical modeling. The
training data you need is in the input itself.

I don't see any point in this debate over lossless vs. lossy compression. You have
to solve the language learning problem in either case to improve
compression. I think it will be more productive to discuss how this
can be done.
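A toy sketch of the two statistical ideas just described, relatedness from co-occurrence and crude word classes from shared contexts. The miniature corpus, window size, and threshold are invented purely for illustration; a real model would also use something like mutual information to discount frequent function words.

# Toy illustration of the two statistical ideas above: (1) word relatedness
# from co-occurrence within a small window, (2) crude word classes from shared
# immediate contexts. The corpus, window size and threshold are made up;
# a real model would also discount very frequent words like "the".
from collections import Counter, defaultdict
from itertools import combinations

corpus = "we saw the star and the moon rise then the moon and the star set".split()

# (1) Semantic model: count co-occurrence of word pairs within a 4-word window.
WINDOW = 4
cooc = Counter()
for i, w in enumerate(corpus):
    for v in corpus[i + 1:i + WINDOW]:
        if v != w:
            cooc[frozenset((w, v))] += 1
related = [tuple(pair) for pair, n in cooc.items() if n >= 2]
print("co-occurring pairs:", related)          # includes star/moon

# (2) Syntax model: words that share an immediate (previous, next) context
# tend to belong to the same class, e.g. "the X and" suggests X is a noun.
contexts = defaultdict(set)
for prev, w, nxt in zip(corpus, corpus[1:], corpus[2:]):
    contexts[w].add((prev, nxt))
same_class = [(a, b) for a, b in combinations(contexts, 2)
              if contexts[a] & contexts[b]]
print("words sharing a context:", same_class)  # includes ("star", "moon")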
-- Matt Mahoney, [EMAIL PROTECTED]