>> You
could use Keogh's compression dissimilarity measure to test for
inconsistency.
I don't think so. Take the following strings:
"I only used red and yellow paint in the painting", "I painted the rose in my
favorite color", "My favorite color is pink", "Orange is created by mixing red
and yellow", "Pink is created by mixing red and white". How is Keogh's
measure going to help you with that?
The problem is that Keogh's measure is intended for data mining, where you have separate instances, not one big entwined Gordian knot.
>> Now if
only we had some test to tell which compressors have the best language
models...
Huh? By definition, the compressor with the best
language model is the one with the highest compression ratio.
----- Original Message -----
Sent: Tuesday, August 15, 2006 3:54 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
You could use Keogh's compression dissimilarity measure to test for inconsistency. http://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
CDM(x,y) = C(xy)/(C(x)+C(y)), where x and y are strings and C(x) means the compressed size of x (lossless). The measure ranges from about 0.5 if x = y to about 1.0 if x and y do not share any information.
Then CDM("it is hot", "it is very warm") < CDM("it is hot", "it is cold"), assuming your compressor uses a good language model. Now if only we had some test to tell which compressors have the best language models...
-- Matt Mahoney, [EMAIL PROTECTED]
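For concreteness, a minimal sketch of the CDM calculation above, using Python's zlib as a stand-in lossless compressor (the helper names are just illustrative, and any compressor could be plugged in):

import zlib

def compressed_size(s):
    # Compressed size in bytes, using zlib as a stand-in lossless compressor.
    return len(zlib.compress(s.encode("utf-8"), 9))

def cdm(x, y):
    # Keogh's compression dissimilarity measure: C(xy) / (C(x) + C(y)).
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))

print(cdm("it is hot", "it is very warm"))
print(cdm("it is hot", "it is cold"))

With a byte-oriented compressor like zlib the two values will not separate in the way the example suggests; that ordering would only emerge from a compressor whose model actually captures meaning, which is exactly the point in dispute.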
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 3:22:10 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> Could you please write a test program to objectively test for lossy
text compression using your algorithm?
Writing the test program for the decompressing program is relatively easy. Since the requirement was that the decompressing program be able to recognize when a piece of knowledge is in the corpus, when its negation is in the corpus, when an incorrect substitution has been made, and when a correct substitution has been made -- all you/I would need to do is invent (or obtain -- see two paragraphs down) a reasonably sized set of knowledge pieces to test, put them in a file, feed them to the decompressing program, and automatically grade its answers as to which category each falls into. A reasonably small number of test cases should suffice as long as you don't advertise exactly which test cases are in the final test; once you have competitors generate each other's tests, you can go hog-wild with the number.
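A rough sketch of what that grading harness might look like (the tab-separated test-file format, the category names, and the way the decompressor is queried are all assumptions for illustration, not part of any actual contest):

# Hypothetical grading harness: feed each test statement to a contestant's
# decompressing program and score its classification against an answer key.
# The test-file format ("statement<TAB>expected_category") and the category
# names are assumptions made for illustration only.
import subprocess
import sys

CATEGORIES = {
    "IN_CORPUS",           # the statement appears in the corpus
    "NEGATED",             # its negation appears in the corpus
    "FALSE_SUBSTITUTION",  # an incorrect substitution was made
    "VALID_SUBSTITUTION",  # a broader/narrower (hierarchical) substitution
}

def ask_decompressor(program, statement):
    # Assumes the contestant's program reads one statement on stdin and
    # prints one of the category names above.
    result = subprocess.run([program, "--classify"], input=statement,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def grade(program, test_file):
    correct = total = 0
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            statement, expected = line.rstrip("\n").split("\t")
            answer = ask_decompressor(program, statement)
            correct += (answer == expected and expected in CATEGORIES)
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print("score: {:.1%}".format(grade(sys.argv[1], sys.argv[2])))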
Writing the test program for the compressing
program is also easy but developing the master list of inconsistencies is
going to be a real difficulty -- unless you use the various contenders
themselves to generate various versions of the list. I strongly doubt
that most contenders will get false positives but strongly suspect that
finding all of the inconsistencies will be a major area for improvement as the
systems become more sophisticated.
Note also that minor modifications of any
decompressing program should also be able to create test cases for your
decompressor test. Simply ask it for a random sampling of knowledge, for
the negations of a random sampling of knowledge, for some incorrect
substitutions, and some hierarchical substitutions of each type.
Any *real* contenders should be able to easily
generate the tests for you.
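A sketch of how such test cases could be generated mechanically (the corpus statements, substitution tables, crude negation rule, and category labels are all invented for illustration; a serious generator would query a contender's knowledge base as described above):

# Hypothetical test-case generator: take a few corpus statements and emit
# labeled test items by negating them or substituting terms. Everything
# here (statements, substitution tables, labels) is invented for illustration.
import random

CORPUS = [
    "the rose is red",
    "oranges contain vitamin C",
]
INCORRECT_SUBS = {"red": "yellow", "vitamin C": "vitamin K"}        # falsifies a fact
BROADER_SUBS = {"red": "a warm color", "oranges": "citrus fruits"}  # hierarchical

def negate(s):
    # Crude negation for illustration only; a real generator would parse the sentence.
    return s.replace(" is ", " is not ") if " is " in s else "it is not true that " + s

def substitute(s, table):
    for old, new in table.items():
        if old in s:
            return s.replace(old, new)
    return None

def generate():
    for s in CORPUS:
        yield s, "IN_CORPUS"
        yield negate(s), "NEGATED"
        wrong = substitute(s, INCORRECT_SUBS)
        if wrong is not None:
            yield wrong, "FALSE_SUBSTITUTION"
        broader = substitute(s, BROADER_SUBS)
        if broader is not None:
            yield broader, "VALID_SUBSTITUTION"

items = list(generate())
random.shuffle(items)
for statement, category in items:
    print(statement + "\t" + category)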
>> You
can start by listing all of the inconsistencies in
Wikipedia.
See paragraph 2 above.
>> To
make the test objective, you will either need a function to test whether two
strings are inconsistent or not, or else you need to show that people will
never disagree on this matter.
It is impossible to show that people will never
disagree on a matter.
On the other hand, a knowledge compressor is
going to have to recognize when two pieces of knowledge conflict (i.e. when
two strings parse into knowledge statements that cannot coexist). You
can always have a contender evaluate whether a competitor's
"inconsistencies" are incorrect and then do some examination by hand on a
representative sample where the contender says it can't tell (since,
again, I suspect you'll find few misidentified inconsistencies -- but that finding all of the inconsistencies will always remain subject to improvement).
>> >> Lossy
compression does not imply AI.
>> >> A lossy
text compressor that did the same thing (recall it in paraphrased
fashion) would certainly demonstrate AI.
>> I disagree
that these are inconsistent. Demonstrating and implying are different
things.
I didn't say that they were inconsistent. What I meant to say was:
1. A decompressing program that is able to output all of the compressed file's knowledge in ordinary English would, in your words, "certainly demonstrate AI".
2. Given statement 1, it's not a problem that "lossy compression does not imply AI", since the decompressing program would still "certainly demonstrate AI".
----- Original Message -----
Sent: Tuesday, August 15, 2006 2:23 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Mark,
Could you please write a test program to objectively test for lossy text compression using your algorithm? You can start by listing all of the inconsistencies in Wikipedia. To make the test objective, you will either need a function to test whether two strings are inconsistent or not, or else you need to show that people will never disagree on this matter.
>> Lossy compression does not imply AI.
>> A lossy text compressor that did the same thing (recall it in paraphrased fashion) would certainly demonstrate AI.
I disagree that these are inconsistent. Demonstrating and implying are different things.
-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 12:55:24 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> 1.
The test is subjective.
I disagree. If you have an automated test
with clear criteria like the following, it will be completely
objective:
a) the compressing program must be able to output all inconsistencies in the corpus (in their original string form), AND
b) the decompressing program must be able to do the following when presented with a standard list of test ideas/pieces of knowledge.
FOR EACH IDEA/PIECE OF KNOWLEDGE IN THE TEST WHICH IS NOT IN THE LIST OF INCONSISTENCIES:
- if the knowledge is in the corpus, recognize that it is in the corpus.
- if the negation of the knowledge is in the corpus, recognize that the test knowledge is false according to the corpus.
- if an incorrect substitution has been made to create the test item from an item in the corpus (e.g., red for yellow, ten for twenty, etc.), recognize that the test knowledge is false according to the corpus.
- if a possibly correct (hierarchical) substitution has been made to create the test item from an item in the corpus, recognize either a) that the substitution is covered by the corpus for broader concepts (e.g., testing red for corpus lavender, testing dozens for corpus thirty-seven, etc.) or b) that there is related information in the corpus which the test idea further refines, for narrower substitutions.
>> 2. Lossy compression does not imply
AI.
and two sentences before
>> A lossy text compressor that did the
same thing (recall it in paraphrased fashion) would certainly
demonstrate AI.
Require that the decompressing
program be able to output all of the compressed file's knowledge
in ordinary English. This is a pretty trivial task compared to
everything else.
Mark
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:27 PM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
I realize it is tempting to use lossy text compression as a test for
AI because that is what the human brain does when we read text and recall
it in paraphrased fashion. We remember the ideas and discard details
about the expression of those ideas. A lossy text compressor that
did the same thing would certainly demonstrate AI.
But there are two problems with using lossy compression as a test of AI:
1. The test is subjective.
2. Lossy compression does not imply AI.
Let's
assume we solve the subjectivity problem by having human judges evaluate
whether the decompressed output is "close enough" to the input. We
already do this with lossy image, audio and video compression (without
much consensus).
The second problem remains: ideal lossy
compression does not imply passing the Turing test. For lossless
compression, it can be proven that it does. Let p(s) be the
(unknown) probability that s will be the prefix of a text dialog.
Then a machine that can compute p(s) exactly is able to generate response A to question Q with the distribution p(QA)/p(Q), which is indistinguishable from a human. The same model minimizes the
compressed size, E[log 1/p(s)].
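A toy numeric sketch of the two quantities in this argument (the strings and probabilities below are invented purely to show the arithmetic; a real p(s) would range over all text prefixes):

# Toy illustration: given an exact model p(s) over whole strings, a response A
# to question Q can be sampled with probability p(QA)/p(Q), and the same model
# gives the expected code length E[log2 1/p(s)]. The strings and probabilities
# are invented purely to show the arithmetic.
import math
import random

p = {
    "Q: what color is the sky? A: blue":  0.6,
    "Q: what color is the sky? A: grey":  0.3,
    "Q: what color is the sky? A: green": 0.1,
}

def respond(question):
    # Sample a completion s = QA with probability p(QA) / p(Q).
    completions = {s: q for s, q in p.items() if s.startswith(question)}
    p_q = sum(completions.values())
    r = random.random() * p_q
    for s, q in completions.items():
        r -= q
        if r <= 0:
            return s[len(question):]
    return s[len(question):]  # guard against floating-point rounding

expected_bits = sum(q * math.log2(1 / q) for q in p.values())  # E[log2 1/p(s)]

print(respond("Q: what color is the sky?"))
print("E[log2 1/p(s)] = %.3f bits" % expected_bits)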
This proof does not hold for lossy compression because different lossless models map to identical lossy models. The desired property of a lossy compressor C is that the encodings C(s1) = C(s2) if and only if s1 and s2 have the same meaning (to most people). This code will ideally have length log 1/(p(s1)+p(s2)). But this does not imply that the decompressor knows p(s1) or p(s2). Thus, the decompressor may decompress to s1 or s2 or choose randomly between them. In general, the output distribution will be different from the true distribution p(s1), p(s2), so it will be distinguishable from a human even if the compression ratio is ideal.
-- Matt Mahoney, [EMAIL PROTECTED]
----- Original Message -----
From: Mark Waser <[EMAIL PROTECTED]>
To: agi@v2.listbox.com
Sent: Tuesday, August 15, 2006 9:28:26 AM
Subject: Re: Mahoney/Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
>> I don't see any point in this debate over lossless vs. lossy
compression
Let's see if I can simplify it.
- The stated goal is compressing human
knowledge.
- The exact, same knowledge can always be
expressed in a *VERY* large number of different bit strings
- Not being able to reproduce the exact bit
string is lossy compression when viewed from the bit viewpoint but
can be lossless from the knowledge viewpoint
- Therefore, reproducing the bit string
is an additional requirement above and beyond the stated
goal
- I strongly believe that this additional
requirement will necessitate a *VERY* large amount of additional work
not necessary for the stated goal
- In addition, by information theory, reproducing the exact bit string will require additional information beyond the knowledge contained in it (since numerous different strings can encode the same knowledge)
- Assuming optimal compression, also by information theory, that additional information will add to the compressed size (i.e. lead to a less optimal result); a quick numeric sketch follows below.
So the question is: "Given that bit-level reproduction is harder, not necessary for knowledge compression/intelligence, and doesn't allow for the same degree of compression, why make life tougher when it isn't necessary for your stated purposes and makes your results (i.e. compression) worse?"
----- Original Message -----
Sent: Tuesday, August 15, 2006 12:55 AM
Subject: Re: Sampo: [agi] Marcus Hutter's lossless compression of human knowledge prize
Where will the knowledge to compress text come from? There are 3 possibilities:
1. externally supplied, like the lexical models (dictionaries) for paq8h and WinRK.
2. learned from the input in a separate pass, like xml-wrt|ppmonstr.
3. learned online in one pass, like paq8f and slim.
These all have the same effect on compressed size. In the first case, you increase the size of the decompressor. In the second, you have to append the model you learned from the first pass to the compressed file so it is available to the decompressor. In the third case, compression is poor at the beginning. From the viewpoint of information theory, there is no difference in these three approaches. The penalty is the same.
To improve compression further, you will need to model semantics and/or syntax. No compressor currently does this. I think the reason is that it is not worthwhile unless you have hundreds of megabytes of natural language text. In fact, only the top few compressors even have lexical models. All the rest are byte-oriented n-gram models.
A semantic model would know what words are related, like "star" and "moon". It would learn this by their tendency to appear together. You can build a dictionary of such knowledge from the data set itself or you can build it some other way (such as WordNet) and include it in the decompressor. If you learn it from the input, you could do it in a separate pass (like LSA) or you could do it in one pass (maybe an equivalent neural network) so that you build the model as you compress.
To learn syntax, you can cluster words by similarity of their immediate context. These clusters correspond to parts of speech. For instance, "the X is" tells you that X is a noun. You can model simple grammars as n-grams over their classifications, such as (Art Noun Verb). Again, you can use any of the 3 approaches.
Learning semantics and syntax is a hard problem, but I think you can see it can be done with statistical modeling. The training data you need is in the input itself.
I don't see any point in this debate over lossless vs. lossy compression. You have to solve the language learning problem in either case to improve compression. I think it will be more productive to discuss how this can be done.
-- Matt Mahoney, [EMAIL PROTECTED]
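As a minimal sketch of the co-occurrence idea described above (the toy corpus, window size, and similarity score are arbitrary choices; a usable model would of course be trained on far more text, and clustering words by context for syntax could reuse the same counts):

# Minimal sketch of learning word relatedness from co-occurrence: count which
# words appear near each other within a window, then compare words by the
# cosine similarity of their co-occurrence vectors. The tiny corpus and the
# window size are arbitrary choices for illustration.
import math
import re
from collections import Counter, defaultdict

corpus = ("the star shines at night . the moon shines at night . "
          "the sun rises at dawn . a bright star and a full moon")
words = re.findall(r"[a-z]+", corpus.lower())

WINDOW = 3
cooc = defaultdict(Counter)
for i, w in enumerate(words):
    for j in range(max(0, i - WINDOW), min(len(words), i + WINDOW + 1)):
        if j != i:
            cooc[w][words[j]] += 1

def cosine(a, b):
    va, vb = cooc[a], cooc[b]
    dot = sum(va[k] * vb[k] for k in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print("star~moon :", cosine("star", "moon"))
print("star~dawn :", cosine("star", "dawn"))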