I think a 1 GB corpus is big enough to learn most of this knowledge using
statistical methods.
So we know that "obese" occurs in about 0.001% of all paragraphs, but in
1% of paragraphs containing "fat".
OK. Now try "obese" and "morbidly" or "obese" and "clinically". I suspect
that you are far more likely to statistically end up with "obese" being some
form of disease (that being the context where it is normally used) than to
end up with it as "fat". Statistical methods get absolutely trashed when you
start switching contexts unless they can tell (or, more likely, are told)
that you've switched contexts. They are great at pulling context-specific
clusters out of specific contexts, but unless you get cross-context
explanatory data (that you'll probably interpret with "other than
statistical methods" -- see next section), I don't believe that statistical
methods will recognize "obese" and "fat" as synonyms.
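A toy sketch of that context-dependence (the two mini-corpora below are
invented for illustration): build a window-based context vector for "obese"
in each register and see which neighbor it clusters with.

```python
from collections import Counter

# Two invented mini-corpora in different registers.
medical = ("patients diagnosed as morbidly obese require treatment . "
           "clinically obese patients require monitored treatment .").split()
casual = ("he got really fat last year . "
          "she got really obese last year .").split()

def context_vector(word, corpus, window=2):
    """Counts of words appearing within +/- window positions of word."""
    v = Counter()
    for k, w in enumerate(corpus):
        if w == word:
            v.update(corpus[max(0, k - window):k] + corpus[k + 1:k + window + 1])
    return v

def cos(a, b):
    shared = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = lambda c: sum(x * x for x in c.values()) ** 0.5
    return shared / (norm(a) * norm(b)) if a and b else 0.0

# In the medical register "obese" sits near diagnosis vocabulary; in the
# casual register its contexts are identical to those of "fat".
print(cos(context_vector("obese", medical), context_vector("diagnosed", medical)))
print(cos(context_vector("obese", casual), context_vector("fat", casual)))
```

Same word, two corpora, two different nearest neighbors -- which cluster
"obese" lands in depends entirely on the context mix it was trained on.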
Likewise, syntax is learnable. For example, if you encounter "the X is"
you know that X is a noun, so you can predict "a X was" or "Xs" rather
than "he X" or "Xed". This type of knowledge can be exploited using
similarity modeling [3] to improve word perplexity.
Let me give one more example using the same learning mechanism by which
syntax is learned:
All men are mortal. Socrates is a man. Therefore Socrates is mortal.
All insects have 6 legs. Ants are insects. Therefore ants have 6 legs.
Now predict: All frogs are green. Kermit is a frog. Therefore...
This isn't a statistical method (see "other than statistical methods" above
:-).
= = = = =
So -- No, I *don't* believe that the 1GB corpus is big enough to learn most
of this knowledge *USING STATISTICAL METHODS*. I *do* believe that it is
large enough for other methods though.
----- Original Message -----
From: "Matt Mahoney" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, August 28, 2006 3:37 PM
Subject: Re: [agi] Lossy *&* lossless compressi
On 8/28/06, Mark Waser wrote:
How does a lossless model observe that "Jim is extremely fat" and "James
continues to be morbidly obese" are approximately equal?
I realize this is far beyond the capabilities of current data compression
programs, which typically predict the next byte in the context of the last
few bytes using learned statistics. Of course we must do better. The
model has to either know, or be able to learn, the relationships between
"Jim" and "James", "is" and "continues to be", "fat" and "obese", etc. I
think a 1 GB corpus is big enough to learn most of this knowledge using
statistical methods.
C:\res\data\wiki>grep -c . enwik9
File enwik9:
10920493 lines match
enwik9: grep: input lines truncated - result questionable
C:\res\data\wiki>grep -i -c " fat " enwik9
File enwik9:
1312 lines match
enwik9: grep: input lines truncated - result questionable
C:\res\data\wiki>grep -i -c " obese " enwik9
File enwik9:
111 lines match
enwik9: grep: input lines truncated - result questionable
C:\res\data\wiki>grep -i " obese " enwik9 |grep -c " fat "
File STDIN:
14 lines match
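For reference, the percentages quoted below follow directly from those grep
counts (plain arithmetic, sketched in Python):

```python
# Line counts from the grep session above (enwik9, one paragraph per line).
total = 10920493   # all paragraphs
fat = 1312         # paragraphs containing " fat "
obese = 111        # paragraphs containing " obese "
both = 14          # paragraphs containing both

p_obese = obese / total          # ~0.001% of all paragraphs
p_obese_given_fat = both / fat   # ~1% of paragraphs containing "fat"

print(f"P(obese)     = {p_obese:.5%}")
print(f"P(obese|fat) = {p_obese_given_fat:.2%}")
```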
So we know that "obese" occurs in about 0.001% of all paragraphs, but in
1% of paragraphs containing "fat". This is an example of a distant bigram
model, which has been shown to improve word perplexity in offline models
[1]. We can improve on this method using e.g. latent semantic analysis
[2] to exploit the transitive property of semantics: if A appears near
(means) B and B appears near C, then A predicts C.
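The transitive step can be sketched with a rank-reduced co-occurrence matrix
(the four words and their counts below are invented, and numpy's SVD stands
in for a full LSA pipeline): "fat" and "obese" never co-occur directly, but
both co-occur with "weight", and the low-rank space merges them.

```python
import numpy as np

# Invented word-word co-occurrence counts. "fat" and "obese" share no
# direct co-occurrence, but both co-occur with "weight".
words = ["fat", "obese", "weight", "frog"]
i = {w: k for k, w in enumerate(words)}
X = np.array([
    [3., 0., 2., 0.],   # fat
    [0., 3., 2., 0.],   # obese
    [2., 2., 3., 0.],   # weight
    [0., 0., 0., 4.],   # frog
])

# Rank-2 truncated SVD: the "latent semantic" space.
U, s, Vt = np.linalg.svd(X)
emb = U[:, :2] * s[:2]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(X[i["fat"]], X[i["obese"]]))      # raw rows: modest overlap
print(cos(emb[i["fat"]], emb[i["obese"]]))  # latent space: near 1
print(cos(emb[i["fat"]], emb[i["frog"]]))   # unrelated word stays far
```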
Likewise, syntax is learnable. For example, if you encounter "the X is"
you know that X is a noun, so you can predict "a X was" or "Xs" rather
than "he X" or "Xed". This type of knowledge can be exploited using
similarity modeling [3] to improve word perplexity. (Thanks to Rob
Freeman for pointing me to this).
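A minimal illustration of the similarity idea (toy corpus invented here,
not the actual model of [3]): represent each word by counts of its
(previous, next) context pairs, and words of the same syntactic class come
out similar -- which is what licenses predicting "a X was" after having
seen only "the X is".

```python
from collections import Counter

# Tiny invented corpus; positional contexts reveal word classes.
corpus = ("the cat is here . the dog is here . a cat was here . "
          "the dog runs . the cat runs . a dog was here .").split()

# Context signature of a word = counts of its (previous, next) word pairs.
def signature(word):
    return Counter((corpus[k - 1], corpus[k + 1])
                   for k in range(1, len(corpus) - 1) if corpus[k] == word)

def cos(a, b):
    shared = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return shared / (norm(a) * norm(b))

# "cat" and "dog" occur in the same frames ("the _ is", "a _ was"), so
# their signatures match; "here" patterns like neither of them.
print(cos(signature("cat"), signature("dog")))   # high
print(cos(signature("cat"), signature("here")))  # low
```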
Let me give one more example using the same learning mechanism by which
syntax is learned:
All men are mortal. Socrates is a man. Therefore Socrates is mortal.
All insects have 6 legs. Ants are insects. Therefore ants have 6 legs.
Now predict: All frogs are green. Kermit is a frog. Therefore...
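To make the point concrete, here is a deliberately dumb sketch (the regex
template is invented for this example, not a real learner): a predictor
that has induced the shared surface template of the first two syllogisms
can complete the third by pure sequence completion, no logic engine
involved.

```python
import re

# The three syllogisms above share one surface template. A predictor that
# has induced that template can emit the conclusion by sequence
# completion alone. (Matching "men" back to "man" is glossed over here;
# a real learner would need some morphology too.)
TEMPLATE = re.compile(r"All (\w+) (are|have) ([\w ]+)\. (\w+) (is a|are) (\w+)\.")

def predict(premises):
    x, v1, y, z, v2, _ = TEMPLATE.search(premises).groups()
    # Singular subject ("is a") needs a singular verb form.
    verb = v1 if v2 == "are" else ("is" if v1 == "are" else "has")
    return f"Therefore {z} {verb} {y}."

print(predict("All men are mortal. Socrates is a man."))    # Therefore Socrates is mortal.
print(predict("All insects have 6 legs. Ants are insects."))
print(predict("All frogs are green. Kermit is a frog."))    # Therefore Kermit is green.
```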
[1] Rosenfeld, Ronald, "A Maximum Entropy Approach to Adaptive Statistical
Language Modeling", Computer Speech and Language, 10, 1996.
[2] Bellegarda, Jerome R., "Speech recognition experiments using
multi-span statistical language models", IEEE Intl. Conf. on Acoustics,
Speech, and Signal Processing, 717-720, 1999.
[3] Dagan, Ido, Lillian Lee, Fernando C. N. Pereira, "Similarity-Based
Models of Word Cooccurrence Probabilities", Machine Learning, 1999.
http://citeseer.ist.psu.edu/dagan99similaritybased.html
-- Matt Mahoney, [EMAIL PROTECTED]
-------
To unsubscribe, change your address, or temporarily deactivate your
subscription,
please go to http://v2.listbox.com/member/[EMAIL PROTECTED]