I think a 1 GB corpus is big enough to learn most of this knowledge using statistical methods. So we know that "obese" occurs in about 0.001% of all paragraphs, but in 1% of paragraphs containing "fat".

OK. Now try "obese" and "morbidly", or "obese" and "clinically". I suspect that you are far more likely to statistically end up with "obese" being some form of disease (that being the context where it is normally used) than as a synonym for "fat". Statistical methods get absolutely trashed when you switch contexts unless they can tell (or, more likely, are told) that you've switched contexts. They are great at pulling context-specific clusters out of specific contexts, but unless you get cross-context explanatory data (which you'll probably interpret with something other than statistical methods -- see next section), I don't believe that statistical methods will recognize "obese" and "fat" as synonyms.
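To make the context-switching point concrete, here is a toy sketch (the paragraphs and counts are invented for illustration, not drawn from enwik9): conditioning on which words co-occur with "obese" pulls out the medical cluster far more strongly than the "fat" cluster.

```python
# Toy corpus (hypothetical sentences, for illustration only): "obese"
# co-occurs with medical vocabulary more often than with "fat".
paragraphs = [
    "the patient was clinically obese and at risk of disease",
    "morbidly obese patients require medical treatment",
    "clinically obese adults show elevated disease markers",
    "the cat was fat and lazy",
    "he grew fat on rich food",
    "the fat man was obese by any medical standard",
]

def cooccurrence_rate(word, other):
    """Fraction of paragraphs containing `word` that also contain `other`."""
    with_word = [p for p in paragraphs if word in p.split()]
    if not with_word:
        return 0.0
    return sum(other in p.split() for p in with_word) / len(with_word)

print(cooccurrence_rate("obese", "clinically"))  # 0.5: the medical context dominates
print(cooccurrence_rate("obese", "fat"))         # 0.25: direct co-occurrence is rarer
```

On a realistic corpus the gap is far wider, which is exactly the problem: the statistics land "obese" in the disease cluster.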

Likewise, syntax is learnable. For example, if you encounter "the X is" you know that X is a noun, so you can predict "a X was" or "Xs" rather than "he X" or "Xed". This type of knowledge can be exploited using similarity modeling [3] to improve word perplexity. Let me give one more example using the same learning mechanism by which syntax is learned:
All men are mortal.  Socrates is a man.  Therefore Socrates is mortal.
All insects have 6 legs.  Ants are insects.  Therefore ants have 6 legs.
Now predict: All frogs are green.  Kermit is a frog.  Therefore...

This isn't a statistical method (see "other than statistical methods" above :-).
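The syllogism completion above can be sketched as template induction and variable binding rather than as counting -- a minimal, hand-shaped illustration (the regex and agreement rule are my own, not from the thread):

```python
import re

# Induce-and-apply sketch: the template "All X <verb> Y. Z <link> X. Therefore
# Z <verb'> Y." is matched and completed by binding variables, not by statistics.
PATTERN = re.compile(
    r"All (\w+) (are|have) ([\w\s]+?)\.\s+(\w+) (is a|are) \w+\.\s+Therefore"
)

def complete(premises):
    m = PATTERN.match(premises)
    if not m:
        return None
    _, verb, y, z, link = m.groups()
    singular = (link == "is a")
    # Crude number agreement for the conclusion's verb.
    if verb == "are":
        cop = "is" if singular else "are"
    else:  # "have"
        cop = "has" if singular else "have"
    return f"Therefore {z} {cop} {y}."

print(complete("All frogs are green.  Kermit is a frog.  Therefore"))
# Therefore Kermit is green.
```

The point is that once the template is bound, the conclusion follows for any substitution -- no frequency counts involved.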

= = = = =

So -- No, I *don't* believe that the 1GB corpus is big enough to learn most of this knowledge *USING STATISTICAL METHODS*. I *do* believe that it is large enough for other methods though.


----- Original Message ----- From: "Matt Mahoney" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, August 28, 2006 3:37 PM
Subject: Re: [agi] Lossy *&* lossless compression


On 8/28/06, Mark Waser  wrote:
How does a lossless model observe that "Jim is  extremely fat" and "James
continues to be morbidly obese" are approximately  equal?

I realize this is far beyond the capabilities of current data compression programs, which typically predict the next byte in the context of the last few bytes using learned statistics. Of course we must do better. The model has to either know, or be able to learn, the relationships between "Jim" and "James", "is" and "continues to be", "fat" and "obese", etc. I think a 1 GB corpus is big enough to learn most of this knowledge using statistical methods.

C:\res\data\wiki>grep -c . enwik9
File enwik9:
10920493 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i -c " fat " enwik9
File enwik9:
1312 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i -c " obese " enwik9
File enwik9:
111 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i " obese " enwik9 |grep -c " fat "
File STDIN:
14 lines match

So we know that "obese" occurs in about 0.001% of all paragraphs, but in 1% of paragraphs containing "fat". This is an example of a distant bigram model, which has been shown to improve word perplexity in offline models [1]. We can improve on this method using e.g. latent semantic analysis [2] to exploit the transitive property of semantics: if A appears near (means) B and B appears near C, then A predicts C.
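The transitive step can be sketched with co-occurrence vectors and cosine similarity (the vectors and counts below are invented for illustration; real LSA would factor a large term-document matrix):

```python
import math

# Hypothetical co-occurrence vectors: each word maps context words to counts.
# "fat" and "obese" rarely co-occur directly, but both co-occur with "weight"
# and "diet", so their vectors are similar -- the transitive property above.
vectors = {
    "fat":   {"weight": 8, "diet": 5, "food": 6},
    "obese": {"weight": 7, "diet": 6, "medical": 4},
    "green": {"color": 9, "frog": 3},
}

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

print(cosine(vectors["fat"], vectors["obese"]))  # high: shared contexts
print(cosine(vectors["fat"], vectors["green"]))  # 0.0: no shared contexts
```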

Likewise, syntax is learnable. For example, if you encounter "the X is" you know that X is a noun, so you can predict "a X was" or "Xs" rather than "he X" or "Xed". This type of knowledge can be exploited using similarity modeling [3] to improve word perplexity. (Thanks to Rob Freeman for pointing me to this).
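The similarity idea can be sketched on a toy corpus (sentences invented for illustration): words that share one syntactic frame are assumed to share the others, so "cat" inherits frames only ever seen with "dog".

```python
from collections import defaultdict

corpus = ["the dog is loud", "a dog was here", "the dog runs", "the cat is small"]

# word -> set of (left neighbor, right neighbor) frames it was seen in
frames = defaultdict(set)
for sent in corpus:
    words = sent.split()
    for i in range(1, len(words) - 1):
        frames[words[i]].add((words[i - 1], words[i + 1]))

def predict_frames(word):
    """Frames seen with any word sharing at least one frame with `word`."""
    shared = {w for w in frames if frames[w] & frames[word]}
    return set().union(*(frames[w] for w in shared))

# "a cat was" never occurs, but it is predicted because "cat" and "dog"
# share the ("the", "is") frame.
print(("a", "was") in predict_frames("cat"))  # True
```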

Let me give one more example using the same learning mechanism by which syntax is learned:

All men are mortal.  Socrates is a man.  Therefore Socrates is mortal.
All insects have 6 legs.  Ants are insects.  Therefore ants have 6 legs.

Now predict: All frogs are green.  Kermit is a frog.  Therefore...


[1] Rosenfeld, Ronald, "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer, Speech and Language, 10, 1996.

[2] Bellegarda, Jerome R., "Speech recognition experiments using multi-span statistical language models", IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 717-720, 1999.

[3] Dagan, Ido, Lee, Lillian, and Pereira, Fernando C. N., "Similarity-Based Models of Word Cooccurrence Probabilities", Machine Learning, 34, 1999. http://citeseer.ist.psu.edu/dagan99similaritybased.html

-- Matt Mahoney, [EMAIL PROTECTED]




-------
To unsubscribe, change your address, or temporarily deactivate your subscription,
please go to http://v2.listbox.com/member/[EMAIL PROTECTED]


