On 8/28/06, Mark Waser wrote: 
> How does a lossless model observe that "Jim is extremely fat" and "James 
> continues to be morbidly obese" are approximately equal? 
 
I realize this is far beyond the capabilities of current data compression 
programs, which typically predict the next byte in the context of the last few 
bytes using learned statistics.  Of course we must do better.  The model has to 
either know, or be able to learn, the relationships between "Jim" and "James", 
"is" and "continues to be", "fat" and "obese", etc.  I think a 1 GB corpus is 
big enough to learn most of this knowledge using statistical methods. 
 
C:\res\data\wiki>grep -c . enwik9
File enwik9:
10920493 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i -c " fat " enwik9
File enwik9:
1312 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i -c " obese " enwik9
File enwik9:
111 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i " obese " enwik9 | grep -c " fat "
File STDIN:
14 lines match
  
So we know that "obese" occurs in about 0.001% of all paragraphs, but in 1% of 
paragraphs containing "fat".  This is an example of a distant bigram model, 
which has been shown to improve word perplexity in offline models [1].  We can 
improve on this method using e.g. latent semantic analysis [2] to exploit the 
transitive property of semantics: if A appears near (means) B and B appears 
near C, then A predicts C. 
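
To make that concrete, here is a minimal sketch of gathering the long-range 
co-occurrence counts a distant bigram model needs.  It is my own illustration, 
not code from any existing compressor; the function name and the 50-word 
window are arbitrary choices, and it assumes simple whitespace tokenization.

from collections import Counter, defaultdict

def count_distant_bigrams(lines, window=50):
    """Count how often each word pair co-occurs within `window` words."""
    unigram = Counter()
    pair = defaultdict(Counter)          # pair[a][b] = count of b with a nearby
    for line in lines:
        words = line.lower().split()
        for i, w in enumerate(words):
            unigram[w] += 1
            for prev in words[max(0, i - window):i]:
                pair[prev][w] += 1
    return unigram, pair

# Tiny usage example:
uni, pair = count_distant_bigrams(["jim is extremely fat and morbidly obese"])
print(pair["fat"]["obese"])   # -> 1

Plugging in the grep counts above, (14/1312) / (111/10920493) is roughly a 
1000x boost for "obese" when "fat" has been seen recently; that is the kind of 
ratio a predictive model could mix into its next-symbol estimate.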
 
Likewise, syntax is learnable.  For example, if you encounter "the X is", you 
know that X is a noun, so you can predict "a X was" or "Xs" rather than "he X" 
or "Xed".  This type of knowledge can be exploited using similarity-based 
modeling [3] to improve word perplexity.  (Thanks to Rob Freeman for pointing 
me to this.)
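
For what it's worth, here is a rough sketch of the similarity idea.  It is my 
own toy code, not the method of [3]; the one-word context window and cosine 
measure are simplifications.  Words are characterized by the contexts they 
occur in, so a rare word can borrow statistics from its nearest neighbors.

import math
from collections import Counter, defaultdict

def context_vectors(lines, window=1):
    """Map each word to a Counter of the words seen within `window` of it."""
    vecs = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    vecs[w][words[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = context_vectors(["the dog is here", "the cat is here"])
print(cosine(vecs["dog"], vecs["cat"]))   # -> 1.0: identical contexts

Words that appear in the same frames ("the X is", "a X was") end up with 
similar vectors, so predictions about one can be transferred to the other.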

Let me give one more example using the same learning mechanism by which syntax 
is learned:

All men are mortal.  Socrates is a man.  Therefore Socrates is mortal.
All insects have 6 legs.  Ants are insects.  Therefore ants have 6 legs.

Now predict: All frogs are green.  Kermit is a frog.  Therefore...
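
Just to make the shape of that pattern explicit, here is a hand-written 
template matcher (mine, purely illustrative).  A real model would have to 
induce the regularity statistically from examples like the two above, just as 
it induces "the X is", rather than have it written down.

import re

# Toy illustration only: a hard-coded "All X are Y. Z is an X. Therefore"
# template, limited to the "are" form of the syllogism.
TEMPLATE = re.compile(
    r"All (\w+?)s are (.+?)\.\s+(\w+) is an? \1\.\s+Therefore")

def complete(text):
    m = TEMPLATE.search(text)
    if not m:
        return text
    kind, predicate, subject = m.groups()
    return text + " " + subject + " is " + predicate + "."

print(complete("All frogs are green.  Kermit is a frog.  Therefore"))
# -> All frogs are green.  Kermit is a frog.  Therefore Kermit is green.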


[1] Rosenfeld, Ronald, "A Maximum Entropy Approach to Adaptive Statistical 
Language Modeling", Computer Speech and Language, 10, 1996. 
 
[2] Bellegarda, Jerome R., "Speech recognition experiments using multi-span 
statistical language models", Proc. IEEE Intl. Conf. on Acoustics, Speech, and 
Signal Processing, 717-720, 1999. 
 
[3] Dagan, Ido, Lee, Lillian, and Pereira, Fernando C. N., "Similarity-Based 
Models of Word Cooccurrence Probabilities", Machine Learning, 1999. 
http://citeseer.ist.psu.edu/dagan99similaritybased.html 
  
-- Matt Mahoney, [EMAIL PROTECTED] 
 
 

