On 8/28/06, Mark Waser wrote:
> How does a lossless model observe that "Jim is extremely fat" and
> "James continues to be morbidly obese" are approximately equal?

I realize this is far beyond the capabilities of current data compression programs, which typically predict the next byte in the context of the last few bytes using learned statistics. Of course we must do better. The model has to either know, or be able to learn, the relationships between "Jim" and "James", "is" and "continues to be", "fat" and "obese", etc. I think a 1 GB corpus is big enough to learn most of this knowledge using statistical methods.

C:\res\data\wiki>grep -c . enwik9
File enwik9: 10920493 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i -c " fat " enwik9
File enwik9: 1312 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i -c " obese " enwik9
File enwik9: 111 lines match
enwik9: grep: input lines truncated - result questionable

C:\res\data\wiki>grep -i " obese " enwik9 | grep -c " fat "
File STDIN: 14 lines match

So we know that "obese" occurs in about 0.001% of all paragraphs, but in about 1% of paragraphs containing "fat". This is an example of a distant bigram model, which has been shown to reduce word perplexity in offline models [1]. We can improve on this method using e.g. latent semantic analysis [2] to exploit the transitive property of semantics: if A appears near B (and so predicts B), and B appears near C, then A predicts C. Likewise, syntax is learnable. For example, if you encounter "the X is", you know that X is a noun, so you can predict "a X was" or "Xs" rather than "he X" or "Xed". This type of knowledge can be exploited using similarity-based modeling [3] to improve word perplexity. (Thanks to Rob Freeman for pointing me to this.)
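For concreteness, here is a short Python sketch of the statistic the grep session above is approximating (the file name enwik9 and the word pair come from that session; treating each line as a "paragraph" and splitting on whitespace are simplifications, so the counts will differ slightly from grep's word matching):

# Crude "distant bigram": does seeing "fat" anywhere on a line
# raise the probability that "obese" also appears on that line?
trigger, target = "fat", "obese"
total = n_trigger = n_target = n_both = 0
with open("enwik9", encoding="utf-8", errors="replace") as f:
    for line in f:
        words = set(line.lower().split())
        total += 1
        has_t = trigger in words
        has_g = target in words
        n_trigger += has_t
        n_target += has_g
        n_both += has_t and has_g

p_target = n_target / total              # P(obese), over all lines
p_cond = n_both / max(n_trigger, 1)      # P(obese | fat on same line)
print("P(%s) = %.6f%%" % (target, 100 * p_target))
print("P(%s | %s) = %.4f%%" % (target, trigger, 100 * p_cond))
print("lift = %.0fx" % (p_cond / max(p_target, 1e-12)))

A compressor would use the lift directly: once "fat" has appeared earlier in the context, it boosts the probability assigned to "obese", which shortens its code.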
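The similarity idea can be sketched the same way (this is not Dagan et al.'s actual estimator, just the underlying intuition, and the toy corpus is mine): represent each word by the contexts it occurs in. Words that share contexts, whether for semantic reasons ("fat"/"obese") or syntactic ones (nouns in "the X is"), come out similar and can substitute for each other when estimating unseen word pairs.

import math
from collections import Counter, defaultdict

corpus = ("jim is extremely fat . james is morbidly obese . "
          "the dog is fat . the cat is obese . the frog is green .").split()

# Context vector for each word: counts of (left neighbor, right neighbor).
contexts = defaultdict(Counter)
for left, word, right in zip(corpus, corpus[1:], corpus[2:]):
    contexts[word][(left, right)] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# "fat" and "obese" share the context ("is", "."), so they score high;
# "dog" and "frog" both occur in "the X is", a noun context;
# "the" shares no contexts with "fat" and scores zero.
print(cosine(contexts["fat"], contexts["obese"]))   # 0.5
print(cosine(contexts["dog"], contexts["frog"]))    # 1.0
print(cosine(contexts["fat"], contexts["the"]))     # 0.0

Transitivity falls out too: if "fat" resembles "obese" and "obese" resembles some third word, contexts seen with one transfer to the others. That is what latent semantic analysis does at scale, with an SVD in place of raw counts.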
Let me give one more example using the same learning mechanism by which syntax is learned:

All men are mortal. Socrates is a man. Therefore Socrates is mortal.
All insects have 6 legs. Ants are insects. Therefore ants have 6 legs.

Now predict: All frogs are green. Kermit is a frog. Therefore...

[1] Rosenfeld, Ronald, "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer Speech and Language, 10, 1996.

[2] Bellegarda, Jerome R., "Speech recognition experiments using multi-span statistical language models", IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 717-720, 1999.

[3] Dagan, Ido, Lillian Lee, and Fernando C. N. Pereira, "Similarity-Based Models of Word Cooccurrence Probabilities", Machine Learning, 1999. http://citeseer.ist.psu.edu/dagan99similaritybased.html

-- Matt Mahoney, [EMAIL PROTECTED]