I am interested in identifying barriers to language modeling and how to 
overcome them.

I have no doubt that probabilistic models such as NARS and Novamente can 
adequately represent human knowledge.  Also, I have no doubt they can learn 
relations such as "all frogs are green" from examples of green frogs.  My 
question relates to solving the language problem: how to convert natural 
language statements like "frogs are green" and equivalent variants into the 
formal internal representation without the need for humans to encode stuff like 
(for all X, frog(X) => green(X)).  This problem is hard because there might not 
be terms that exactly correspond to "frog" or "green", and also because 
interpreting natural language statements is not always straightforward, e.g. "I 
know it was either a frog or a leaf because it was green".

Converting natural language to a formal representation requires language 
modeling at the highest level.  The levels from lowest to highest are: 
phonemes, word segmentation rules, semantics, simple sentences, compound 
sentences.  Whether your child learned to read at age 3 or never learned to 
read at all, children always acquire language in this order.

The state of the art in language modeling is at the level of simple sentences, 
modeling syntax using n-grams (usually trigrams) or hidden Markov models 
generally without recursion (flat), and modeling semantics as word 
associations, possibly generalizing via LSA or clustering to exploit the 
transitive property (if A means B and B means C, then A means C).  This is the 
level of modeling of the top text compressors on the large text benchmark and 
the lowest perplexity models used in speech recognition.  I gave an example of 
a Google translation of English to Arabic and back.  You may have noticed that 
strings of up to about 6 words looked grammatically correct, but that longer 
sequences contained errors.  This is a characteristic of trigram models.  
Shannon noted in 1949 that random sequences generated to match the n-gram 
(letter or word) statistics of English appear correct out to a length of about 
2n symbols.
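To make this concrete, here is a minimal trigram model sketch in Python.  The 
toy corpus and the greedy decoding are my own illustration, not the actual 
benchmark or translation models; the point is only that each word is predicted 
from the two preceding words, which is why output looks locally fluent but has 
no memory beyond that window:

```python
# Minimal word-trigram sketch: each word is predicted from the previous two.
# Toy corpus and greedy decoding are illustrative assumptions, not the
# models used in the benchmarks discussed above.
from collections import defaultdict, Counter

def train_trigrams(tokens):
    """Count trigram continuations: (w1, w2) -> Counter of following words."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

def generate(model, w1, w2, length):
    """Greedy generation: always take the most frequent continuation."""
    out = [w1, w2]
    for _ in range(length):
        ctx = (out[-2], out[-1])
        if ctx not in model:
            break
        out.append(model[ctx].most_common(1)[0][0])
    return out

corpus = "the frog is green and the leaf is green and the frog is small".split()
model = train_trigrams(corpus)
print(generate(model, "the", "frog", 4))
```

Every three-word window in the output is attested in the training data, so 
short stretches look grammatical; errors appear only when coherence beyond two 
words of context is required, matching the roughly six-word limit noted above.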

All of these models have the property that they are trained in the same order 
that children learn language.  For example, parsing sentences without semantics 
is difficult, but extracting semantics without parsing (text search) is easy.  
As a second example, you can build a lexicon from text only if you already 
know the rules for word segmentation.  The reverse does not hold: a lexicon is 
not needed to segment continuous text (text with the spaces removed).  
The segmentation rules can be derived from n-gram statistics, analogous to 
learning the phonological rules for segmenting continuous speech.  This was 
first demonstrated for text by Hutchens and Alder, a result I improved on in 1999:  
http://cs.fit.edu/~mmahoney/dissertation/lex1.html
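As an illustration of boundary discovery from n-gram statistics, here is a 
much simplified toy sketch of the idea, not the algorithm from the cited 
paper: insert a word boundary wherever the letter bigram probability of the 
next character is low, on the theory that within-word transitions are more 
predictable than transitions across word boundaries.  The corpus and threshold 
are invented for the example:

```python
# Toy sketch: derive word boundaries from letter bigram statistics.
# A deliberate simplification of the boundary-detection idea, not the
# algorithm from the cited paper; corpus and threshold are invented.
from collections import defaultdict

def bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def segment(text, counts, threshold=0.5):
    """Insert a space where P(next char | current char) is below threshold."""
    out = [text[0]]
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values())
        p = counts[a][b] / total if total else 0.0
        if p < threshold:
            out.append(" ")
        out.append(b)
    return "".join(out)

# Train on unsegmented text built from the "words" the, cat, dog.
counts = bigram_counts("thecatthecatthecatthedog")
print(segment("thecatthedog", counts))
```

On this toy corpus the low-probability transitions t-t and e-d receive 
boundaries, yielding "thecat the dog", while the more predictable e-c 
transition inside "thecat" is missed; longer n-gram context reduces exactly 
this kind of error.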

Given this observation, it seems that hard coding rules for inheritance, 
equivalence, logical, temporal, and other relations into a knowledge 
representation will not help in learning these relations from text.  The 
language model still 
has to learn these relations from previously learned, simpler concepts.  In 
other words, the model has to learn the meanings of "is", "and", "not", 
"if-then", "all", "before", etc. without any help from the structure of the 
knowledge representation or explicit encoding.  The model first has to learn how 
to convert compound sentences into a formal representation and back, and only 
then can it start using or adding to the knowledge base.

So my question is: what is needed to extend language models to the level of 
compound sentences?  More training data?  Different training data?  A new 
theory of language acquisition?  More hardware?  How much?

-- Matt Mahoney, [EMAIL PROTECTED]


-----
This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/[EMAIL PROTECTED]
