I am interested in identifying barriers to language modeling and how to overcome them.
I have no doubt that probabilistic models such as NARS and Novamente can adequately represent human knowledge, and I have no doubt they can learn relations such as "all frogs are green" from examples of green frogs. My question concerns the language problem: how to convert natural language statements like "frogs are green", and their many equivalent variants, into the formal internal representation without requiring humans to encode expressions like (for all X, frog(X) => green(X)). This problem is hard because there may be no internal terms that correspond exactly to "frog" or "green", and because interpreting natural language statements is not always straightforward, e.g. "I know it was either a frog or a leaf because it was green".

Converting natural language to a formal representation requires language modeling at the highest level. The levels, from lowest to highest, are: phonemes, word segmentation rules, semantics, simple sentences, compound sentences. Children always learn language in this order, regardless of whether they learned to read at age 3 or not at all.

The state of the art in language modeling is at the level of simple sentences: syntax is modeled with n-grams (usually trigrams) or hidden Markov models, generally without recursion (flat), and semantics is modeled as word associations, possibly generalized via LSA or clustering to exploit transitivity (if A means B and B means C, then A means C). This is the level of modeling of the top text compressors on the Large Text Benchmark and of the lowest-perplexity models used in speech recognition. I gave an example of a Google translation of English to Arabic and back. You may have noticed that strings of up to about 6 words looked grammatically correct, but that longer sequences contained errors. This is characteristic of trigram models: Shannon noted in 1949 that random sequences generated to fit the n-gram (letter or word) statistics of English appear correct over spans of about 2n letters or words.
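To make the trigram behavior concrete, here is a toy sketch of my own (not the code of any compressor or recognizer mentioned above; the corpus and function names are invented for illustration). It counts trigrams from a tiny corpus and generates greedily by always taking the most frequent continuation. Note how the output is locally grammatical over short spans but drifts into a loop, with no global coherence:

```python
from collections import defaultdict

def train_trigrams(tokens):
    # counts[(w1, w2)][w3] = number of times w3 followed the pair (w1, w2)
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - 2):
        counts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return counts

def generate(counts, w1, w2, n=8):
    # extend the two-word seed by up to n words
    out = [w1, w2]
    for _ in range(n):
        nxt = counts.get((out[-2], out[-1]))
        if not nxt:
            break  # unseen context: a real model would need smoothing here
        # deterministic for the demo: take the most frequent continuation
        out.append(max(nxt, key=nxt.get))
    return out

corpus = "all frogs are green and all leaves are green too".split()
model = train_trigrams(corpus)
print(" ".join(generate(model, "all", "frogs")))
# prints: all frogs are green and all leaves are green and
```

Each word depends only on the previous two, so any window of three words is something seen in training, yet nothing constrains the sequence beyond that window. This is the mechanism behind the roughly-2n-span illusion of correctness.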
All of these models share the property that they are trained in the same order in which children learn language. For example, parsing sentences without semantics is difficult, but extracting semantics without parsing (text search) is easy. As a second example, you can build a lexicon from text only if you know the rules for word segmentation, but the reverse dependency does not hold: a lexicon is not needed to segment continuous text (text with the spaces removed). The segmentation rules can be derived from n-gram statistics, analogous to learning the phonological rules for segmenting continuous speech. This was first demonstrated in text by Hutchens and Alder, and I improved on their result in 1999. http://cs.fit.edu/~mmahoney/dissertation/lex1.html

Given this observation, it seems that hard-coding rules for inheritance, equivalence, logical, temporal, and other relations into a knowledge representation will not help in learning those relations from text. The language model still has to learn them from previously learned, simpler concepts. In other words, the model has to learn the meanings of "is", "and", "not", "if-then", "all", "before", etc. without any help from the structure of the knowledge representation or from explicit encoding. The model must first learn how to convert compound sentences into a formal representation and back; only then can it start using or adding to the knowledge base.

So my question is: what is needed to extend language models to the level of compound sentences? More training data? Different training data? A new theory of language acquisition? More hardware? How much?

-- Matt Mahoney, [EMAIL PROTECTED]
