Linas, Zelalem, Tensae, Hedra, Ruiting, et al.,

This email summarizes some of what I remember from our discussion in Addis about unsupervised learning of syntax in languages with complex morphology (e.g. Amharic)...
The first point, due to Linas, was that to handle words consisting of multiple morphemes, we want to: Step 1) split the word into its morphemes, then Step 2) run our "word-level" grammar-learning process on these morphemes instead of on the words per se.

The complication noted was that splitting a word into morphemes is difficult in Amharic and many other languages. For one thing, an Amharic word can have multiple prefixes and suffixes; it can also have infixes... It seems supervised learning algorithms currently get 80-something-percent accuracy on Amharic morphology, which is pretty mediocre for learning from an annotated corpus...

So if we have some uncertainty about the best way to split a word into morphemes, we may end up with multiple potential splits per word, which may yield a combinatorial explosion: if a sentence has N words, and each one can be split in M plausible ways, then at worst we have M^N candidate morpheme sequences to do our morpheme-level parsing on. Whoops!

Of course, this explosion could be tamed in various ways, e.g. by doing the splitting word-by-word incrementally as one proceeds through the sentence, and probabilistically weighting each possible split... In this way, for most sentences the explosion would not be so bad; it would be terrible only for especially confusing sentences (which would reflect human sentence interpretation, I suppose)...

We discussed how to do the splitting using character-level link-parser dictionary learning, i.e. by applying our language-learning algorithm at the character level, as a preliminary phase before applying it at the word level. Basically, one would apply the grammar-learning algorithm one level down: treating characters as if they were words, and words as if they were sentences. One would then get "link parses" of the words, where the links in the parse of a word span between characters in the word.
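As a toy illustration of the incremental, probabilistically weighted splitting idea: instead of enumerating all M^N joint segmentations, one can extend partial segmentations word-by-word and keep only the k most probable prefixes (a simple beam search). Everything below — the words, candidate splits, and probabilities — is invented for illustration; a real system would get the split probabilities from the character-level parsing described in this email.

```python
# Sketch (hypothetical data): taming the M^N explosion of candidate
# morpheme splits by weighting each split probabilistically and keeping
# only the top-k partial segmentations as we move through the sentence,
# rather than enumerating every combination.

from itertools import product

# Hypothetical candidate splits per word, each with a probability.
sentence = [
    [(("yi", "hed", "al"), 0.7), (("yihed", "al"), 0.3)],
    [(("bet", "u"), 0.6), (("betu",), 0.4)],
]

# Naive enumeration: M^N combinations (here 2 * 2 = 4).
naive = list(product(*[[split for split, _ in word] for word in sentence]))

def beam_segment(words, k=2):
    """Extend partial segmentations word-by-word, keeping the k best."""
    beam = [((), 1.0)]
    for candidates in words:
        extended = [
            (seg + split, p * q)
            for seg, p in beam
            for split, q in candidates
        ]
        beam = sorted(extended, key=lambda x: -x[1])[:k]
    return beam

best = beam_segment(sentence, k=2)
print(best[0])  # the most probable joint segmentation and its probability
```

With a fixed beam width k, the work per sentence grows linearly in N rather than exponentially, which matches the forward-going, reading-like processing suggested above.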
One could extract morphemes from these word-parses as follows: a morpheme would be a subtree of the maximum-attraction spanning tree of the links between the characters in the word, with the property that the size-normalized interaction information (or a similar measure) of the subtree exceeds that of its sub-subtrees and its super-subtrees... I.e., a morpheme is, in a sense, a "maximally informative" subtree of the maximum-attraction spanning tree of the characters in the word. This will obviously need some tuning, but one can use an annotated corpus of morphemes in a given language to tune the extraction measure.

Part of the above process may involve creating nodes for *character-sets* within a word (not necessarily contiguous, given infixes), and calculating the (symmetric or asymmetric) attraction between these character-sets and other characters or character-sets. Linas points out that this is analogous to creating nodes for phrases in ordinary sentence-level link parsing, and then calculating probabilities regarding those phrases and their attraction to other words, phrases, categories, etc.

In short, it appears likely that one can handle languages with complex morphology by applying the same grammar-learning methodology (which we are now experimenting with at the sentence level) to learn the grammar of relationships between characters within a word. There will then be an added phase of morpheme extraction from these "word-level parses", which will tell us the most probable places to split a word into morphemes. Knowing probability weights for the likely split-points should quell the combinatorial explosion, and doing the parsing in a forward-going manner, like natural reading, should quell it even more, given the nature of real-world sentences...

Just wanted to get the contents of the discussion down in writing before they got too badly lost in my memory...
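To make the "maximally informative subtree" idea concrete, here is a toy sketch for the special case where the character-level parse happens to be a chain, so every subtree of the spanning tree is a contiguous span of links. The score used (total link attraction minus a fixed per-link cost) is only a crude stand-in for the size-normalized interaction information discussed above, and the attraction weights for "unhappily" are invented; a real implementation would compute both from corpus statistics and handle arbitrary tree shapes.

```python
# Toy sketch: extract morphemes as spans of links that are "locally
# maximal" under a size-penalized score -- i.e. spans that score higher
# than every span they contain and every span containing them.
# Weights and the per-link cost LAM are invented for illustration.

word = "unhappily"
# Hypothetical attraction weights for the 8 links between adjacent chars:
# u-n, n-h, h-a, a-p, p-p, p-i, i-l, l-y
w = [3.0, 0.5, 2.4, 2.6, 2.5, 0.4, 2.3, 2.3]
LAM = 2.1  # per-link cost; in practice, tuned on an annotated corpus

def score(i, j):
    """Penalized score of the span of links i..j (inclusive)."""
    return sum(w[i:j + 1]) - LAM * (j - i + 1)

def sub_spans(i, j):
    return [(a, b) for a in range(i, j + 1) for b in range(a, j + 1)
            if (a, b) != (i, j)]

def super_spans(i, j, n):
    return [(a, b) for a in range(0, i + 1) for b in range(j, n)
            if (a, b) != (i, j)]

n = len(w)
morphemes = []
for i in range(n):
    for j in range(i, n):
        s = score(i, j)
        if (all(s > score(a, b) for a, b in sub_spans(i, j)) and
                all(s > score(a, b) for a, b in super_spans(i, j, n))):
            # the span of links i..j covers characters i..j+1
            morphemes.append(word[i:j + 2])

print(morphemes)
```

With these made-up weights, the locally maximal spans come out as "un", "happ", and "ily" — the weak links (n-h and p-i) end up as the split points, which is exactly the information the word-level splitting phase needs.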
-- Ben

--
Ben Goertzel, PhD
http://goertzel.org

“Our first mothers and fathers … were endowed with intelligence; they saw and instantly they could see far … they succeeded in knowing all that there is in the world. When they looked, instantly they saw all around them, and they contemplated in turn the arch of heaven and the round face of the earth. … Great was their wisdom …. They were able to know all.... But the Creator and the Maker did not hear this with pleasure. … ‘Are they not by nature simple creatures of our making? Must they also be gods? … What if they do not reproduce and multiply?’ Then the Heart of Heaven blew mist into their eyes, which clouded their sight as when a mirror is breathed upon. Their eyes were covered and they could see only what was close, only that was clear to them.” — Popol Vuh (holy book of the ancient Mayas)
