Linas, Zelalem, Tensae, Hedra, Ruiting, etc.,

This email summarizes some of what I remember from our discussion in
Addis of unsupervised learning of syntax in languages with complex
morphology (e.g. Amharic)...

First point, due to Linas, was that to handle words that consist of
multiple morphemes, we want to

Step 1) split the word into multiple morphemes

Step 2) do our "word-level" grammar-learning process on these
morphemes instead of on the words per se

The complexity noted was that splitting a word into morphemes is
difficult in Amharic and many other languages.  For one thing, an
Amharic word can have multiple prefixes and suffixes, and it can also
have infixes.... It seems supervised learning algorithms currently get
80-something-percent accuracy on Amharic morphological analysis, which
is pretty mediocre for learning from an annotated corpus...

So if we have some uncertainty about the best way to split a word into
morphemes, then we may end up with multiple plausible splits per word,
which can yield a combinatorial explosion.  If one has N words in a
sentence, and each one can be split in M plausible ways, then we have
at worst M^N possible morpheme-level sentences to do our parsing on.
Whoops!
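
To make the blow-up concrete, here is a tiny Python sketch (the
numbers are invented for illustration):

```python
# With N words in a sentence and M plausible morpheme-splits per word,
# there are M^N candidate morpheme-level sentences to parse.

def candidate_count(splits_per_word):
    """Number of distinct morpheme-level sentences, given the number of
    plausible splits for each word in the sentence."""
    total = 1
    for m in splits_per_word:
        total *= m
    return total

# A 10-word sentence with 3 plausible splits per word:
print(candidate_count([3] * 10))  # 3^10 = 59049
```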

Of course, this explosion could be mitigated in various ways, e.g. by
doing the splitting word-by-word incrementally as one proceeds through
the sentence, and probabilistically weighting each possible split....
In this way, for most sentences the explosion would not be so bad; it
would be terrible only for especially ambiguous sentences (which would
mirror human sentence interpretation, I suppose)...
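
For concreteness, here is a minimal beam-search sketch of that
incremental strategy; the split alternatives and their probabilities
below are invented placeholders, not real Amharic analyses:

```python
import heapq

def beam_split(sentence_splits, beam_width=2):
    """Proceed word by word, keeping only the beam_width most probable
    partial split sequences, extending each with the weighted splits
    of the next word.

    sentence_splits: one entry per word, each a list of (split, prob)
    pairs, where split is a tuple of morphemes."""
    beam = [(1.0, [])]  # (probability, morpheme-split sequence so far)
    for word_options in sentence_splits:
        candidates = [
            (p * q, seq + [split])
            for p, seq in beam
            for split, q in word_options
        ]
        # prune: keep only the most probable partial analyses
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam

# Two words, each with two hypothetical splits:
splits = [
    [(("al", "hede"), 0.7), (("alhede",), 0.3)],
    [(("be", "bet"), 0.6), (("bebet",), 0.4)],
]
for prob, seq in beam_split(splits):
    print(prob, seq)
```

With a constant beam width, the work per sentence grows linearly in N
rather than as M^N, at the cost of possibly discarding the globally
best analysis.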

We discussed how to do the splitting using character-level link-parser
dictionary learning, i.e. by applying our language-learning algorithm
at the character level (as a preliminary phase to applying it at the
word level).  Basically, one would apply the grammar-learning
algorithm at a lower level: treating characters as if they were words,
and words as if they were sentences.  One would then get "link parses"
of the words, where the links in the parse of a word span between the
characters of the word.
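
As a toy illustration of the level shift (not the actual link-grammar
learning pipeline), one can estimate a crude character-pair
"attraction" from a word list via pointwise mutual information of
character co-occurrence within words; the word list is invented:

```python
import math
from collections import Counter
from itertools import combinations

def char_pair_attraction(words):
    """Return a PMI-style attraction function over character pairs,
    estimated from co-occurrence of characters within words."""
    n_words = len(words)
    char_counts = Counter()
    pair_counts = Counter()
    for w in words:
        chars = set(w)
        char_counts.update(chars)
        pair_counts.update(frozenset(p) for p in combinations(sorted(chars), 2))

    def pmi(a, b):
        p_ab = pair_counts[frozenset((a, b))] / n_words
        p_a = char_counts[a] / n_words
        p_b = char_counts[b] / n_words
        return math.log(p_ab / (p_a * p_b)) if p_ab else float("-inf")

    return pmi

# Invented word list: 'l' and 'a' always co-occur; 's' and 'e' rarely do.
pmi = char_pair_attraction(["salam", "salamta", "selam", "melam"])
print(pmi("l", "a"), pmi("s", "e"))
```

A real implementation would of course learn directed link types and
disjuncts, not just a symmetric pairwise score.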

One could extract morphemes from these word-parses as follows: a
morpheme would be a subtree of the maximum-attraction spanning tree of
the links between the characters in the word, with the property that
the size-normalized interaction-information (or a similar measure) of
the subtree exceeds that of its sub-subtrees and super-subtrees....
I.e., a morpheme is, in a sense, a "maximally informative" subtree of
the maximum-attraction spanning tree of the characters in the word.
This will obviously need some tuning, but one can use an annotated
corpus of morphemes in a given language to tune the extraction measure.
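
A rough sketch of this criterion, simplified to contiguous substrings
(so it ignores infixes) and using mean pairwise attraction as a
stand-in for size-normalized interaction-information, with a toy
English word and invented attraction weights:

```python
def maximal_spans(word, attraction, min_len=2):
    """Keep substrings whose mean internal attraction strictly beats
    every overlapping sub-span and super-span (ties go to the longer
    span)."""
    def score(i, j):
        # mean attraction over adjacent character pairs in word[i:j]
        pairs = [attraction(word[k], word[k + 1]) for k in range(i, j - 1)]
        return sum(pairs) / len(pairs)

    spans = {(i, j): score(i, j)
             for i in range(len(word))
             for j in range(i + min_len, len(word) + 1)}

    def related(a, b, i, j):
        # (a, b) is a sub-span or super-span of (i, j)
        return (a, b) != (i, j) and ((a <= i and j <= b) or (i <= a and b <= j))

    keep = []
    for (i, j), s in spans.items():
        dominated = any(
            t > s or (t == s and (b - a) > (j - i))
            for (a, b), t in spans.items()
            if related(a, b, i, j)
        )
        if not dominated:
            keep.append(word[i:j])
    return keep

# Invented attraction weights favoring the pieces "un" and "happy":
weights = {"un": 0.9, "ha": 1.0, "ap": 1.0, "pp": 1.0, "py": 1.0}
attr = lambda a, b: weights.get(a + b, 0.1)
print(maximal_spans("unhappy", attr))  # ['un', 'happy']
```

The real criterion would operate on subtrees of the spanning tree
rather than substrings, which is what lets it capture infixes.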

Part of the above process may involve creating nodes for (not
necessarily contiguous, given infixes) *character-sets*  within a
word, and calculating (symmetric or asymmetric) attraction between
these character-sets and other characters or character-sets.   Linas
points out that this is analogous to creating nodes for phrases in a
sentence, in ordinary sentence-level link parsing; and then
calculating probabilities regarding these phrases and their attraction
to other words, phrases, categories, etc.
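
A tiny sketch of such a set-level node: the words below are invented
transliterations, meant only to suggest a Semitic-style consonantal
root shared across forms, and the measure is again a crude PMI
stand-in for the learned attraction:

```python
import math

def set_attraction(words, char_set, ch):
    """PMI-style attraction between a (possibly non-contiguous)
    character-set node and a single character, over a word list."""
    n = len(words)
    p_set = sum(char_set <= set(w) for w in words) / n
    p_ch = sum(ch in w for w in words) / n
    p_both = sum(char_set <= set(w) and ch in w for w in words) / n
    return math.log(p_both / (p_set * p_ch)) if p_both else float("-inf")

# Invented forms sharing the consonant set {s, b, r}:
words = ["sebere", "sebra", "mesber", "tesebere", "gedel", "gdl"]
root = {"s", "b", "r"}
print(set_attraction(words, root, "e"))  # positive: "e" co-occurs with root
print(set_attraction(words, root, "g"))  # -inf: never co-occurs with root
```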

In short, it appears likely that one can handle languages with complex
morphology by applying the same grammar-learning methodology (that we
are now experimenting with on the sentence level) to learn the grammar
of relationships between characters within a word.  There will then be
an added phase of morpheme extraction from these "word-level parses",
which will tell us the most probable places to split a word into
morphemes.  Knowing probability weights for the likely split-points
should quell the combinatorial explosion, and doing parsing in a
forward-going manner like natural reading should quell it even more,
given the nature of real-world sentences...

Just wanted to get the contents of the discussion down in writing
before it got too badly lost in my memory...

-- Ben







-- 
Ben Goertzel, PhD
http://goertzel.org

“Our first mothers and fathers … were endowed with intelligence; they
saw and instantly they could see far … they succeeded in knowing all
that there is in the world. When they looked, instantly they saw all
around them, and they contemplated in turn the arch of heaven and the
round face of the earth. … Great was their wisdom …. They were able to
know all....

But the Creator and the Maker did not hear this with pleasure. … ‘Are
they not by nature simple creatures of our making? Must they also be
gods? … What if they do not reproduce and multiply?’

Then the Heart of Heaven blew mist into their eyes, which clouded
their sight as when a mirror is breathed upon. Their eyes were covered
and they could see only what was close, only that was clear to them.”

— Popol Vuh (holy book of the ancient Mayas)
