eks dev wrote:
Depends on what you need to do with it. If you only need this as a kind of stemming for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated.
This is a very good point. Stemming for high recall is much easier than fine-grained linguistic morphology. Often the best solution is a best guess based on linguistic rules, statistical models, or heuristics, combined with weaker substring measures.
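For the high-recall case, the simplest baseline is a dictionary-based decompounder: recursively split a compound wherever a known word part matches. This is a minimal illustrative sketch, not anything from LingPipe; the mini-lexicon and the German example word are made up for the demo.

```python
# Sketch: dictionary-based decompounding for high-recall search indexing.
# The lexicon below is a hypothetical toy; a real one would come from a
# German word list.

def split_compound(word, lexicon, min_part=3):
    """Return all ways to split `word` into known lexicon parts,
    each part at least `min_part` characters, via simple recursion."""
    word = word.lower()
    if not word:
        return [[]]  # one way to split the empty string: no parts
    splits = []
    for i in range(min_part, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for rest in split_compound(word[i:], lexicon, min_part):
                splits.append([head] + rest)
    return splits

lexicon = {"wasser", "kraft", "werk"}  # hypothetical mini-lexicon
print(split_compound("Wasserkraftwerk", lexicon))
# prints [['wasser', 'kraft', 'werk']]
```

For indexing you would typically emit both the whole compound and its parts as tokens, so queries on either form still match.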
For better solutions that would also cover fuzzy errors, contact Bob Carpenter from Alias-i; his SpellChecker can do this rather easily, unfortunately (for us) for money. (Warning: I am in no relation to Bob or Alias-i at all.)
The implementation we have is a simple character-level noisy channel model. We even have a tutorial on how to do this in Chinese: http://www.alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html

As pointed out in another thread, this requires a set of training data consisting of the parts of the German words. And you may need to allow things other than spaces to be dropped in cases of epenthesis (adding a vowel between words). It's also possible to bootstrap directly from raw data, though only for the high-recall stemming case -- you won't get close to the true morphology this way.

Just to clarify, our LingPipe license is a dual royalty-free/commercial license. Our source is downloadable online. The royalty-free license is very much like the GPL, with the added restriction that you have to make public the data over which you run LingPipe.

- Bob Carpenter
Alias-i
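The noisy-channel idea described above can be sketched as a search for the segmentation that minimizes the total cost of its parts, where the channel is allowed to delete the space between parts and optionally insert a linking element (the epenthesis case, e.g. the German "Fugen-s"). This is a toy sketch of the general technique, not LingPipe's actual model or API; the training counts, smoothing, and example word are all made up.

```python
# Sketch of noisy-channel compound splitting: pick the segmentation with
# the lowest total negative-log-probability, allowing an epenthetic
# linker between parts. Counts are hypothetical stand-ins for the
# training data of German word parts mentioned in the thread.
import math

part_counts = {"arbeit": 50, "amt": 20, "haus": 30}  # hypothetical counts
total = sum(part_counts.values())
LINKERS = ("", "s", "es")  # surface material the channel may insert

def part_cost(part):
    # Negative log unigram probability with crude add-0.1 smoothing,
    # so unknown parts are allowed but expensive.
    return -math.log(part_counts.get(part, 0) + 0.1) + math.log(total + 0.1)

def best_split(word, memo=None):
    """Return (cost, parts) for the lowest-cost segmentation of `word`."""
    if memo is None:
        memo = {}
    if word in memo:
        return memo[word]
    if word == "":
        return (0.0, [])
    best = (part_cost(word), [word])  # leaving the word unsplit is allowed
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        for link in LINKERS:
            if rest.startswith(link):
                tail_cost, tail = best_split(rest[len(link):], memo)
                cand = (part_cost(head) + tail_cost, [head] + tail)
                if cand[0] < best[0]:
                    best = cand
    memo[word] = best
    return best

print(best_split("arbeitsamt")[1])
# prints ['arbeit', 'amt']  -- the linking "s" is absorbed by the channel
```

A real system would also model the channel probabilities (how likely each linker is) rather than treating all linkers as free, and would train the part counts from annotated data.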
