eks dev wrote:
Depends on what you need to do with it. If you only need this as a kind of stemming for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated.
This is a very good point. Stemming for high recall is much easier than fine-grained linguistic morphology. Often the best solution is a best guess based on linguistic rules, statistical models, or heuristics, combined with weaker substring measures.
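For the high-recall case, the simplest baseline is a dictionary-based decompounder: recursively split a compound wherever a known word part matches. This is a minimal illustrative sketch, not anything from LingPipe; the mini-lexicon and the German example word are made up for the demo.

```python
# Sketch: dictionary-based decompounding for high-recall search indexing.
# The lexicon below is a hypothetical toy; a real one would come from a
# German word list.

def split_compound(word, lexicon, min_part=3):
    """Return all ways to split `word` into known lexicon parts,
    each part at least `min_part` characters, via simple recursion."""
    word = word.lower()
    if not word:
        return [[]]  # one way to split the empty string: no parts
    splits = []
    for i in range(min_part, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for rest in split_compound(word[i:], lexicon, min_part):
                splits.append([head] + rest)
    return splits

lexicon = {"wasser", "kraft", "werk"}  # hypothetical mini-lexicon
print(split_compound("Wasserkraftwerk", lexicon))
# prints [['wasser', 'kraft', 'werk']]
```

For indexing you would typically emit both the whole compound and its parts as tokens, so queries on either form still match.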
For better solutions that would also cover fuzzy errors, contact Bob Carpenter from Alias-i; his SpellChecker can do this rather easily, unfortunately (for us) for money. (Warning: I am in no relation to Bob or Alias-i at all.)
The implementation we have is a simple character-level noisy channel model. We even have a tutorial on how to do this in Chinese: http://www.alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html

As pointed out in another thread, this requires a set of training data consisting of the parts of the German words. And you may need to allow things other than spaces to be dropped in cases of epenthesis (adding a vowel between words). It's also possible to bootstrap directly from raw data, though only for the high-recall stemming case -- you won't get close to the true morphology this way.

Just to clarify, our LingPipe license is a dual royalty-free/commercial license. Our source is downloadable online. The royalty-free license is very much like the GPL, with the added restriction that you have to make public the data over which you run LingPipe.

- Bob Carpenter
Alias-i
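The noisy-channel idea described above can be sketched as a search for the segmentation that minimizes the total cost of its parts, where the channel is allowed to delete the space between parts and optionally insert a linking element (the epenthesis case, e.g. the German "Fugen-s"). This is a toy sketch of the general technique, not LingPipe's actual model or API; the training counts, smoothing, and example word are all made up.

```python
# Sketch of noisy-channel compound splitting: pick the segmentation with
# the lowest total negative-log-probability, allowing an epenthetic
# linker between parts. Counts are hypothetical stand-ins for the
# training data of German word parts mentioned in the thread.
import math

part_counts = {"arbeit": 50, "amt": 20, "haus": 30}  # hypothetical counts
total = sum(part_counts.values())
LINKERS = ("", "s", "es")  # surface material the channel may insert

def part_cost(part):
    # Negative log unigram probability with crude add-0.1 smoothing,
    # so unknown parts are allowed but expensive.
    return -math.log(part_counts.get(part, 0) + 0.1) + math.log(total + 0.1)

def best_split(word, memo=None):
    """Return (cost, parts) for the lowest-cost segmentation of `word`."""
    if memo is None:
        memo = {}
    if word in memo:
        return memo[word]
    if word == "":
        return (0.0, [])
    best = (part_cost(word), [word])  # leaving the word unsplit is allowed
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        for link in LINKERS:
            if rest.startswith(link):
                tail_cost, tail = best_split(rest[len(link):], memo)
                cand = (part_cost(head) + tail_cost, [head] + tail)
                if cand[0] < best[0]:
                    best = cand
    memo[word] = best
    return best

print(best_split("arbeitsamt")[1])
# prints ['arbeit', 'amt']  -- the linking "s" is absorbed by the channel
```

A real system would also model the channel probabilities (how likely each linker is) rather than treating all linkers as free, and would train the part counts from annotated data.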
