I just remembered one minor thing that made our life easier: the recursive loop has a primitive stripEndings() method that removes most of the variable endings (all these ungs/ungen/...) before looking up in the SuffixTree. This reduces your dictionary needs dramatically. I think this is partially done in GermanStemmer in Lucene...
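
A minimal sketch of what such a stripEndings() step might look like. The original code is not shown in this thread, so the class name EndingStripper, the ending list and the minimum stem length are all assumptions made for illustration:

```java
import java.util.Locale;

/** Hypothetical helper; not the original code mentioned in this thread. */
final class EndingStripper {

    // Assumed ending list, ordered longest first so "ungen" wins over "en".
    private static final String[] ENDINGS = {
        "ungen", "ungs", "ung", "en", "er", "es", "e", "n", "s"
    };

    // Assumed guard: never leave a stem shorter than this.
    private static final int MIN_STEM_LENGTH = 3;

    private EndingStripper() {
    }

    /** Lower-cases the word and removes at most one ending from the list. */
    static String stripEndings(String word) {
        String w = word.toLowerCase(Locale.GERMAN);
        for (String ending : ENDINGS) {
            if (w.endsWith(ending) && w.length() - ending.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - ending.length());
            }
        }
        return w; // nothing to strip
    }
}
```

With this list, stripEndings("Wohnungen"), stripEndings("Wohnungs") and stripEndings("Wohnung") all return "wohn", so a single dictionary entry covers the three forms. A real ending list would have to be tuned against your dictionary.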
Ahh, another one: when you strip a suffix, check if the last char of the remaining "stem" is "s" (magic thing in German) and delete it if it is not the only letter... Do not ask why, it is a long unexplained mystery of the German language.

This approach works in 99% of cases, and special linguistic tricks are anyhow not so relevant for most search situations. A regular stemmer causes much greater distortion than this.

I must find this code somewhere, I probably left something out in these emails.

----- Original Message ----
From: eks dev <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words

Hi Otis,

It depends on what you need to do with it. If you need this only as a "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated.

For the first case: build a SuffixTree with your dictionary (hope you have many inflections for German words in your dictionary: feminine, masculine, plural, n-ending, 4 cases..., e.g. Tänzerin, Tänzer). Find the longest suffix that is in your dictionary and recursively strip the word that ends the original word... It is fast. If I remember correctly, there is some SuffixTree implementation in lucene util (not really good for large dictionaries).

Things to be aware of: your recall will drop if you use the simple fuzzy things that are normally found.

- "Balletttänzerin" -> "Ballett" "tänzerin", so if your request does not get split due to typos, there is no chance to find it, e.g. "Ballettänzerim" -> "Ballettänzerim"
- You need a good dictionary with all inflections (google morphy or something like this to help you generate all forms)
- Try to be careful with short prefixes, as this leads to totally wrong splitting: "umbau" -> "um" "bau" (changes meaning, and if you have the preposition "um" as a stopword...)
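
The recipe in this thread (find the longest dictionary suffix, split it off, drop a trailing "s" from the remaining stem, repeat) could be sketched roughly as below. This is only an illustration under stated assumptions: a plain Set stands in for the SuffixTree, and the class name CompoundSplitter and the minPartLength parameter are made up for the example, not taken from the original code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of suffix-based compound splitting. */
final class CompoundSplitter {

    private final Set<String> dictionary; // lower-cased word forms
    private final int minPartLength;      // assumed guard against tiny suffixes

    CompoundSplitter(Set<String> dictionary, int minPartLength) {
        this.dictionary = dictionary;
        this.minPartLength = minPartLength;
    }

    /** Returns the parts of the word in reading order. */
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        splitInto(word.toLowerCase(Locale.GERMAN), parts);
        Collections.reverse(parts); // parts were collected right to left
        return parts;
    }

    private void splitInto(String rest, List<String> parts) {
        if (rest.isEmpty()) {
            return;
        }
        // Smallest start index first, so the longest suffix is tried first.
        for (int start = 0; start + minPartLength <= rest.length(); start++) {
            String suffix = rest.substring(start);
            if (dictionary.contains(suffix)) {
                parts.add(suffix);
                String stem = rest.substring(0, start);
                // The "magic s": drop it unless it is the only letter left.
                if (stem.length() > 1 && stem.endsWith("s")) {
                    stem = stem.substring(0, stem.length() - 1);
                }
                splitInto(stem, parts);
                return;
            }
        }
        parts.add(rest); // no known suffix: keep the remainder unsplit
    }
}
```

With a dictionary containing "arbeit", "zeit" and "bau" and minPartLength set to 3, split("Arbeitszeit") gives [arbeit, zeit], and split("umbau") gives [um, bau], which is exactly the short-prefix pitfall from the list above. A misspelled query such as "Arbeitszeiz" comes back unsplit. Note that the unconditional "s" removal also turns a stem like "haus" into "hau"; the thread does not say how the original code handled such cases.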

For better solutions that would cover fuzzy errors, contact Bob Carpenter from Alias-I; his SpellChecker can do this rather easily, unfortunately (for us) for money. (Warning: I am in no relation to Bob or Alias-I at all)...

Daniel Naber did some work with German dictionaries as well, if I recall correctly; maybe he has something that helps.

Anyhow, if you opt for the first option, I will try to dig something out of our archives; we did something similar ages ago ("stemming like" splitting of words in German).

Have fun,
e.

----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character. Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position. However, somehow this doesn't strike me as a very smart and fast approach.

What are some better approaches? If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
