Re: Analysis/tokenization of compound words

eks dev Tue, 19 Sep 2006 13:41:43 -0700

I just remembered one minor thing that made our life easier: the recursive loop 
has a primitive stripEndings() method that removes most of the variable endings 
(all these ungs/ungen/...) before looking up in the SuffixTree. This reduces 
your dictionary needs dramatically. I think this is partially done in 
GermanStemmer in Lucene...


Ah, another one: when you strip a suffix, check whether the last char of the 
remaining "stem" is "s" (a magic thing in German) and delete it if it is not 
the only letter... do not ask why, a long-unexplained mystery of the German 
language.
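
A minimal sketch of these two helpers, not the original code (which I have not 
found yet). The ending list and the minimum stem length are assumptions picked 
for illustration; only "ungs" and "ungen" come from the description above.

```python
# Hypothetical ending list; a real one would be longer and tuned to the dictionary.
ENDINGS = ("ungen", "ungs", "ung", "en", "e")


def strip_endings(word, endings=ENDINGS, min_stem=3):
    """Remove one common variable ending, trying the longest ending first."""
    lower = word.lower()
    for ending in sorted(endings, key=len, reverse=True):
        # min_stem (an assumed threshold) keeps very short words intact.
        if lower.endswith(ending) and len(lower) - len(ending) >= min_stem:
            return lower[: -len(ending)]
    return lower


def drop_linking_s(stem):
    """Delete a trailing 's' from what remains after a suffix word was stripped,
    unless the 's' is the only letter."""
    if len(stem) > 1 and stem.endswith("s"):
        return stem[:-1]
    return stem


print(strip_endings("Zeitungen"))   # zeit
print(drop_linking_s("arbeits"))    # arbeit
```

Both functions lowercase or assume lowercased input, so the dictionary would 
need to be lowercased the same way.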

This approach works in 99% of cases, and special linguistic tricks are anyhow 
not so relevant for most search situations. A regular stemmer causes much 
greater distortion than this.

Must find this code somewhere, I probably left something out in these emails


----- Original Message ----
From: eks dev <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words

Hi Otis,
It depends on what you need to do with it. If you need this only as a "kind of 
stemming" for searching documents, the solution is not all that complex. If you 
need linguistically correct splitting, then it gets complicated.

For the first case:
Build a SuffixTree from your dictionary (I hope you have many inflections for 
German words in your dictionary: feminine, masculine, plural, n-ending, 4 
cases..., e.g. Tänzerin, Tänzer). Find the longest suffix of the word that is 
in your dictionary, strip it, and repeat recursively on what remains. It is 
fast.
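
A rough sketch of that recursive splitting. It swaps the SuffixTree for a plain 
Python set to stay short, and the function name, the min_len parameter and the 
tiny dictionary are made up for illustration.

```python
def split_compound(word, dictionary, min_len=3):
    """Split `word` into dictionary words, working from the right end.

    Returns the list of parts, or [word] if no complete split is found.
    """
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # start == 0 is the longest candidate suffix (the whole remainder);
        # larger start values give shorter suffixes.
        for start in range(0, len(rest) - min_len + 1):
            suffix = rest[start:]
            if suffix in dictionary:
                head = split(rest[:start])
                if head is not None:
                    return head + [suffix]
        return None  # the remainder cannot be covered by dictionary words

    parts = split(word)
    return parts if parts else [word]


# Hypothetical mini-dictionary, lowercased.
words = {"ballett", "tänzerin", "tänzer"}
print(split_compound("Balletttänzerin", words))  # ['ballett', 'tänzerin']
```

A set lookup per candidate suffix is enough for a sketch; a suffix tree or a 
trie over reversed words would avoid building every substring.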

If I remember correctly, in Lucene util there is some SuffixTree implementation 
(not really good for large dictionaries).

Things to be aware of: your recall will drop if you use the simple fuzzy 
matching that is normally found.

- "Balletttänzerin" -> "Ballett" "tänzerin", so if a query does not get split 
because of a typo, there is no chance to find it, e.g. 
"Ballettänzerim" -> "Ballettänzerim"

- You need a good dictionary with all inflections (google morphy or something 
like this to help you generate all forms)

- Try to be careful with short prefixes, as they lead to totally wrong 
splitting: "umbau" -> "um" "bau" (changes the meaning, and if you have the 
preposition "um" as a stopword...)
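
One simple guard against the short-prefix problem is to reject any split that 
contains a very short part. This is only a sketch: accept_split and the 
threshold of 3 characters are assumptions, not something we tuned.

```python
def accept_split(word, parts, min_part_len=3):
    """Keep a proposed split only if every part has at least min_part_len chars;
    otherwise fall back to the unsplit word."""
    if all(len(part) >= min_part_len for part in parts):
        return parts
    return [word]


print(accept_split("umbau", ["um", "bau"]))  # ['umbau']
print(accept_split("balletttänzerin", ["ballett", "tänzerin"]))  # ['ballett', 'tänzerin']
```

A length check will also block some legitimate short parts, so a list of 
prefixes that must never be split off could be used instead.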

For better solutions that would cover fuzzy errors, contact Bob Carpenter from 
Alias-I; his SpellChecker can do this rather easily, unfortunately (for us) 
for money (Warning: I have no relation to Bob or Alias-I at all)...

Daniel Naber did some work with German dictionaries as well, if I recall 
correctly; maybe he has something that helps.

Anyhow, if you opt for the first option, I will try to dig something out in our 
archives, we did something similar ages ago ("stemming like" splitting of word 
in German)

Have fun, e.

----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, 19 September, 2006 6:21:55 PM
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I 
took a look at GermanAnalyzer hoping to see how one can deal with that, but it 
turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that 
processes input one character at a time, looking for a word match in the 
dictionary after each processed character.  Then, CompoundWordLikeThis could 
be broken down into multiple tokens/words and returned as a set of tokens at 
the same position.  However, somehow this doesn't strike me as a very smart and 
fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to 
hear about it.
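
To make the idea concrete, here is a rough sketch of that character-by-character 
scan. It models each token as a (term, position_increment) pair, where an 
increment of 0 means "same position as the previous token"; the function name, 
min_len and the dictionary are invented for illustration.

```python
def compound_tokens(word, dictionary, min_len=3):
    """Return (term, position_increment) pairs: the whole word first, then every
    dictionary word found by a left-to-right scan, stacked at the same position."""
    lower = word.lower()
    tokens = [(lower, 1)]  # the original word advances the position by one
    start = 0
    for end in range(1, len(lower) + 1):
        candidate = lower[start:end]
        # Greedy: the first (shortest) dictionary match ends the current part.
        if len(candidate) >= min_len and candidate in dictionary:
            tokens.append((candidate, 0))  # increment 0 = same position
            start = end
    return tokens


words = {"compound", "word", "like", "this"}
print(compound_tokens("CompoundWordLikeThis", words))
```

Because the scan takes the first match it sees, a short dictionary word can cut 
off a longer one that starts at the same place.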

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




