On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote: > > How do people typically analyze/tokenize text with compounds (e.g. > German)? I took a look at GermanAnalyzer hoping to see how one can > deal with that, but it turns out GermanAnalyzer doesn't treat > compounds in any special way at all.
I've been looking close at this, but for Swedish. The major problem in that case is how a composite word generally have totaly diffrent sematincs compared to the parts. Here is a classic school example: "En brun hårig sjuk sköterska" A brown hairy sick care taker "En brunhårig sjuksköterska" A brunette nurse Thus it is not very helpful to index the composite parts by them self. It is really a problem to be handled by a spell checker. So I wrote and posted the Jira issue 626, an adaptive session analysing spell checker. It makes recommendations based on how previous users changed their queries. davinci -> da vinci. heroes iii -> heroes 3. And so on. This strategy does however require quite some user traffic. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]