Re: How to handle words that stem to stop words

Tri Cao Mon, 07 Jul 2014 14:07:31 -0700

I think emitting two tokens for "vans" is the right (potentially the only)
way to do it. You could also control the dictionary of terms that require
this special treatment.


Is there any reason you are not happy with this approach?
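
To make the suggestion concrete, here is a minimal sketch in plain Java of the "two tokens" idea, restricted to a controlled dictionary. It does not use the Lucene API: `DualTokenSketch`, `analyze`, `keepOriginal` and `toyStem` are hypothetical names chosen for illustration, and `toyStem` is a toy stand-in, not the Dutch Snowball stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

class DualTokenSketch {

    // Toy stand-in for a real stemmer (assumption): strips one trailing "s"
    // from terms longer than three characters. Not the Snowball algorithm.
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // For every token, emit its stem. If the token is listed in keepOriginal
    // and stemming changes it, emit the unstemmed form first as well, so an
    // exact search for "vans" can still match the original term.
    static List<String> analyze(List<String> tokens,
                                Set<String> keepOriginal,
                                UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String stem = stemmer.apply(token);
            if (keepOriginal.contains(token) && !stem.equals(token)) {
                out.add(token); // original form, only for dictionary terms
            }
            out.add(stem);
        }
        return out;
    }
}
```

With `keepOriginal` containing only "vans", the input tokens "vans", "schoenen" come out as "vans", "van", "schoenen". In a real analyzer both forms of "vans" would normally share one position, which this sketch does not model.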

On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden <acmmail...@tweakers.net> 
wrote:

Hello list,

We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter.

Unfortunately, users sometimes find examples of words that get stemmed to a word that is basically a stop word. Or, conversely, a very common word is stemmed so that it becomes the same as a rare word.

We do index stop words, so theoretically they could still find their result. But when a rare word is stemmed in such a way that it yields a million hits, the results become unusable.

One example is the Dutch word 'van', which is the equivalent of 'of' in English. A user tried to search for the shoe brand 'vans', which gets stemmed to 'van' and obviously gives useless results.

I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' and 'van', and the StemmerOverrideFilter to try to prevent these cases. Are there any other solutions for these kinds of problems?
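
For comparison, the override approach mentioned above can be sketched in plain Java as a dictionary that is consulted before the stemmer runs. This is only an illustration of the idea, not the Lucene StemmerOverrideFilter API: `StemOverrideSketch`, `stem`, `overrides` and `toyStem` are hypothetical names, and `toyStem` is a toy stand-in for the real Dutch stemmer.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

class StemOverrideSketch {

    // Toy stand-in for a real stemmer (assumption): strips one trailing "s"
    // from terms longer than three characters.
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Return the forced stem when the term has an override entry; otherwise
    // fall back to the regular stemmer.
    static String stem(String term,
                       Map<String, String> overrides,
                       UnaryOperator<String> stemmer) {
        String forced = overrides.get(term);
        return forced != null ? forced : stemmer.apply(term);
    }
}
```

Mapping "vans" to itself keeps the brand name intact, while every term without an entry is still stemmed as usual. The trade-off is that each problematic word has to be found and added to the dictionary by hand.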


Best regards,

Arjen van der Meijden

