Arjen,

An approach requiring less list maintenance would be more advanced linguistic processing, such as lemmatization rather than stemming, to distinguish the stop word from the content word.
A commercial offering, Rosette Search Essentials from Basis <http://www.basistech.com/search-essentials/> (full disclosure: my employer), which is free for development use and can be downloaded via that link, uses textual context to disambiguate lemmas as in the screenshot below -- compare the lemma for token #13 (van) vs. token #25 (vans). (I don't read/write Dutch; I took these snippets from the web.)

The work integrating OpenNLP <https://issues.apache.org/jira/browse/LUCENE-2899> might also prove helpful.

Best,
David Murgatroyd
www.linkedin.com/in/dmurga/ <http://www.linkedin.com/in/dmurga/>

[image: Inline image 1]

On Mon, Jul 7, 2014 at 5:53 PM, Sujit Pal <sujit....@comcast.net> wrote:

> Hi Arjen,
>
> You could also mark a token as "keyword" so the stemmer passes it through
> unchanged. For example, per the Javadocs for PorterStemFilter:
>
> http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
>
> Note: This filter is aware of the KeywordAttribute
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true>.
> To prevent certain terms from being passed to the stemmer
> KeywordAttribute.isKeyword()
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true#isKeyword()>
> should be set to true in a previous TokenStream
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/TokenStream.html?is-external=true>.
> Note: For including the original term as well as the stemmed version, see
> KeywordRepeatFilterFactory
> <http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html>
>
> Assuming your stemmer is also keyword attribute aware, you could build a
> filter that reads a list of words (such as "vans") that should be protected
> from stemming and marks them with the KeywordAttribute before sending them
> to the Porter stemmer, and put it into your analysis chain.
>
> -sujit
>
>
> On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao <tm...@me.com> wrote:
>
> > I think emitting two tokens for "vans" is the right (potentially only)
> > way to do it. You could also control the dictionary of terms that
> > require this special treatment.
> >
> > Is there any reason you are not happy with this approach?
> >
> > On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden
> > <acmmail...@tweakers.net> wrote:
> >
> > Hello list,
> >
> > We have a fairly large Lucene database for a 30+ million post forum.
> > Users post and search for all kinds of things. To make sure users don't
> > have to type exact matches, we combine a WordDelimiterFilter with a
> > (Dutch) SnowballFilter.
> >
> > Unfortunately users sometimes find examples of words that get stemmed to
> > a word that's basically a stop word. Or conversely, where a very common
> > word is stemmed so that it becomes the same as a rare word.
> >
> > We do index stop words, so theoretically they could still find their
> > result. But when a rare word is stemmed in such a way that it yields a
> > million hits, the search becomes unusable...
> >
> > One example is the Dutch word 'van', which is the equivalent of 'of' in
> > English. A user tried to search for the shoe brand 'vans', which gets
> > stemmed to 'van' and obviously gives useless results.
> >
> > I already noticed the 'KeywordRepeatFilter' to index/search both 'vans'
> > and 'van' and the StemmerOverrideFilter to try and prevent these cases.
> > Are there any other solutions for these kinds of problems?
> >
> > Best regards,
> >
> > Arjen van der Meijden
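
For illustration, the keyword-marking idea from Sujit's reply can be sketched in plain Java, without Lucene. This is only a toy model under stated assumptions: the Token class stands in for KeywordAttribute, toyStem is a hypothetical stand-in for a real stemmer (it merely strips one trailing "s"), and none of the names below are Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of keyword-aware stemming; an illustration, not Lucene code. */
class ProtectedStemSketch {

    /** A term plus a flag that plays the role of KeywordAttribute.isKeyword(). */
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    /** Hypothetical stand-in for a real stemmer: strips one trailing "s". */
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** Step 1: flag every term that appears in the protected list. */
    static List<Token> markKeywords(List<String> terms, Set<String> protectedTerms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, protectedTerms.contains(term)));
        }
        return out;
    }

    /** Step 2: stem only the tokens that were not flagged as keywords. */
    static List<String> stem(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            out.add(token.keyword ? token.term : toyStem(token.term));
        }
        return out;
    }

    /** Lowercase, split on whitespace, mark protected terms, then stem. */
    static List<String> analyze(String text, Set<String> protectedTerms) {
        String[] parts = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        return stem(markKeywords(Arrays.asList(parts), protectedTerms));
    }

    public static void main(String[] args) {
        Set<String> protectedTerms = new HashSet<>(Arrays.asList("vans"));
        // With protection, "vans" survives: [schoenen, van, vans]
        System.out.println(analyze("schoenen van vans", protectedTerms));
        // Without protection, "vans" collapses into "van": [schoenen, van, van]
        System.out.println(analyze("schoenen van vans", new HashSet<String>()));
    }
}
```

In a real analysis chain the same role would be played by a keyword-marking filter placed ahead of the stemmer, as the PorterStemFilter Javadoc quoted above describes. The protected list still has to be maintained by hand, which is the trade-off against the lemmatization approach.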