Arjen,

An approach that requires less list maintenance would be to use more
advanced linguistic processing to distinguish the stop word from the
content word, such as lemmatization rather than stemming.

A commercial offering, Rosette Search Essentials from Basis
<http://www.basistech.com/search-essentials/> (full disclosure: my
employer), uses textual context to disambiguate lemmas, as in the
screenshot below; compare the lemma for token #13 ("van") with the one for
token #25 ("vans"). It is free for development use and can be downloaded
via that link. (I don't read or write Dutch; I took these snippets from
the web.) The work on integrating OpenNLP
<https://issues.apache.org/jira/browse/LUCENE-2899> might also prove
helpful.
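
A rough way to experiment with the same idea using open-source pieces (not
Rosette's approach, just an illustration of using context) is to run a
part-of-speech tagger over the text and treat tokens it flags as proper
nouns differently from ordinary words, e.g. by marking them with Lucene's
KeywordAttribute so the stemmer skips them. A minimal sketch with OpenNLP's
POS tagger, assuming a Dutch model such as nl-pos-maxent.bin from the
OpenNLP model downloads (the exact tag labels depend on the model's tag
set, and the class name is only illustrative):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class DutchPosTagDemo {
  public static void main(String[] args) throws Exception {
    // Dutch maxent POS model, e.g. nl-pos-maxent.bin from the OpenNLP model page.
    InputStream modelIn = new FileInputStream("nl-pos-maxent.bin");
    POSModel model = new POSModel(modelIn);
    modelIn.close();
    POSTaggerME tagger = new POSTaggerME(model);

    // "Vans" the shoe brand vs. "van" the preposition, in one sentence.
    String[] tokens = { "De", "Vans", "van", "mijn", "broer", "zijn", "versleten" };
    String[] tags = tagger.tag(tokens);

    for (int i = 0; i < tokens.length; i++) {
      // Tokens whose tag marks them as proper nouns (check the model's tag set
      // for the exact label) could be given Lucene's KeywordAttribute in a
      // custom TokenFilter so the stemmer never reduces "Vans" to "van".
      System.out.println(tokens[i] + " -> " + tags[i]);
    }
  }
}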

Best,
David Murgatroyd
www.linkedin.com/in/dmurga/ <http://www.linkedin.com/in/dmurga/>

[image: Inline image 1]

On Mon, Jul 7, 2014 at 5:53 PM, Sujit Pal <sujit....@comcast.net> wrote:

> Hi Arjen,
>
> You could also mark a token as "keyword" so the stemmer passes it through
> unchanged. For example, per the Javadocs for PorterStemFilter:
>
> http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
>
> Note: This filter is aware of the KeywordAttribute
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true>.
> To prevent certain terms from being passed to the stemmer,
> KeywordAttribute.isKeyword()
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true#isKeyword()>
> should be set to true in a previous TokenStream
> <http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/TokenStream.html?is-external=true>.
>
> Note: For including the original term as well as the stemmed version, see
> KeywordRepeatFilterFactory
> <http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html>
>
> Assuming your stemmer is also keyword-attribute aware, you could build a
> filter that reads a list of words (such as "vans") that should be protected
> from stemming, marks them with the KeywordAttribute, and sits in your
> analysis chain just before the Porter stemmer.
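
For the list-based route, Lucene 4.6 already ships a filter that does this
marking, SetKeywordMarkerFilter, so you may not need to write one. A
minimal sketch of such a chain, assuming the Dutch snowball stemmer from
the original question and a tiny hard-coded protected-word set (the
analyzer name and the set contents are only illustrative):

import java.io.Reader;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class ProtectedDutchAnalyzer extends Analyzer {

  // Hypothetical protected-word list; in practice you would load it from a file.
  private static final CharArraySet PROTECTED =
      new CharArraySet(Version.LUCENE_46, Arrays.asList("vans"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    // Marks listed tokens with the KeywordAttribute so the stemmer skips them.
    result = new SetKeywordMarkerFilter(result, PROTECTED);
    // SnowballFilter is KeywordAttribute-aware and leaves marked tokens alone.
    result = new SnowballFilter(result, "Dutch");
    return new TokenStreamComponents(source, result);
  }
}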
>
> -sujit
>
>
> On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao <tm...@me.com> wrote:
>
> > I think emitting two tokens for "vans" is the right (potentially only)
> > way to do it. You could also control the dictionary of terms that
> > require this special treatment.
> >
> > Is there any reason you're not happy with this approach?
> >
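
For anyone reading along, the two-token approach is exactly what
KeywordRepeatFilter plus RemoveDuplicatesTokenFilter provide in Lucene 4.6.
A minimal sketch of that chain, again assuming the Dutch snowball stemmer
from the original question (the analyzer name is only illustrative):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class RepeatAndStemAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    // Emits every token twice, once marked as a keyword and once not.
    result = new KeywordRepeatFilter(result);
    // The stemmer skips the keyword-marked copy, so "vans" survives next to "van".
    result = new SnowballFilter(result, "Dutch");
    // Drops the duplicate wherever stemming did not change the token.
    result = new RemoveDuplicatesTokenFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
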
> > On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden <
> > acmmail...@tweakers.net> wrote:
> >
> > Hello list,
> >
> > We have a fairly large Lucene database for a 30+ million post forum.
> > Users post and search for all kinds of things. To make sure users don't
> > have to type exact matches, we combine a WordDelimiterFilter with a
> > (Dutch) SnowballFilter.
> >
> > Unfortunately, users sometimes find examples of words that get stemmed to
> > a word that's basically a stop word. Or, conversely, cases where a very
> > common word is stemmed so that it becomes the same as a rare word.
> >
> > We do index stop words, so theoretically users could still find their
> > result. But when a rare word is stemmed in such a way that it yields a
> > million hits, the results become pretty much unusable...
> >
> > One example is the Dutch word 'van', which is the equivalent of 'of' in
> > English. A user tried to search for the shoe brand 'vans', which gets
> > stemmed to 'van' and obviously gives useless results.
> >
> > I have already noticed the KeywordRepeatFilter (to index/search both 'vans'
> > and 'van') and the StemmerOverrideFilter (to try to prevent these cases).
> > Are there any other solutions for these kinds of problems?
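
For reference, wiring up the StemmerOverrideFilter mentioned here is fairly
small in Lucene 4.6: build a StemmerOverrideMap that pins the output for
the handful of problem words and put it in front of the stemmer. A minimal
sketch with just 'vans' pinned (the class name and the single entry are
only illustrative):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter.StemmerOverrideMap;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class OverrideDutchAnalyzer extends Analyzer {

  private final StemmerOverrideMap overrides;

  public OverrideDutchAnalyzer() throws IOException {
    // true = ignore case when matching the override entries.
    StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true);
    builder.add("vans", "vans"); // pin "vans" to itself instead of letting it stem to "van"
    overrides = builder.build();
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
    // Overridden tokens get the forced form and are marked as keywords,
    // so the KeywordAttribute-aware stemmer that follows leaves them alone.
    result = new StemmerOverrideFilter(result, overrides);
    result = new SnowballFilter(result, "Dutch");
    return new TokenStreamComponents(source, result);
  }
}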
> >
> > Best regards,
> >
> > Arjen van der Meijden
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
