On Mon, Jun 9, 2014 at 7:57 PM, Jamie <ja...@mailarchiva.com> wrote:
> Greetings
>
> Our app currently uses language specific analysers (e.g. EnglishAnalyzer,
> GermanAnalyzer, etc.). We need an option to disable stemming. What's the
> recommended way to do this? These analyzers do not include an option to
> disable stemming, only a parameter to specify a list of words for which
> stemming should not apply.
> Furthermore, my understanding is that the StandardAnalyzer is tied to
> English specifically.

I would say that StandardAnalyzer is actually a weird mix. UAX#29 (which
StandardTokenizer implements) has rules that are inconvenient for
analysing English (e.g. it breaks on neither colons nor underscores), so
if you want English-friendly tokenisation you should add extra filters or
customise the analyser itself to work around these shortcomings.
Presumably EnglishAnalyzer already works around these (or if it doesn't,
it should; I don't know, because we don't use it).

> I am trying to avoid having to override each of these analyzers with an option
> to disable stemming. Is there a better alternative?

Rather than using the Analyzer classes, we use TokenizerFactory and
TokenFilterFactory (actually our own alternatives with the same names,
since we're still on an older version of Lucene) and a single Analyzer
class which is configured by passing in the appropriate factories. A
separate abstraction for the analysis language then takes the stemming
setting and whatever other settings you might have, and creates the
appropriate list of factories.

This way you still get the reuse, but you also gain an additional form of
backwards compatibility: even if you change which filters are used from
version to version, you can store which specific filters were used to
create each index.

TX
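
To make the shape of that design concrete, here is a minimal sketch in
plain Java. It is not our actual code and it does not use Lucene:
FilterFactory, ConfigurableAnalyzer and AnalysisLanguage are hypothetical
stand-ins that operate on plain strings, the tokenizer just splits on
whitespace, and the "stemmer" is a toy that strips a trailing "s". The
point is only to show one analyzer driven by a list of factories, with a
per-language object deciding whether stemming goes into that list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for Lucene's factory classes. They work on plain
// strings so that the sketch runs without Lucene on the classpath.
final class AnalysisSketch {

    /** A named token filter, so the chain can be recorded with the index. */
    static final class FilterFactory {
        final String name;
        final UnaryOperator<String> filter;

        FilterFactory(String name, UnaryOperator<String> filter) {
            this.name = name;
            this.filter = filter;
        }
    }

    /** The single analyzer: whitespace tokenizer plus the filters it is given. */
    static final class ConfigurableAnalyzer {
        private final List<FilterFactory> factories;

        ConfigurableAnalyzer(List<FilterFactory> factories) {
            this.factories = List.copyOf(factories);
        }

        List<String> analyze(String text) {
            List<String> out = new ArrayList<>();
            for (String raw : text.trim().split("\\s+")) {
                if (raw.isEmpty()) {
                    continue; // blank input yields no tokens
                }
                String token = raw;
                for (FilterFactory f : factories) {
                    token = f.filter.apply(token);
                }
                out.add(token);
            }
            return out;
        }

        /** Filter names in order; this is what you would store per index. */
        List<String> describe() {
            List<String> names = new ArrayList<>();
            for (FilterFactory f : factories) {
                names.add(f.name);
            }
            return names;
        }
    }

    /** Per-language settings that produce the list of factories. */
    enum AnalysisLanguage {
        ENGLISH;

        List<FilterFactory> factories(boolean stemming) {
            List<FilterFactory> list = new ArrayList<>();
            list.add(new FilterFactory("lowercase",
                    t -> t.toLowerCase(Locale.ROOT)));
            if (stemming) {
                // Toy rule for illustration only, not a real English stemmer.
                list.add(new FilterFactory("toyStem",
                        t -> t.length() > 3 && t.endsWith("s")
                                ? t.substring(0, t.length() - 1) : t));
            }
            return list;
        }
    }
}
```

Calling AnalysisLanguage.ENGLISH.factories(false) simply leaves the stem
filter out of the chain, so no analyzer subclass is needed per language,
and describe() gives you the list of filter names to store alongside each
index. If you are on a recent Lucene, CustomAnalyzer in the
analysis-common module offers a builder that assembles an analyzer from
tokenizer and filter factories in a similar spirit.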