On Sat, Oct 27, 2012 at 1:53 PM, Tom <fivemile...@gmail.com> wrote:
> Hello,
>
> using Lucene 4.0.0b, I am trying to get a superset of all stop words (for
> an international app).
> I have looked around, and not found anything specific. Is this the way to go?
>
> CharArraySet internationalSet = new CharArraySet(Version.LUCENE_40, 10000, false);
> internationalSet.addAll(ArabicAnalyzer.getDefaultStopSet());
> internationalSet.addAll(BulgarianAnalyzer.getDefaultStopSet());

This seems like a bad idea, because you will eventually hit a word
that is a stop word in one language but an important term for someone
searching in another. Even working solely in English, it didn't take
us long to find a stop word that a user actually wanted to search
for...

For international purposes, I would just avoid using stop words.
You're going to have more than enough pain just coming up with a
sensible analysis path (advance warning: in any given language, people
will complain about some feature of StandardAnalyzer).

I assume people still recommend one field per language with a
different analyser on each, which pushes the problem to query
generation time (how the user specifies which language they're
searching in).

TX
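
To make the collision concrete, here is a minimal sketch in plain Java with no Lucene dependency. The two word lists are tiny hand-picked samples, not Lucene's actual default stop sets, and the class and method names (StopWordCollision, removeStops, removeStopsFor) are made up for illustration. It only shows how a merged set drops a term that matters in another language, and how keeping one stop set per language avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopWordCollision {
    // Illustrative samples only; these are NOT Lucene's real stop word lists.
    static final Set<String> ENGLISH_STOPS =
            new HashSet<>(Arrays.asList("the", "a", "and", "of"));
    static final Set<String> GERMAN_STOPS =
            new HashSet<>(Arrays.asList("die", "der", "das", "und"));

    // One stop set per language, mirroring "one field per language".
    static final Map<String, Set<String>> STOPS_BY_LANGUAGE = new HashMap<>();
    static {
        STOPS_BY_LANGUAGE.put("en", ENGLISH_STOPS);
        STOPS_BY_LANGUAGE.put("de", GERMAN_STOPS);
    }

    // The "superset" approach from the question: merge everything into one set.
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> merged = new HashSet<>(a);
        merged.addAll(b);
        return merged;
    }

    // Lowercase, split on whitespace, and drop any token found in the stop set.
    static List<String> removeStops(String text, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Per-language variant: unknown languages get no stop word removal at all.
    static List<String> removeStopsFor(String language, String text) {
        Set<String> stops = STOPS_BY_LANGUAGE.getOrDefault(
                language, Collections.<String>emptySet());
        return removeStops(text, stops);
    }

    public static void main(String[] args) {
        String query = "die hard";
        // English-only set keeps both tokens.
        System.out.println(removeStops(query, ENGLISH_STOPS));
        // Merged set drops "die" because it is a German article.
        System.out.println(removeStops(query, union(ENGLISH_STOPS, GERMAN_STOPS)));
        // Per-language lookup keeps "die" for an English query.
        System.out.println(removeStopsFor("en", query));
    }
}
```

The per-language lookup still needs the caller to say which language a query is in, which is the query generation problem mentioned above.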