On Sat, Oct 27, 2012 at 1:53 PM, Tom <fivemile...@gmail.com> wrote:
> Hello,
>
> using Lucene 4.0.0b, I am trying to get a superset of all stop words (for
> an international app).
> I have looked around, and not found anything specific. Is this the way to go?
>
> CharArraySet internationalSet = new CharArraySet(Version.LUCENE_40, 10000, false);
> internationalSet.addAll(ArabicAnalyzer.getDefaultStopSet());
> internationalSet.addAll(BulgarianAnalyzer.getDefaultStopSet());

This seems like a bad idea, because sooner or later you're going to hit a
word that is a stop word in one language but important to someone
searching in another. Even working solely in English, it didn't take us
long to find a stop word that one user actually wanted to search
for...
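
To make the collision concrete, here is a minimal sketch in plain Java.
It does not use Lucene at all: the two tiny word lists are hand-picked
for illustration and are not the analyzers' real default stop sets, and
the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MergedStopWords {
    // Illustrative lists only; NOT Lucene's default stop sets.
    static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));
    static final Set<String> GERMAN =
            new HashSet<>(Arrays.asList("der", "die", "das", "und"));

    // The "superset" approach: one set containing every language's stop words.
    static Set<String> merged() {
        Set<String> all = new HashSet<>(ENGLISH);
        all.addAll(GERMAN);
        return all;
    }

    // Lowercase, split on whitespace, and drop any token found in stopWords.
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // English-only stop words keep the verb "die".
        System.out.println(filter("never say die", ENGLISH));  // [never, say, die]
        // The merged set treats "die" as the German article and drops it.
        System.out.println(filter("never say die", merged())); // [never, say]
    }
}
```

With the merged set, an English user searching for "die" gets nothing
back, because the term was thrown away as a German stop word.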

For international purposes, I would just avoid using stop words.
You're going to have more than enough pain just coming up with a
sensible analysis path (advance warning: in any given language people
will complain about some feature in StandardAnalyzer.)

I assume people still recommend one field per language with a
different analyser on each, which pushes the problem to query
generation time (how the user specifies which language they're
searching for.)
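
As a rough sketch of what that query-time step might look like, the
plain Java below picks a per-language field name from a language code.
Everything in it is an assumption for illustration: the "body_en" style
naming, the list of supported codes, and the fallback field are not from
Lucene or from the original question. On the indexing side, Lucene's
PerFieldAnalyzerWrapper is the usual way to attach a different analyzer
to each such field, but that part is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class LanguageFields {
    // Hypothetical: languages that were indexed into their own field.
    static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("en", "de", "ar", "bg"));

    // Hypothetical: suffix of a catch-all field for every other language.
    static final String FALLBACK = "generic";

    // Maps ("body", "de") to "body_de"; unknown or missing codes use the fallback.
    static String fieldFor(String baseField, String languageCode) {
        String code = languageCode == null
                ? ""
                : languageCode.toLowerCase(Locale.ROOT);
        String suffix = SUPPORTED.contains(code) ? code : FALLBACK;
        return baseField + "_" + suffix;
    }

    // Builds a simple field:term string; the term is NOT escaped here.
    static String queryFor(String baseField, String languageCode, String term) {
        return fieldFor(baseField, languageCode) + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(queryFor("body", "de", "die"));  // body_de:die
        System.out.println(queryFor("body", "EN", "die"));  // body_en:die
        System.out.println(queryFor("body", "fr", "avec")); // body_generic:avec
    }
}
```

The hard part is still where the language code comes from: a UI
selector, the user's locale, or some kind of detection on the query
text.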

TX
