Hi Günter,
I had a similar requirement for my use of Lucene. We have documents with mixed
languages, some of the text in the user's native language and some in English. We
decided not to use any of the stemming analyzers and to index without stop words (I
didn't like the no-stop-words decision, but it wasn't really my call). My analyzer's
tokenStream method:
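    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the text into tokens; no language-specific stemming.
        TokenStream result = new StandardTokenizer(reader);
        // Standard token cleanup, then lower-casing. No stop filter is applied.
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        return result;
    }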
Do you really need stemming in your application? Do you really need stop words?
See this note http://archives.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=653731 for
a discussion about the advantages/disadvantages of stemming.
If you still want stop words, you can create a list that includes words from more than
one language, then use the same analyzer for all of your content.
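As a rough sketch of the combined stop list idea, in plain Java with no Lucene dependency: the class name CombinedStopWords and the two tiny word lists are made up for illustration, and a real list would be much longer.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: merges per-language stop word lists into one list.
final class CombinedStopWords {
    // Deliberately tiny, illustrative lists (not complete stop word sets).
    static final String[] ENGLISH = {"a", "an", "and", "the", "of"};
    static final String[] GERMAN = {"der", "die", "das", "und", "ein"};

    // Lower-cases every word and drops duplicates, keeping first-seen order.
    static String[] merge(String[]... lists) {
        Set<String> merged = new LinkedHashSet<String>();
        for (String[] list : lists) {
            for (String word : list) {
                merged.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String word : merge(ENGLISH, GERMAN)) {
            System.out.println(word);
        }
    }
}
```

The merged array could then be handed to whatever stop filter your analyzer uses. I am assuming here that it accepts a String[] of stop words, so check the constructor in your Lucene version.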
If you still need stemming, you will probably have to let your users tell you which
language they wish to search, and at that point you would probably be better off
maintaining a separate index for each language.
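If you go the separate-indices route, the lookup from the user's chosen language to the index to search could be sketched like this. This is plain Java only; LanguageIndexRouter, its methods, and the index paths are hypothetical names for illustration, and opening the index and picking the matching analyzer are left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps a user-selected language code to an index location.
final class LanguageIndexRouter {
    private final Map<String, String> indexPathByLanguage = new HashMap<String, String>();

    // Registers the index location for one language code, e.g. "de".
    void register(String languageCode, String indexPath) {
        indexPathByLanguage.put(normalize(languageCode), indexPath);
    }

    // Returns the index location for the language the user picked,
    // or throws if no index was registered for it.
    String indexPathFor(String languageCode) {
        String path = indexPathByLanguage.get(normalize(languageCode));
        if (path == null) {
            throw new IllegalArgumentException(
                "No index registered for language: " + languageCode);
        }
        return path;
    }

    // Treats " DE " and "de" as the same language code.
    private static String normalize(String code) {
        return code.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LanguageIndexRouter router = new LanguageIndexRouter();
        router.register("en", "/indexes/en"); // hypothetical paths
        router.register("de", "/indexes/de");
        System.out.println(router.indexPathFor("de"));
    }
}
```

The same language code would also be the natural key for choosing the analyzer, so that a query is analyzed the same way the documents in that index were.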
Best of luck,
Eric
-----Original Message-----
From: Günter Kukies [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 06, 2003 2:08 AM
To: Lucene Users List
Subject: Multi Language support
Hello,
This is what I know about indexing international documents:
1. I have a language ID
2. with this ID I choose a special Analyzer for that language
3. I can use one index for all languages
But what about searching for international documents?
I don't have a language ID, because the user is interested in documents in his
native language and in a second language, mostly English. So, which Analyzer do I use
for searching?
Thanks
Günter