RE: Multi Language support

Eric Isakson Thu, 06 Mar 2003 06:56:55 -0800

Hi Günter,

I had a similar requirement for my use of Lucene. We have documents with mixed 
languages: some of the text is in the user's native language and some is in English. 
We decided not to use any of the stemming analyzers and to index without stop words (I 
didn't like the decision to drop stop words, but it wasn't really my call). Here is my 
analyzer's tokenStream method:


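    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize only: no stemming and no stop word removal.
        TokenStream result = new StandardTokenizer(reader);
        // Normalize tokens (strips possessive 's and dots from acronyms).
        result = new StandardFilter(result);
        // Lower-case every token so matching is case-insensitive.
        result = new LowerCaseFilter(result);
        return result;
    }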

Do you really need stemming in your application? Do you really need stop words?

See this note http://archives.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=653731 for 
a discussion about the advantages/disadvantages of stemming.

If you still want stop words, you can create a list that includes words from more than 
one language, then use the same analyzer for all of your content.
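
A minimal sketch of building such a combined list in plain Java, in case it helps. The 
class name, the method name and the sample words are my own placeholders, not part of 
Lucene, and the word lists are far from complete. The merged array is the kind of thing 
you could hand to a stop filter in your analyzer.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class CombinedStopWords {
    // Illustrative samples only, not complete stop word lists.
    static final String[] ENGLISH = {"the", "and", "of", "in"};
    static final String[] GERMAN = {"der", "die", "das", "und", "in"};

    /** Merges per-language lists into one lower-cased array without duplicates. */
    static String[] merge(String[]... lists) {
        Set<String> merged = new LinkedHashSet<>();
        for (String[] list : lists) {
            for (String word : list) {
                merged.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // "in" appears in both lists but is kept only once.
        System.out.println(Arrays.toString(merge(ENGLISH, GERMAN)));
    }
}
```

One thing to watch for: a word that is a stop word in one language may be a meaningful 
term in another (German "die" versus English "die"), so a combined list drops it for 
all content.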

If you still need stemming, the query has to be stemmed with the same analyzer that 
was used at index time, so you will probably have to let your users tell you which 
language they wish to search. At that point you would probably be better off 
maintaining a separate index for each language.
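
A rough sketch of the bookkeeping for separate per-language indices, assuming the user 
picks a language code in the search form. LanguageIndexRouter, its methods and the 
directory paths are hypothetical names for illustration; opening the index and choosing 
the matching language analyzer for the returned directory is left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class LanguageIndexRouter {
    // Maps a lower-cased language code to the directory holding that language's index.
    private final Map<String, String> indexDirs = new LinkedHashMap<>();

    void register(String languageCode, String indexDir) {
        indexDirs.put(languageCode.toLowerCase(Locale.ROOT), indexDir);
    }

    /** Returns the index directory for the language the user chose. */
    String indexDirFor(String languageCode) {
        String dir = indexDirs.get(languageCode.toLowerCase(Locale.ROOT));
        if (dir == null) {
            throw new IllegalArgumentException("No index for language: " + languageCode);
        }
        return dir;
    }

    public static void main(String[] args) {
        LanguageIndexRouter router = new LanguageIndexRouter();
        router.register("de", "/indexes/de"); // hypothetical paths
        router.register("en", "/indexes/en");
        System.out.println(router.indexDirFor("DE"));
    }
}
```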

Best of luck,
Eric


-----Original Message-----
From: Günter Kukies [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 06, 2003 2:08 AM
To: Lucene Users List
Subject: Multi Language support


Hello,

This is what I know about indexing international documents:

1. I have a language ID
2. with this ID I choose a special Analyzer for that language 
3. I can use one index for all languages

But what about searching for international documents?

I don't have a language ID, because the user is interested in documents in his 
native language and in a second language, mostly English. So which Analyzer do I use 
for searching?


Thanks

Günter
