Re: Multy Language documents indexing

Erick Erickson Thu, 22 Feb 2007 05:11:53 -0800

I know this has been discussed several times, but I sure don't remember the
answers. Search the mail archive for "multiple languages" and you'll find
some good suggestions. As I remember, though, it's not a trivial issue.


But I don't see why the "three different documents" approach wouldn't work.
You could also index the same text in three different fields in a single
document, using a different language analyzer for each field (see
PerFieldAnalyzerWrapper).
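
To make the per-field idea concrete, here is a rough sketch in plain Java with no Lucene dependency. It only illustrates the routing: each field name gets its own analysis function, with a default for everything else. The class, the method names, the field names (body_en, body_zh, title) and the toy analyzers are all hypothetical and do not mirror Lucene's actual API; in a real index, PerFieldAnalyzerWrapper and Lucene's language analyzers would take their place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Concept sketch only: plain Java, no Lucene. All names are hypothetical.
class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    public PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a language-specific analyzer for one field name.
    public void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Route the text to the analyzer registered for the field, else the default.
    public List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Default: split on whitespace and keep the original case.
    public static List<String> whitespaceAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Toy stand-in for an English analyzer: lowercase, split on non-letters.
    public static List<String> englishAnalyzer(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Toy stand-in for a CJK analyzer: overlapping bigrams of letter code points.
    public static List<String> bigramAnalyzer(String text) {
        int[] cps = text.codePoints().filter(Character::isLetter).toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 1) tokens.add(new String(cps, 0, 1));
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::whitespaceAnalyzer);
        wrapper.addAnalyzer("body_en", PerFieldSketch::englishAnalyzer);
        wrapper.addAnalyzer("body_zh", PerFieldSketch::bigramAnalyzer);
        // The same text is analyzed differently depending on the field.
        String text = "Red Fox";
        for (String field : Arrays.asList("body_en", "body_zh", "title")) {
            System.out.println(field + " -> " + wrapper.analyze(field, text));
        }
    }
}
```

The point of the design is that the document is stored once, so there is nothing to map back to a parent; the cost is that every field is analyzed over the full text, including the parts written in other languages.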

Erick

On 2/22/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Hi All,

Our application, which uses Lucene for indexing, will be used to index
documents, each of which contains parts written in different
languages. For example, a document could contain English, Chinese and
Brazilian Portuguese text. How should such a document be indexed? Is there
a best practice for this?

What comes to mind is to index 3 different Lucene Documents for each
real document and keep in a database the meta info that these 3
Documents belong to our real doc. For example, for myDoc.doc we
would have myDocEn.doc, myDocCn.doc and myDocBr.doc in the index, and
when a search finds the word in myDocCn.doc we would
show myDoc.doc to the user. The disadvantage is that in this case the
occurrences of the searched item will have to be recalculated. This
matters for queries like "Red NEAR/10 fox". So if someone knows a better
practice than this, please let me know.
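
The bookkeeping described above could look roughly like the following plain Java sketch. It is only an illustration under several assumptions: an in-memory map stands in for the database, each part is assumed to be one contiguous run of tokens whose starting token offset in the real document is recorded at index time, and NEAR/n is taken to mean "at most n token positions apart". The class and method names are hypothetical, and the offsets in the example are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the sub-document bookkeeping; a HashMap stands in for the database.
class SubDocMapping {
    // Meta info for one indexed part: its real document and where it starts there.
    private static final class Part {
        final String parent;
        final int startOffset;

        Part(String parent, int startOffset) {
            this.parent = parent;
            this.startOffset = startOffset;
        }
    }

    private final Map<String, Part> parts = new HashMap<>();

    // Record that subDocId is a part of parent, starting at the given token offset.
    public void register(String subDocId, String parent, int startOffset) {
        parts.put(subDocId, new Part(parent, startOffset));
    }

    private Part lookup(String subDocId) {
        Part p = parts.get(subDocId);
        if (p == null) {
            throw new IllegalArgumentException("unknown sub-document: " + subDocId);
        }
        return p;
    }

    // The real document to show to the user for a hit in subDocId.
    public String parentOf(String subDocId) {
        return lookup(subDocId).parent;
    }

    // Translate a token position inside a part into a position in the real document.
    public int toParentPosition(String subDocId, int positionInPart) {
        return lookup(subDocId).startOffset + positionInPart;
    }

    // True if the two hits belong to the same real document and lie within
    // maxDistance token positions of each other (assumed NEAR semantics).
    public boolean near(String subDocA, int posA, String subDocB, int posB, int maxDistance) {
        if (!parentOf(subDocA).equals(parentOf(subDocB))) {
            return false;
        }
        int distance = Math.abs(toParentPosition(subDocA, posA) - toParentPosition(subDocB, posB));
        return distance <= maxDistance;
    }

    public static void main(String[] args) {
        SubDocMapping mapping = new SubDocMapping();
        mapping.register("myDocEn.doc", "myDoc.doc", 0);   // made-up offsets
        mapping.register("myDocCn.doc", "myDoc.doc", 120);
        System.out.println(mapping.parentOf("myDocCn.doc"));
        System.out.println(mapping.near("myDocEn.doc", 118, "myDocCn.doc", 3, 10));
    }
}
```

This only covers hits that fall in different parts; a proximity query whose terms both fall inside one part can be answered by the index directly.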

Thanks in advance,
Ivan

