Doug, what would be the best way to handle cross-language indexing and searching?
The issue is that when indexing web sites or intranet documents, one might come across
documents in different languages. Assuming that the language can be detected, and an
analyzer for that language is available, one could then create documents that have
fields of the form search_<LANG>, where <LANG> is a Java Locale code (or
something similar).
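
For concreteness, here is a minimal indexing sketch along those lines. It uses the
classic Lucene API (Field constructor and per-document analyzer signatures vary
across versions), and detectLanguage() / analyzerFor() are hypothetical
application-level helpers, not part of Lucene:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MultiLangIndexer {

        // Hypothetical hook: plug a real language detector in here.
        static String detectLanguage(String text) {
            return "en";
        }

        // Hypothetical hook: map a language code to its analyzer,
        // e.g. "de" -> a German analyzer, "en" -> StandardAnalyzer.
        static Analyzer analyzerFor(String lang) {
            return new StandardAnalyzer();
        }

        static void indexPage(IndexWriter writer, String text) throws IOException {
            String lang = detectLanguage(text);
            Document doc = new Document();
            // One tokenized field per language: "search_en", "search_de", ...
            doc.add(new Field("search_" + lang, text,
                              Field.Store.NO, Field.Index.TOKENIZED));
            // Classic IndexWriter accepts a per-document analyzer override,
            // sidestepping the one-analyzer-per-index limitation at write time.
            writer.addDocument(doc, analyzerFor(lang));
        }
    }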

When a query is constructed, it can be expanded: take the query in the user's
language, "translate" it term by term using a dictionary lookup, and then create an
OR-ed query where each language's query component is run against the
correspondingly named field.
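
Something like this sketch, using the classic BooleanQuery API; the dictionary
lookup itself is assumed to happen elsewhere, and termByLang is a hypothetical
map from language code to the translated term:

    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CrossLangQueryBuilder {

        // termByLang maps a language code to the dictionary translation of
        // the user's term, e.g. {"en" -> "house", "de" -> "haus"}.
        static Query expand(Map<String, String> termByLang) {
            BooleanQuery query = new BooleanQuery();
            for (Map.Entry<String, String> e : termByLang.entrySet()) {
                // OR each translation against its correspondingly named field.
                query.add(new TermQuery(new Term("search_" + e.getKey(),
                                                 e.getValue())),
                          BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }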

What do you think of this approach? Is there a better way?
It seems that this would bypass the single-analyzer limitation you mentioned, since
the analysis is done by custom code before the query is submitted (and by other
custom code during indexing). Am I right on this one?
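
That query-side analysis step might look roughly like this, driving a
language-specific analyzer by hand so nothing is left for Lucene to analyze at
query time (this is the old TokenStream API, where next() returns a Token):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class QueryTextAnalyzer {

        // Run the user's query text through a language-specific analyzer
        // by hand and collect the resulting terms.
        static List<String> analyze(Analyzer analyzer, String field, String text)
                throws IOException {
            List<String> terms = new ArrayList<String>();
            TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
            for (Token token = stream.next(); token != null; token = stream.next()) {
                terms.add(token.termText());
            }
            stream.close();
            return terms;
        }
    }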

Dmitry.


==================================================
Given that Lucene only supports one analyzer per index, the latter (doing the
analysis in custom code outside Lucene) seems like what's needed.

Another approach is to change Lucene's index to track which fields were
tokenized and which weren't.  This would be fairly easy to add.  Then you
could simply pass the IndexReader to the query parser and not analyze
untokenized fields.  If that sounds like a sufficient solution, then I would
be willing to add tracking of which fields are tokenized to the indexing
code.
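
Roughly, the query-parser side might then look like this; isTokenized() is a
hypothetical stand-in for the proposed tracking, not an existing Lucene API:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class TokenizationAwareParser {

        // Hypothetical: would consult the proposed per-field tracking
        // stored in the index.
        static boolean isTokenized(IndexReader reader, String field) {
            return true;
        }

        static Query parseTerm(IndexReader reader, String field, String text) {
            if (!isTokenized(reader, field)) {
                // Untokenized field: match the raw text as a single term,
                // skipping analysis entirely.
                return new TermQuery(new Term(field, text));
            }
            // Tokenized field: analyze the text as usual (omitted in this sketch).
            throw new UnsupportedOperationException("analysis step omitted");
        }
    }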

Doug

