RE: Best Practices for Multiple Languages

George Aroush Wed, 29 Aug 2007 18:12:56 -0700

Hi Douglas,

Definitely use one Lucene index per language.  This gives you the
simplicity of maintaining and managing each language's index separately, and
better performance per language, since each per-language index will be
much smaller than one Lucene index holding all of your data.


If in the future you need to search across multiple languages, just use a
MultiSearcher.
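
If you would rather run two separate queries, one per index, and combine the
ID lists yourself (as you describe below), a minimal sketch might look like
the following.  IdMerger and mergeUnique are made-up names for illustration,
and the sketch assumes the IDs have already been pulled out of the hits of
each index as longs; no Lucene calls are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: combines the IDs returned by
// two separate searches (one per language index) into one unique list.
final class IdMerger {

    private IdMerger() {
        // Static utility class; not meant to be instantiated.
    }

    // Returns the IDs from the selected-language search first, followed by
    // any English-only IDs. Duplicates are dropped and first-seen order is
    // kept, because LinkedHashSet preserves insertion order.
    static List<Long> mergeUnique(List<Long> selectedLanguageIds,
                                  List<Long> englishIds) {
        Set<Long> seen = new LinkedHashSet<>(selectedLanguageIds);
        seen.addAll(englishIds);
        return new ArrayList<>(seen);
    }
}
```

The merged list can then be passed to the SQL query that joins back to the
database.  Note that this simple merge ignores relevance scores; it only
guarantees that each ID appears once.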

Regards,

-- George

> -----Original Message-----
> From: Douglas Smith (DataSmithy) [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, August 28, 2007 11:46 PM
> To: [email protected]
> Subject: Best Practices for Multiple Languages
> 
> Hi All,
> 
> Thank you all for your comments earlier on using Lucene as a 
> web service.
> I am looking at Solr, and it does have some potential for my 
> application.
> 
> In the meantime I have this question: What are the 
> recommended best practices for using Lucene to index multiple 
> languages? Particularly, would I be better off using a 
> separate index for each language?
> 
> Here is our scenario: We have a database that has several 
> text fields that will be translated to multiple languages. 
> The text will only be incrementally translated, and it 
> could take years to get it completely translated. Also, new 
> untranslated data is always being added. Also, we may add new 
> languages to be translated at any time. When a user selects 
> to view our web application in a foreign language, we want 
> the user to be able to search in either their language, or in 
> English (in order to guarantee that they can find all data). 
> I probably won't know which language they actually entered 
> for the text search. I want to search Lucene in both the English 
> and the selected language, and return any results that are found.
> 
> FYI, I will be using Lucene to return a list of IDs that are 
> unique to our data, and then joining back to our data, using 
> SQL. I will use our database to show a mix of translated and 
> untranslated data. That is, translated data/fields are shown 
> if we have it, otherwise the default English is shown. So I 
> don't need to get the text itself from Lucene, just a list of 
> IDs that I can use in my SQL query. I can pull out our data 
> easily in either language, or a mix, in order to create 
> Lucene indexes.
> 
> If I can mix languages in a single index, I would like to add 
> a Language column to query on, and query on both the English 
> and the foreign language text. If not, I can see it working 
> to run two separate Lucene queries on two separate 
> indexes, and combining the resulting ID list into a single 
> list (and making it unique, if needed).
> 
> If you have any comments, or feedback from experience doing 
> anything like this, it would be much appreciated!
> 
> Douglas Smith
>
