Best Practices for Multiple Languages

Douglas Smith (DataSmithy) Tue, 28 Aug 2007 20:47:53 -0700

Hi All,

Thank you all for your comments earlier on using Lucene as a web service.
I am looking at Solr, and it does have some potential for my application.


In the meantime I have this question: What are the recommended best
practices for using Lucene to index multiple languages? Particularly,
would I be better off using a separate index for each language?

Here is our scenario: We have a database that has several text fields
that will be translated to multiple languages. The text will only be
incrementally translated, and it could take years to get it completely
translated. New untranslated data is always being added, and we
may add new languages to be translated at any time. When a user selects
to view our web application in a foreign language, we want the user to
be able to search in either their language, or in English (in order to
guarantee that they can find all data). I probably won't know which
language they actually entered for the text search. I want to search Lucene
in both the English and the selected language, and return any results
that are found.

FYI, I will be using Lucene to return a list of IDs that are unique to
our data, and then joining back to our data, using SQL. I will use our
database to show a mix of translated and untranslated data. That is,
translated data/fields are shown if we have them, otherwise the default
English is shown. So I don't need to get the text itself from Lucene,
just a list of IDs that I can use in my SQL query. I can pull out our
data easily in either language, or a mix, in order to create Lucene
indexes.
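
To make the join-back step concrete, here is a rough sketch of how I
imagine turning the ID list into a parameterized SQL query. The class
name, the method name, and the table and column names (items, id) are
all made up for illustration and are not our real schema.

```java
import java.util.Collections;

public class IdQueryBuilder {
    // Build a parameterized query with one "?" placeholder per ID.
    // "items" and "id" are hypothetical names used only for this sketch.
    public static String buildSelectByIds(int idCount) {
        if (idCount <= 0) {
            throw new IllegalArgumentException("need at least one ID");
        }
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "SELECT * FROM items WHERE id IN (" + placeholders + ")";
    }
}
```

The IDs themselves would then be bound to the placeholders through a
PreparedStatement rather than concatenated into the SQL string.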

If I can mix languages in a single index, I would like to add a Language
column to query on, and query on both the English and the foreign
language text. If not, I can see it working to run two separate
Lucene queries on two separate indexes, and combine the resulting ID
lists into a single list (making it unique, if needed).
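
For the two-index approach, the merge step I have in mind is roughly
the following sketch. It assumes the IDs come back from each search as
a list of longs; the class and method names are hypothetical, and no
Lucene calls are shown, only the combining of the two ID lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IdMerger {
    // Combine the IDs from the English search and the selected-language
    // search, dropping duplicates while keeping first-seen order.
    public static List<Long> mergeUnique(List<Long> englishIds, List<Long> translatedIds) {
        Set<Long> merged = new LinkedHashSet<>(englishIds);
        merged.addAll(translatedIds);
        return new ArrayList<>(merged);
    }
}
```

A LinkedHashSet keeps the English hits first, followed by any IDs found
only in the selected language. This ignores relevance scores, so it is
only adequate if the final ordering comes from the SQL side.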

If you have any comments, or feedback from experience doing anything
like this, it would be much appreciated!

Douglas Smith
