Thanks all for the useful comments. It seems that there are even more options --
4/ One index, with a separate Lucene document for each (item, language)
   combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that
   include the language (e.g. title_en, title_cn)

I quite like 4, because you can search with no language constraint, or
with one, as Paul suggests below. However, some "non language-specific"
data (e.g. dates) might need to be repeated in every document, unless we
had an extra Lucene document for all that.

I wonder what the various pros and cons in terms of index size and
performance would be in each case? I really don't have enough knowledge
of Lucene to have any idea...

Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/

> -----Original Message-----
> From: Paul Libbrecht [mailto:[EMAIL PROTECTED]
> Sent: 01 June 2005 04:10
> To: java-user@lucene.apache.org
> Subject: Re: Indexing multiple languages
>
> On 1 June 05, at 01:12, Erik Hatcher wrote:
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so
> >> searches can be constrained to a particular language
> >> 3/ separate indices for each language?
> > I would vote for option #2 as it gives the most flexibility - you can
> > query with or without concern for language.
>
> The way I've solved this is to make a different field-name per-language
> as our documents can be multilingual.
> What's then done is query expansion at query time: given a term-query
> for text, I duplicate it for each accepted language of the user with a
> factor related to the preference of the language (e.g. the q factor in
> Accept-Language http header). Presumably I could be using solution 2/
> as well if my queries become too big, making several documents for each
> language of the document.
>
> I think it's very important to care about guessing the accepted
> languages of the user.
> Typically, the default behaviour of Google is to only give you matches
> in your primary language but then allow expansion in any language.
>
> >> On the other hand, if people are searching for proper nouns in
> >> metadata (e.g. "DSpace") it may be advantageous to search all
> >> languages at once.
>
> This one may need particular treatment.
>
> Tell us your success!
>
> paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
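
For what it's worth, the query expansion Paul describes (one clause per accepted language, weighted by the q factor from Accept-Language) might look roughly like the sketch below when combined with the per-language field names of option 5. This is only an illustration under assumptions, not Paul's actual code: the class and method names are made up, the field naming scheme (base name + "_" + language code) is assumed from the title_en example, and the output is a plain query string using the standard Lucene query parser syntax (field:term^boost) rather than anything built through the Lucene API. It also assumes the term is a single token that needs no escaping.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: expands one term into per-language field clauses.
final class LanguageQueryExpander {

    // Parses an Accept-Language header into primary language -> highest q value.
    // Region subtags are dropped ("en-GB" becomes "en"); "*" and q=0 are skipped.
    static Map<String, Double> parseAcceptLanguage(String header) {
        Map<String, Double> prefs = new LinkedHashMap<>();
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String tag = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty() || tag.equals("*")) {
                continue;
            }
            String lang = tag.split("-")[0];
            double q = 1.0; // a missing q parameter means full preference
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim().toLowerCase(Locale.ROOT);
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed weight as "not accepted"
                    }
                }
            }
            if (q > 0.0) {
                prefs.merge(lang, q, Math::max);
            }
        }
        return prefs;
    }

    // Builds e.g. "title_en:dspace^1.0 OR title_fr:dspace^0.5".
    // Assumes fields are named <field>_<language> and the term needs no escaping.
    static String expand(String field, String term, Map<String, Double> prefs) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> pref : prefs.entrySet()) {
            query.add(field + "_" + pref.getKey() + ":" + term + "^" + pref.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> prefs = parseAcceptLanguage("en-GB,en;q=0.9,fr;q=0.5");
        System.out.println(expand("title", "dspace", prefs));
    }
}
```

With option 4 the same parsed preferences could instead drive a constraint on the single language field, at the cost of repeating the non language-specific data in each document as noted above.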