Re: Designing a multilingual index

David Vergnaud Thu, 01 Apr 2010 07:42:19 -0700

Hi,

thx for sharing your experience with us. I'm happy to see that both methods 
I've thought of are apparently sensible ;-)


However, it might be due to my lack of experience in that domain, but some of 
your arguments in favor of a multi-index solution seem to me to be also 
compatible with a single index, or to apply only in a particular situation:

> Maintenance was easier this
> way; when refining/updating the settings (say 
> adding synonyms or stemmers
> for instance), you may need to reindex and smaller 
> indices allow faster deployments. 
If I understand this correctly, that is true if you're indexing distinct 
collections of documents that are all in the same language, which you know in 
advance. In that case, you would of course only need to reindex one of these 
collections into the corresponding index to update it. In my case though, there 
is no such separation and only upon processing a document can I decide what 
language it is in. I therefore cannot decide to update only documents in one 
language, it's a "all or nothing". 

> It's also "dead-easy" to add a new language (esp. compared to
> the one index solution).
I'm not really sure why adding new languages should be complicated in the 
one-index solution: the languages I can process (i.e. languages for which I 
have recognition data and the proper analyzer) are listed in a configuration 
file. Upon recognizing one of these languages, the module stores it in the 
internal data structure, which is then eventually passed to the indexing 
module. The latter retrieves the language from the structure and uses it to 
create the corresponding fields in the index dynamically (e.g. "content-de", 
"title-en" etc). It's all really automatic. 

> It also makes replication or partitioning easier.
Not sure what you mean by that... 

> Users were able to set in which language they were "fluent" (default being
> browser locale) so queries would only be performed in those and results
> "clustered" per locale (no need to return results that can not be
> understood...). Besides, IMO, scoring / ordering documents in different
> languages is a bit like comparing apples and oranges.

So if I understand correctly you'd always only search in one of the indices? In 
that case, I can understand why the multi-index solution works better for you. 
However, assuming most of my users speak at least English *and* one, two or 
three of the other languages, all searches have to be conducted in all 
languages. Besides, the various languages are not localized versions of the 
same content, they may really be anything in any language, depending on the 
person who wrote them and their mood at the time of writing. Some documents 
also contains parts in English and parts in German... I'm definitely thinking 
about making it possible to enable/disable some languages, but parallel 
searching will definitely be required. 

> Besides, IMO, scoring / ordering documents in different
> languages is a bit like comparing apples and oranges.
Not too sure about that. If for instance you were to search for "firewall" in a 
German/English index, the word may appear in documents in both languages. 
Lucene's ranking algorithm is based on the number of tokens and the number of 
occurrences in fields, and while of course the ratio may vary (e.g. German 
tends to collate several words into one single entity, resulting in one single 
token, while English rather uses phrases, resulting in several tokens), I still 
think, assuming the user can understand both, that it kind of makes sense to 
rank a short German document where "firewall" occurs 10 times higher than an 
English document where the same word occurs only 5 times. 

> Finally, query expansion can also be used in the multiple indices case and
> might even use automated/guided translation.
I do agree about that, and I guess query expansion is much easier when dealing 
with a single, consistent index structure. I'm thinking writing an algorithm to 
automatically expand a query to all languages might be ok, but I'm worried 
about the performance of performing such a query if the number of languages 
grows larger....

One other problem I'm seeing with the multi-index solution is the ranking of 
documents that are spanned across several indices (multilingual documents). If 
say my search term matches a document in the English index and the same 
document in the French index (which would quite often be the case for e.g. 
proper names), then how do I get about mixing the two rankings? (as I don't 
want to display the same result twice) I think using a single index would solve 
that problem, since the ranking would already take all fields (hence all 
languages into account). 

Cheers,

David

----- Original Message ----
From: henrib <[email protected]>
To: [email protected]
Sent: Thu, April 1, 2010 2:19:07 PM
Subject: Re: Designing a multilingual index


Hi,
I worked some time ago on a similar system (using Solr) and used the
multiple indices route (the multicore feature in Solr). In our case, the
"same" document could exist in different languages; different localized
versions of the same information (same Solr unique id for each l10n
version). 

This allowed to have the same index structure across locales but different
settings for each (synonyms, stemmers, etc). Maintenance was easier this
way; when refining/updating the settings (say adding synonyms or stemmers
for instance), you may need to reindex and smaller indices allow faster
deployments. It's also "dead-easy" to add a new language (esp. compared to
the one index solution). It also makes replication or partitioning easier.
Overall, IMO, this is a more scalable architecture than the single-index
one.

Users were able to set in which language they were "fluent" (default being
browser locale) so queries would only be performed in those and results
"clustered" per locale (no need to return results that can not be
understood...). Besides, IMO, scoring / ordering documents in different
languages is a bit like comparing apples and oranges.

Finally, query expansion can also be used in the multiple indices case and
might even use automated/guided translation.

In my experience, multiple indices had many advantages over the single index
solution, be them functional or operational. YMMV.
Hope this helps,
Henrib

-- 
View this message in context: 
http://n3.nabble.com/Designing-a-multilingual-index-tp688766p690625.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


      

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Designing a multilingual index

Reply via email to