Hi, thanks Paul for your input. I'm gonna try the "localized field" variant and see how it works for me.
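Concretely, I'm planning something along these lines -- just a sketch against the Lucene 3.0 API, with the contrib FrenchAnalyzer/GermanAnalyzer standing in for whatever per-language analyzers I end up choosing, and the field names from my earlier mail:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class LocalizedFieldAnalyzer {

    /** Maps each localized field to its language's analyzer; unmapped fields
     *  (e.g. "content-xx" for "no language recognized") fall back to the
     *  default StandardAnalyzer. The same wrapper gets passed to the
     *  IndexWriter and to the query parser, so indexing and searching
     *  stay consistent. */
    public static Analyzer create() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("title-fr",   new FrenchAnalyzer(Version.LUCENE_30));
        perField.put("content-fr", new FrenchAnalyzer(Version.LUCENE_30));
        perField.put("title-de",   new GermanAnalyzer(Version.LUCENE_30));
        perField.put("content-de", new GermanAnalyzer(Version.LUCENE_30));
        return new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_30), perField);
    }
}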
I think your idea of automatically boosting the user language is neat, but it should definitely be possible to disable this boosting... Most users have no idea about the language settings in their browser, which drive the contents of the "Accept-Language" header, and e.g. here in Switzerland there are many foreigners whose preferred language is not French, German or Italian, so forcing a boost on the user could definitely result in a poor user experience.

Does anyone have technical arguments for why one method (several indices) or the other (localized fields in a single index) might be better?

Cheers,
David


----- Original Message ----
From: Paul Libbrecht <p...@activemath.org>
To: java-user@lucene.apache.org
Sent: Wed, March 31, 2010 10:00:14 PM
Subject: Re: Designing a multilingual index

David,

I'm doing exactly that. And I think there's one crucial advantage aside: multilingual queries. If your user requests "segment", you have no way to know which language he is searching in; erm, well, you have the user language(s) (through the browser Accept-Language header, for example), so you'll understand he meant to search in French but would also accept matches in other languages, just less boosted.

So I "expand" the query from "segment" in a French environment to:

  title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-fr:segment^1.2 wor text-en:segment^1.1

(wor is my name for the weighted-or, which is simply what a "should" boolean query gives you.)

Surprisingly, I haven't seen many people talk about "query expansion", but I think it is rather systematic and could become more a part of the culture of search engines...

paul

On 31 March 2010 at 18:20, David Vergnaud wrote:

> The second method I've thought of is to have all languages in the same index
> and use different analyzers on fields that require analysis. To do that, I
> was thinking of extending the names of the fields with the names of the
> languages -- e.g. "content-en" vs "content-fr" vs "content-xx" (for "no
> language recognized"). Then, using a customized analyzer, the name of the
> field would be parsed in the tokenStream method and the proper
> language-dependent analyzer would be selected.
> The drawback of this method, as I see it, is that the number of fields in
> the index increases drastically, which in turn means that building queries
> becomes rather cumbersome -- but still doable, assuming (as is the case)
> that I know the exact list of languages I'm dealing with. It also means
> that Lucene would be searching non-existent fields in most documents, as I
> doubt many of them would contain *all* languages. But it keeps the complete
> information about one document gathered in one place and requires searching
> only one index.
>
> As I said, I already implemented the first method some time ago and it
> works fine. I only thought of the second one when I read about
> PerFieldAnalyzerWrapper, which does just what I want for the second method.
> Since my index won't be that big at first, I doubt either architecture
> would prove much more efficient than the other; however, I want a scalable
> design right from the start, so I was wondering whether some Lucene gurus
> might give me some insights as to which approach they consider better -- or
> whether there might be a different, much better technique I haven't thought
> of.
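P.S. For the archives, here's roughly how I'd code Paul's weighted-or expansion, with the user-language boost made switchable per my point above. Again just a sketch against the Lucene 3.x API; the boost values, language list and field names are illustrative, not anything Paul prescribed:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MultilingualQueryExpander {

    private static final String[] LANGS = { "en", "fr", "de", "it", "xx" };

    /** Expands one already-analyzed term over all localized fields as a
     *  "weighted or": every clause is SHOULD, titles are boosted over
     *  content, and, if enabled, the user's language over the others. */
    public static Query expand(String term, String userLang,
                               boolean boostUserLang) {
        BooleanQuery query = new BooleanQuery();
        for (String lang : LANGS) {
            float langBoost =
                (boostUserLang && lang.equals(userLang)) ? 1.2f : 1.0f;

            Query title = new TermQuery(new Term("title-" + lang, term));
            title.setBoost(1.2f * langBoost);   // e.g. title-fr:segment^1.44
            query.add(title, Occur.SHOULD);

            Query content = new TermQuery(new Term("content-" + lang, term));
            content.setBoost(langBoost);        // e.g. content-fr:segment^1.2
            query.add(content, Occur.SHOULD);
        }
        return query;
    }
}

Calling expand("segment", "fr", false) would then treat all languages equally, which is what I'd want as the default whenever the Accept-Language header can't be trusted.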
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------