David,
I'm doing exactly that.
And I think there's one crucial advantage besides: multilingual queries.
If your user searches for "segment" you have no way of knowing which
language he is searching in; well, you do have the user's language(s)
(through the browser Accept-Language header, for example), so you can
assume he meant to search in French but would also accept matches in
other languages, just boosted less.
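If your front end is a servlet, the container already parses that
header for you; here's a minimal sketch (the helper name
acceptedLanguages is just mine):

    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;
    import java.util.Locale;
    import javax.servlet.http.HttpServletRequest;

    // The servlet container builds these from the Accept-Language header;
    // getLocales() is ordered by decreasing user preference.
    public static List<Locale> acceptedLanguages(HttpServletRequest request) {
        List<Locale> locales = new ArrayList<Locale>();
        for (Enumeration<?> e = request.getLocales(); e.hasMoreElements();) {
            locales.add((Locale) e.nextElement());
        }
        return locales;
    }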
So I "expand" the query from "segment" in a french environment to:
title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-
fr:segment:1.2 wor text-en:segment:1.1 ...
(wor is my naming of the weighted-or which is the normal thing of a
"should" boolean query)
Surprisingly, I haven't seen many people talk about "query expansion",
but I think the need for it is fairly systematic and it could become
more a part of the culture of search engines...
paul
On 31 March 2010, at 18:20, David Vergnaud wrote:
The second method I've thought of is to have all languages in the
same index and use different analyzers on fields that require
analysis. In order to do that, I was thinking of extending the field
names with the names of the languages -- e.g. "content-en" vs.
"content-fr" vs. "content-xx" (for "no language recognized"). Then,
using a customized analyzer, the field name would be parsed in the
tokenStream method and the proper language-dependent analyzer would
be selected.
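A bare-bones sketch of what I have in mind (Lucene 3.x; the class
name and the hard-coded language check are only illustrative, and
FrenchAnalyzer comes from the contrib analyzers):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // Dispatches on the language suffix of the field name
    // ("content-fr", "title-en", ...); anything else, including the
    // "-xx" fields, falls back to StandardAnalyzer.
    public class LanguageFieldAnalyzer extends Analyzer {
        private final Analyzer french = new FrenchAnalyzer(Version.LUCENE_30);
        private final Analyzer fallback = new StandardAnalyzer(Version.LUCENE_30);

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            if (fieldName.endsWith("-fr")) {
                return french.tokenStream(fieldName, reader);
            }
            return fallback.tokenStream(fieldName, reader);
        }
    }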
The drawback of this method, as I see it, is that the number of
fields in the index increases drastically, which in turn means that
building queries becomes rather cumbersome -- but still doable,
assuming (which is the case) that I know the exact list of languages
I'm dealing with. Also, it means that Lucene would be searching
non-existent fields in most documents, as I doubt many of them would
contain *all* languages. But it keeps the complete information about
one document gathered in one place and requires searching only one
index.
As I said, I already implemented the first method some time ago and
it works fine. I only thought of the second one when I read about
PerFieldAnalyzerWrapper, which allows me to do just what I want in
the second method. Since my index won't be that big at first, I doubt
either architecture would prove much more efficient than the other;
however, I want to use a scalable design right from the start, so I
was wondering whether some Lucene gurus might give me some insight
into which approach they would consider better -- or whether there is
a different, much better technique I haven't thought of.
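For concreteness, the PerFieldAnalyzerWrapper route would look
roughly like this (Lucene 3.x again; FrenchAnalyzer from the contrib
analyzers, field names as in my example):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // The wrapper falls back to the default analyzer for any field it
    // doesn't know about, which conveniently covers "content-xx".
    // The same wrapper would then be passed to both the IndexWriter and
    // the QueryParser so that indexing and searching stay consistent.
    public static PerFieldAnalyzerWrapper buildAnalyzer() {
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        analyzer.addAnalyzer("content-en", new StandardAnalyzer(Version.LUCENE_30));
        analyzer.addAnalyzer("content-fr", new FrenchAnalyzer(Version.LUCENE_30));
        return analyzer;
    }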