Hi, thanks Paul for your input. I'm gonna try the "localized field" variant and see how it works for me.
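Concretely, I'm planning something along these lines -- just a sketch against the Lucene 3.0 API, with the contrib FrenchAnalyzer/GermanAnalyzer standing in for whatever per-language analyzers I end up choosing, and the field names from my earlier mail:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class LocalizedFieldAnalyzer {

    /** Maps each localized field to its language's analyzer; unmapped fields
     *  (e.g. "content-xx" for "no language recognized") fall back to the
     *  default StandardAnalyzer. The same wrapper gets passed to the
     *  IndexWriter and to the query parser, so indexing and searching
     *  stay consistent. */
    public static Analyzer create() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("title-fr",   new FrenchAnalyzer(Version.LUCENE_30));
        perField.put("content-fr", new FrenchAnalyzer(Version.LUCENE_30));
        perField.put("title-de",   new GermanAnalyzer(Version.LUCENE_30));
        perField.put("content-de", new GermanAnalyzer(Version.LUCENE_30));
        return new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_30), perField);
    }
}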
I think your idea of automatically boosting the user language is neat, but it should definitely be possible to disable this boosting... Most users have no idea about the language settings in their browser, which drive the contents of the "Accept-Language" header, and e.g. here in Switzerland there are many foreigners whose preferred language is not French, German or Italian, so forcing a boost on the user could definitely result in a poor user experience.

Does anyone have technical arguments for why one method (several indices) or the other (localized fields in a single index) might be better?

Cheers,
David


----- Original Message ----
From: Paul Libbrecht <p...@activemath.org>
To: java-user@lucene.apache.org
Sent: Wed, March 31, 2010 10:00:14 PM
Subject: Re: Designing a multilingual index

David,

I'm doing exactly that. And I think there's one crucial advantage aside: multilingual queries. If your user requests "segment", you have no way to know which language he is searching in; erm, well, you have the user language(s) (through the browser Accept-Language header, for example), so you'll understand he meant to search in French but would also accept matches in other languages, just less boosted.

So I "expand" the query from "segment" in a French environment to:

  title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-fr:segment^1.2 wor text-en:segment^1.1

(wor is my name for the weighted-or, which is simply what a "should" boolean query gives you.)

Surprisingly, I haven't seen many people talk about "query expansion", but I think it is rather systematic and could become more a part of the culture of search engines...

paul

On 31 March 2010 at 18:20, David Vergnaud wrote:

> The second method I've thought of is to have all languages in the same index
> and use different analyzers on fields that require analysis. To do that, I
> was thinking of extending the names of the fields with the names of the
> languages -- e.g. "content-en" vs "content-fr" vs "content-xx" (for "no
> language recognized"). Then, using a customized analyzer, the name of the
> field would be parsed in the tokenStream method and the proper
> language-dependent analyzer would be selected.
> The drawback of this method, as I see it, is that the number of fields in
> the index increases drastically, which in turn means that building queries
> becomes rather cumbersome -- but still doable, assuming (as is the case)
> that I know the exact list of languages I'm dealing with. It also means
> that Lucene would be searching non-existent fields in most documents, as I
> doubt many of them would contain *all* languages. But it keeps the complete
> information about one document gathered in one place and requires searching
> only one index.
>
> As I said, I already implemented the first method some time ago and it
> works fine. I only thought of the second one when I read about
> PerFieldAnalyzerWrapper, which does just what I want for the second method.
> Since my index won't be that big at first, I doubt either architecture
> would prove much more efficient than the other; however, I want a scalable
> design right from the start, so I was wondering whether some Lucene gurus
> might give me some insights as to which approach they consider better -- or
> whether there might be a different, much better technique I haven't thought
> of.
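P.S. For the archives, here's roughly how I'd code Paul's weighted-or expansion, with the user-language boost made switchable per my point above. Again just a sketch against the Lucene 3.x API; the boost values, language list and field names are illustrative, not anything Paul prescribed:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MultilingualQueryExpander {

    private static final String[] LANGS = { "en", "fr", "de", "it", "xx" };

    /** Expands one already-analyzed term over all localized fields as a
     *  "weighted or": every clause is SHOULD, titles are boosted over
     *  content, and, if enabled, the user's language over the others. */
    public static Query expand(String term, String userLang,
                               boolean boostUserLang) {
        BooleanQuery query = new BooleanQuery();
        for (String lang : LANGS) {
            float langBoost =
                (boostUserLang && lang.equals(userLang)) ? 1.2f : 1.0f;

            Query title = new TermQuery(new Term("title-" + lang, term));
            title.setBoost(1.2f * langBoost);   // e.g. title-fr:segment^1.44
            query.add(title, Occur.SHOULD);

            Query content = new TermQuery(new Term("content-" + lang, term));
            content.setBoost(langBoost);        // e.g. content-fr:segment^1.2
            query.add(content, Occur.SHOULD);
        }
        return query;
    }
}

Calling expand("segment", "fr", false) would then treat all languages equally, which is what I'd want as the default whenever the Accept-Language header can't be trusted.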
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------