David,

I'm doing exactly that.
And I think there's one crucial advantage besides: multilingual queries. If your user searches for "segment", you have no way to know which language they are searching in. Well, you do have the user's language(s) (through the browser's Accept-Language header, for example), so you can assume they meant to search in French but would also accept matches in other languages, just boosted less.

So I "expand" the query from "segment" in a French environment to:
title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-fr:segment^1.2 wor text-en:segment^1.1 ...
("wor" is my name for the weighted OR, which is just what a "should" boolean clause gives you.)
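Roughly, that expansion could be sketched as plain string-building like this (the field names, base boosts, and user-language multiplier here are illustrative values, not my exact production numbers; my real code builds boolean queries directly):

```java
import java.util.Locale;

public class QueryExpander {

    // Expand a single term into a weighted-OR query string across
    // per-language fields, boosting the user's own language higher.
    static String expand(String term, String[] langs, String userLang) {
        String[] fields = {"title", "text"};
        double[] baseBoosts = {1.2, 1.1};  // per-field base boost (illustrative)
        double userLangMultiplier = 1.2;   // extra weight for the user's language
        StringBuilder sb = new StringBuilder();
        for (int f = 0; f < fields.length; f++) {
            for (String lang : langs) {
                if (sb.length() > 0) sb.append(" OR ");
                double boost = baseBoosts[f];
                if (lang.equals(userLang)) boost *= userLangMultiplier;
                sb.append(fields[f]).append('-').append(lang)
                  .append(':').append(term)
                  .append('^').append(String.format(Locale.ROOT, "%.2f", boost));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In a French environment, the French fields get the extra boost.
        System.out.println(expand("segment", new String[]{"fr", "en"}, "fr"));
        // title-fr:segment^1.44 OR title-en:segment^1.20 OR text-fr:segment^1.32 OR text-en:segment^1.10
    }
}
```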

Surprisingly, I haven't seen many people talk about "query expansion", but I think it should be rather systematic, and it could become more a part of the culture of search engines...

paul


On 31 March 2010, at 18:20, David Vergnaud wrote:

The second method I've thought of is to have all languages in the same index and use different analyzers on fields that require analysis. In order to do that, I was thinking of extending the names of the fields with the names of the languages -- e.g. "content-en" vs "content-fr" vs "content-xx" (for "no language recognized"). Then, using a customized analyzer, the field name would be parsed in the tokenStream method and the proper language-dependent analyzer would be selected. The drawback of this method, as I see it, is that the number of fields in the index increases drastically, which in turn means that building queries becomes rather cumbersome -- but still doable, assuming (which is the case) that I know the exact list of languages I'm dealing with. Also, it means that Lucene would be searching in non-existent fields in most documents, as I doubt many of them would contain *all* languages. But it keeps the complete information about one document gathered in one place and requires searching only one index.
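(For reference, the second method's setup could look roughly like this -- a sketch assuming Lucene 3.x, where PerFieldAnalyzerWrapper is in core and FrenchAnalyzer comes from the contrib analyzers; the field names match the "content-xx" convention above:)

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Default analyzer handles fields with no recognized language ("content-xx").
Analyzer fallback = new StandardAnalyzer(Version.LUCENE_30);
PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(fallback);
// One analyzer per language-suffixed field:
wrapper.addAnalyzer("content-en", new StandardAnalyzer(Version.LUCENE_30));
wrapper.addAnalyzer("content-fr", new FrenchAnalyzer(Version.LUCENE_30));
// Pass `wrapper` to the IndexWriter (and to the QueryParser at search time)
// so each field is analyzed with its own language-specific analyzer.
```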

As I said, I've already implemented the first method some time ago and it works fine. I've only just thought about the second one when I read about this PerFieldAnalyzerWrapper, which allows doing just what I want in the second method. Since my index won't be that big at first, I doubt either architecture would prove much more efficient than the other; however, I want to use a scalable design right from the start, so I was wondering whether some Lucene gurus might give me some insights as to which would be the better approach in their eyes -- or whether there might be a different, much better technique I haven't thought of.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
