David,
I'm doing exactly that.
And I think there's one crucial advantage besides: multilingual queries.
If your user searches for "segment" you have no way of knowing which
language he is searching in; well, you do have the user's language(s)
(through the browser Accept-Language header, for example), so you can
assume he meant to search in French but would also accept matches in
other languages, just boosted less.
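If your front end is a servlet, the container already parses that
header for you; here's a minimal sketch (the helper name
acceptedLanguages is just mine):

    import java.util.ArrayList;
    import java.util.Enumeration;
    import java.util.List;
    import java.util.Locale;
    import javax.servlet.http.HttpServletRequest;

    // The servlet container builds these from the Accept-Language header;
    // getLocales() is ordered by decreasing user preference.
    public static List<Locale> acceptedLanguages(HttpServletRequest request) {
        List<Locale> locales = new ArrayList<Locale>();
        for (Enumeration<?> e = request.getLocales(); e.hasMoreElements();) {
            locales.add((Locale) e.nextElement());
        }
        return locales;
    }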
So I "expand" the query from "segment" in a french environment to:
title-fr:segment^1.44 wor title-en:segment^1.2 ... wor text-
fr:segment:1.2 wor text-en:segment:1.1 ...
(wor is my naming of the weighted-or which is the normal thing of a
"should" boolean query)
Surprisingly, I haven't seen many people talk about "query expansion",
but I think the need for it is fairly systematic and it could become
more a part of the culture of search engines...
paul
On 31 March 2010, at 18:20, David Vergnaud wrote:
The second method I've thought of is to have all languages in the
same index and use different analyzers on fields that require
analysis. In order to do that, I was thinking of extending the field
names with the names of the languages -- e.g. "content-en" vs.
"content-fr" vs. "content-xx" (for "no language recognized"). Then,
using a customized analyzer, the field name would be parsed in the
tokenStream method and the proper language-dependent analyzer would
be selected.
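A bare-bones sketch of what I have in mind (Lucene 3.x; the class
name and the hard-coded language check are only illustrative, and
FrenchAnalyzer comes from the contrib analyzers):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // Dispatches on the language suffix of the field name
    // ("content-fr", "title-en", ...); anything else, including the
    // "-xx" fields, falls back to StandardAnalyzer.
    public class LanguageFieldAnalyzer extends Analyzer {
        private final Analyzer french = new FrenchAnalyzer(Version.LUCENE_30);
        private final Analyzer fallback = new StandardAnalyzer(Version.LUCENE_30);

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            if (fieldName.endsWith("-fr")) {
                return french.tokenStream(fieldName, reader);
            }
            return fallback.tokenStream(fieldName, reader);
        }
    }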
The drawback of this method, as I see it, is that the number of
fields in the index increases drastically, which in turn means that
building queries becomes rather cumbersome -- but still doable,
assuming (which is the case) that I know the exact list of languages
I'm dealing with. Also, it means that Lucene would be searching
non-existent fields in most documents, as I doubt many of them would
contain *all* languages. But it keeps the complete information about
one document gathered in one place and requires searching only one
index.
As I said, I already implemented the first method some time ago and
it works fine. I only thought of the second one when I read about
PerFieldAnalyzerWrapper, which allows me to do just what I want in
the second method. Since my index won't be that big at first, I doubt
either architecture would prove much more efficient than the other;
however, I want to use a scalable design right from the start, so I
was wondering whether some Lucene gurus might give me some insight
into which approach they would consider better -- or whether there is
a different, much better technique I haven't thought of.
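For concreteness, the PerFieldAnalyzerWrapper route would look
roughly like this (Lucene 3.x again; FrenchAnalyzer from the contrib
analyzers, field names as in my example):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // The wrapper falls back to the default analyzer for any field it
    // doesn't know about, which conveniently covers "content-xx".
    // The same wrapper would then be passed to both the IndexWriter and
    // the QueryParser so that indexing and searching stay consistent.
    public static PerFieldAnalyzerWrapper buildAnalyzer() {
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        analyzer.addAnalyzer("content-en", new StandardAnalyzer(Version.LUCENE_30));
        analyzer.addAnalyzer("content-fr", new FrenchAnalyzer(Version.LUCENE_30));
        return analyzer;
    }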