Re: Designing a multilingual index

Paul Libbrecht Tue, 03 Jan 2012 06:29:46 -0800

Heikki,

it does solve your main concern: a term in Lucene is a pair of a token and a 
field name.
The term frequency is thus the frequency of a token in a field, and the 
document frequency behind the IDF is counted per field in the same way.


So the term-frequency of text-stemmed-de:firewall is independent of the 
term-frequency of text-stemmed-en:firewall (for example).
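
To make that concrete, here is a minimal sketch in plain Python (not Lucene 
code). The toy corpus, the helper names and the textbook log(N / df) formula 
are all assumptions for illustration; Lucene's own similarity formula differs 
in detail. It only shows that counting per (field, token) pair keeps the 
German and English statistics apart.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus: each document keeps its text in one per-language field.
DOCS = [
    {"text-stemmed-en": ["firewall", "rule", "network"]},
    {"text-stemmed-en": ["network", "cable"]},
    {"text-stemmed-en": ["router", "network"]},
    {"text-stemmed-de": ["firewall", "regel"]},
    {"text-stemmed-de": ["firewall", "netzwerk"]},
]


def document_frequencies(docs):
    """Count how many documents contain each (field, token) pair."""
    df = defaultdict(int)
    for doc in docs:
        for field, tokens in doc.items():
            for token in set(tokens):  # count each document at most once
                df[(field, token)] += 1
    return dict(df)


def textbook_idf(df, num_docs, field, token):
    """Textbook IDF, log(N / df); an assumption, not Lucene's exact formula."""
    return math.log(num_docs / df[(field, token)])


df = document_frequencies(DOCS)
idf_en = textbook_idf(df, len(DOCS), "text-stemmed-en", "firewall")
idf_de = textbook_idf(df, len(DOCS), "text-stemmed-de", "firewall")
print(df[("text-stemmed-en", "firewall")], df[("text-stemmed-de", "firewall")])
print(idf_en > idf_de)
```

In this toy data "firewall" is rarer in the English field than in the German 
one, so it gets the higher IDF there, although both live in the same index.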

But with the query expansion mechanism, it is likely that both term queries 
will be present and both will contribute to the score, which I think is correct.
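
A rough sketch of such an expansion, again in plain Python rather than against 
the Lucene API. The function names, the 0.3 weight and the lack of any 
stemming are assumptions of mine; the field naming follows the 
text-stemmed-xx pattern used above, and the output is only a readable 
field:token^weight notation.

```python
def expand_query(text, preferred, others, other_weight=0.3):
    """Expand user text into (field, token, weight) clauses.

    The preferred language gets full weight; the other languages get a
    lower, hypothetical weight so their matches still contribute.
    """
    clauses = []
    for token in text.lower().split():
        clauses.append((f"text-stemmed-{preferred}", token, 1.0))
        for lang in others:
            clauses.append((f"text-stemmed-{lang}", token, other_weight))
    return clauses


def to_query_string(clauses):
    """Render the clauses in a field:token^weight notation for display."""
    return " ".join(f"{field}:{token}^{weight}" for field, token, weight in clauses)


clauses = expand_query("firewall", preferred="de", others=["en"])
print(to_query_string(clauses))
```

Each clause is then scored with the statistics of its own field, and the 
weight decides how much the non-preferred language adds to the total.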

paul


On 3 Jan 2012, at 15:06, heikki wrote:
> 
>> The important bit is to use query-expansion.
>> Given a query of the user (with params or not, with text-queries), expand
>> it to a query where the "normal text" is expected to be in the right
>> language, but maybe also in one of the other languages (that
>> the browser says, that your platform supports), with less weight of
>> course.
> 
> something like that we do now in a single index solution - results in the
> requested language are boosted enough so they're always on top
> 
> I don't think though that this addresses what is my main point: the
> frequency of terms in different domains (in this case, different languages)
> is different for each domain. This means that if the domains are chunked
> together in one index, the IDF value for a term is less "accurate" than if
> multiple, separate indexes were used. A term is more or less frequent in
> one domain or another, for a reason. Relevance ranking is impacted by
> that, and is more accurate if separate indexes are used -- I think this
> seems logical.
> 
> I just don't know how much impact it really has, and whether it is worth
> dealing with it by presenting separate result sets from separate index
> searches.
