Re: Designing a multilingual index

Paul Libbrecht Tue, 03 Jan 2012 07:11:29 -0800

I think the idf is also computed per term (field name plus token), not per
bare token. Maybe an expert can confirm this, or we will have to devise a test.
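
A minimal sketch of what such a test could check, in plain Python rather than
against Lucene itself, so it only models the belief and proves nothing about
Lucene's actual behaviour. The helper names (build_doc_freq, idf) and the
sample documents are hypothetical; the idf formula is the classic
1 + ln(numDocs / (docFreq + 1)), assumed here for illustration.

```python
import math
from collections import defaultdict


def build_doc_freq(docs):
    """Map each (field, token) term to the number of documents containing it."""
    doc_freq = defaultdict(int)
    for doc in docs:
        for field, text in doc.items():
            # set() so a token repeated inside one document counts only once
            for token in set(text.lower().split()):
                doc_freq[(field, token)] += 1
    return doc_freq


def idf(doc_freq, num_docs, field, token):
    """Assumed classic formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq[(field, token)] + 1))


# Hypothetical documents: three English ones and one German one in ONE index.
docs = [
    {"text-stemmed-en": "firewall rule"},
    {"text-stemmed-en": "firewall log"},
    {"text-stemmed-en": "firewall port"},
    {"text-stemmed-de": "firewall regel"},
]

df = build_doc_freq(docs)
# Assumption: num_docs counts every document in the index, whatever its fields.
idf_en = idf(df, len(docs), "text-stemmed-en", "firewall")
idf_de = idf(df, len(docs), "text-stemmed-de", "firewall")
print("en:", df[("text-stemmed-en", "firewall")], round(idf_en, 3))
print("de:", df[("text-stemmed-de", "firewall")], round(idf_de, 3))
```

If document frequency is kept per (field, token), as modelled here, the
German-field "firewall" gets a higher idf than the English-field one even
though both live in the same index.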


paul


On 3 Jan 2012, at 15:43, heikki wrote:

> hi Paul,
> 
> Yes, but my concern isn't the term frequency; it is the inverse document
> frequency (IDF), which is also used in the relevance score and which takes
> into account all documents in the index. In this way the relevance score of
> one document is influenced by the contents of all other documents in the
> same index. This is why it seems logical to me that if different domains
> use separate indexes, the relevance scoring is more accurate.
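
To make the concern concrete, here is a minimal sketch in plain Python (not
Lucene). It assumes both languages share a single text field, so the same
token from different languages collides; with per-language fields the document
frequencies would be kept apart. The corpora, the function names and the
formula 1 + ln(numDocs / (docFreq + 1)) are assumptions for illustration only.

```python
import math


def idf(num_docs, doc_freq):
    """Assumed classic formula: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(docs, token):
    """Count the documents whose text contains the token."""
    return sum(1 for text in docs if token in text.lower().split())


# Hypothetical toy corpora: "die" is a common article in German
# and a rare verb in English.
english = ["the hard disk may die", "the firewall log", "the open port"]
german = ["die firewall regel", "die offene tür", "die festplatte"]
combined = english + german  # both languages lumped into one index

token = "die"
idf_en = idf(len(english), doc_freq(english, token))
idf_de = idf(len(german), doc_freq(german, token))
idf_all = idf(len(combined), doc_freq(combined, token))

print(f"en only: {idf_en:.3f}  de only: {idf_de:.3f}  combined: {idf_all:.3f}")
```

In this toy case the combined idf lands between the two per-language values,
so "die" looks rarer than it is for German documents and more common than it
is for English ones. How much that matters on a real collection is exactly the
open question.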
> 
> 
> Kind regards,
> Heikki Doeleman
> 
> 
> 
> 
> On Tue, Jan 3, 2012 at 3:29 PM, Paul Libbrecht <[email protected]> wrote:
> 
>> Heikki,
>> 
>> it does solve your main concern: a term in Lucene is a pair of a field
>> name and a token.
>> The term frequency is, thus, the frequency of a token in a field.
>> 
>> So the term-frequency of text-stemmed-de:firewall is independent of the
>> term-frequency of text-stemmed-en:firewall (for example).
>> 
>> But with the query expansion mechanism, it is likely that both
>> term-queries will be present and both will contribute to the score, which
>> I think is correct.
>> 
>> paul
>> 
>> 
>> On 3 Jan 2012, at 15:06, heikki wrote:
>>> 
>>>> The important bit is to use query-expansion.
>>>> Given a query of the user (with params or not, with text-queries),
>>>> expand it to a query where the "normal text" is expected to be in the
>>>> right language, but maybe also in one of the other languages (that
>>>> the browser says, that your platform supports), with less weight of
>>>> course.
>>> 
>>> We do something like that now in a single-index solution: results in the
>>> requested language are boosted enough that they are always on top.
>>>
>>> I don't think, though, that this addresses my main point: the frequency
>>> of terms differs between domains (in this case, between languages). This
>>> means that if the domains are lumped together in one index, the IDF value
>>> for a term is less "accurate" than if multiple, separate indexes were
>>> used. A term is more or less frequent in one domain or another for a
>>> reason. Relevance ranking is affected by that, and is more accurate if
>>> separate indexes are used. This seems logical to me.
>>>
>>> I just don't know how much impact it really has, and whether it is worth
>>> dealing with by presenting separate result sets from separate index
>>> searches.
>> 
>> 


