Adrien Grand created LUCENE-8218:
------------------------------------

             Summary: Good default weight for static scoring signals?
                 Key: LUCENE-8218
                 URL: https://issues.apache.org/jira/browse/LUCENE-8218
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


On LUCENE-8197, the question was raised whether we could come up with a good 
default value for the weight of a static scoring factor in the final score, 
which would make the functionality much easier to use. The current default is 1.

One question that still looks open is whether these weights should be the same 
for all queries. Some papers made the weight effectively query-dependent (they 
typically normalized query-dependent scores rather than scaling the weight 
based on the query-dependent scores, but this has the same effect in the end) 
while others, e.g. the paper that LUCENE-8197 is based on, just used a static 
weight for all queries. In both cases, optimal values for the weight were 
computed via training.
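
The equivalence mentioned in parentheses can be checked with a small sketch. It
assumes the final score is additive, that normalization means dividing the
query-dependent score by a per-query constant (here the maximum over the
candidates, which is only one possible choice), and it uses made-up scores. None
of the names below come from Lucene.

```python
def rank(scores):
    """Return document ids ordered by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical query-dependent scores (e.g. BM25) and static signals (e.g. popularity).
query_scores = {"a": 12.0, "b": 9.0, "c": 3.0}
static_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
weight = 4.0

# Per-query normalization constant (assumption: the maximum query-dependent score).
norm = max(query_scores.values())

# Option 1: normalize the query-dependent score, keep a fixed weight.
normalized = {d: query_scores[d] / norm + weight * static_scores[d]
              for d in query_scores}

# Option 2: keep the raw query-dependent score, scale the weight per query.
scaled = {d: query_scores[d] + (weight * norm) * static_scores[d]
          for d in query_scores}

# Option 2 is Option 1 multiplied by `norm`, so both produce the same ordering.
assert rank(normalized) == rank(scaled)
```

Since the second score is just the first one multiplied by a positive per-query
constant, the ordering of documents within a query cannot change.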

Another question is whether we should make the default weight depend on the 
similarity that is being used.

In the end, there is also a possibility that 1 is not a bad default at all. For 
instance, if the weight of a term is log(1/x) where x is a fraction like 
df/docCount or ttf/sumTtf, then the static scoring factor has the same weight 
as a term that appears in about 1/e ~ 37% of the corpus. In the particular case 
of BM25, it's actually closer to 1/(1+e) ~ 27%. The more I think about this 
issue, the more I'm leaning toward that side, but I'd be curious to hear other 
opinions on this topic.
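
The arithmetic behind these two percentages can be reproduced with a short
sketch. It assumes the term weight is the positive quantity log(1/x), and for
BM25 it assumes the classic idf form log((N - n + 0.5) / (n + 0.5)) with the
term-frequency part of the score ignored. Other idf variants would give a
different figure. The function names and the corpus size are hypothetical.

```python
import math

def plain_idf(fraction):
    """Weight log(1/x) of a term matching `fraction` of the corpus."""
    return math.log(1.0 / fraction)

def classic_bm25_idf(doc_freq, doc_count):
    """Classic BM25 idf, log((N - n + 0.5) / (n + 0.5)); tf part ignored (assumption)."""
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Fractions of the corpus at which the term weight equals a static weight of 1.
plain_break_even = 1.0 / math.e          # about 0.37
bm25_break_even = 1.0 / (1.0 + math.e)   # about 0.27

doc_count = 10_000_000  # hypothetical corpus size

# Both values come out very close to 1.
print(plain_idf(plain_break_even))
print(classic_bm25_idf(bm25_break_even * doc_count, doc_count))
```

Under these assumptions, a static signal with weight 1 contributes as much as a
fairly common query term, which is the sense in which 1 may be a reasonable
default.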



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

