Adrien Grand created LUCENE-8218:
------------------------------------
Summary: Good default weight for static scoring signals?
Key: LUCENE-8218
URL: https://issues.apache.org/jira/browse/LUCENE-8218
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
On LUCENE-8197, the question was raised whether we could come up with a good
default value for the weight of a static scoring factor into the final score,
which would make the functionality much easier to use. Currently it is 1.
One question that looks open for instance is whether these weights should be
the same for all queries or not. Some papers said yes (they typically
normalized query-dependent scores rather than scaling the weight based on the
query-dependent scores, but this has the same effect in the end) while others,
eg. the paper that LUCENE-8197 is based on, just used a static weight for all
queries. In both cases, optimal values for the weight were computed via
training.
Another question is whether we should make the default weight depend on the
similarity that is being used.
In the end, there is also a possibility that 1 is not a bad default at all. For
instance if the weight of a term is log(x) where x is a fraction like
df/docCount or ttf/sumTtf then it means that the static scoring factor has a
weight that is the same as a term that appears in about 1/e ~ 37% of the
corpus. In the particular case of BM25, it's actually closer to 1/(1+e) ~ 27%.
The more I think about this issue, the more I'm erring on that side but I'd be
curious to hear other opinions on this topic.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]