Adrien Grand created LUCENE-8563:
------------------------------------

             Summary: Remove k1+1 from the numerator of BM25Similarity
                 Key: LUCENE-8563
                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand


Our current implementation of BM25 computes:
{code:java}
boost * IDF * (k1+1) * tf / (tf + norm)
{code}
As (k1+1) is a constant, it is the same for every term and doesn't modify 
ordering. It is often omitted, and I found out that the paper "The 
Probabilistic Relevance Framework: BM25 and Beyond" by Robertson (BM25's 
author) and Zaragoza even describes adding (k1+1) to the numerator as a 
variant whose benefit is to be more comparable with Robertson/Sparck-Jones 
weighting, which we don't care about.
{quote}A common variant is to add a (k1 + 1) component to the
 numerator of the saturation function. This is the same for all
 terms, and therefore does not affect the ranking produced.
 The reason for including it was to make the final formula
 more compatible with the RSJ weight used on its own
{quote}
Should we remove it from BM25Similarity as well?
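
To illustrate the ordering argument, here is a minimal, self-contained sketch 
(not the actual BM25Similarity code). It assumes the usual BM25 length 
normalization norm = k1 * (1 - b + b * dl / avgdl) and uses made-up IDFs, term 
frequencies and document lengths; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

class Bm25NumeratorSketch {
    static final double K1 = 1.2;  // Lucene's default k1
    static final double B = 0.75;  // Lucene's default b

    // Length normalization term (assumed form): k1 * (1 - b + b * dl / avgdl)
    static double norm(double docLen, double avgDocLen) {
        return K1 * (1 - B + B * docLen / avgDocLen);
    }

    // Per-term score; includeFactor toggles the (k1+1) constant in the numerator
    static double termScore(double boost, double idf, double tf, double norm,
                            boolean includeFactor) {
        double saturation = tf / (tf + norm);
        return boost * idf * (includeFactor ? (K1 + 1) : 1.0) * saturation;
    }

    // Sum of per-term scores for one document (boost fixed to 1)
    static double docScore(double[] idfs, double[] tfs, double norm,
                           boolean includeFactor) {
        double sum = 0;
        for (int t = 0; t < idfs.length; t++) {
            sum += termScore(1.0, idfs[t], tfs[t], norm, includeFactor);
        }
        return sum;
    }

    // Document indices sorted by descending score
    static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double[] idfs = {2.0, 0.5};                 // made-up IDFs, two-term query
        double[][] tfs = {{1, 4}, {3, 0}, {2, 2}};  // made-up tf per document
        double[] docLens = {80, 120, 100};          // made-up document lengths
        double avgDocLen = 100;

        double[] scoresWith = new double[tfs.length];
        double[] scoresWithout = new double[tfs.length];
        for (int d = 0; d < tfs.length; d++) {
            double n = norm(docLens[d], avgDocLen);
            scoresWith[d] = docScore(idfs, tfs[d], n, true);
            scoresWithout[d] = docScore(idfs, tfs[d], n, false);
        }
        System.out.println("with (k1+1):    " + Arrays.toString(ranking(scoresWith)));
        System.out.println("without (k1+1): " + Arrays.toString(ranking(scoresWithout)));
    }
}
```

Both rankings come out identical because every document score is multiplied by 
the same constant (k1+1); only the absolute score values change.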

A side-effect that I'm interested in is that integrating other score 
contributions (e.g. via oal.document.FeatureField) would be a bit easier to 
reason about. For instance, a weight of 3 in FeatureField#newSaturationQuery 
would have a similar impact to a term whose IDF is 3 (and thus docFreq ~= 5%) 
rather than a term whose IDF is 3/(k1 + 1).
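
The arithmetic behind this comparison can be sketched as follows. This is a 
rough illustration only: it assumes the BM25 IDF formula 
ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and compares the upper 
bound of a term's score (reached as tf grows and tf / (tf + norm) approaches 1) 
with a saturation weight. The class and method names are hypothetical, and 
no Lucene API is called.

```java
class IdfWeightSketch {
    // BM25 IDF (assumed form): ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Upper bound of a term's score with boost 1, since tf / (tf + norm) < 1
    static double maxTermScore(double idf, double k1, boolean includeFactor) {
        return idf * (includeFactor ? (k1 + 1) : 1.0);
    }

    public static void main(String[] args) {
        double k1 = 1.2;           // Lucene's default k1
        long docCount = 1_000_000; // made-up index size
        long docFreq = 50_000;     // term present in 5% of documents

        double idfValue = idf(docFreq, docCount);
        System.out.println("IDF at 5% docFreq:        " + idfValue);
        System.out.println("max score with (k1+1):    " + maxTermScore(idfValue, k1, true));
        System.out.println("max score without (k1+1): " + maxTermScore(idfValue, k1, false));
        // IDF a term needs today for its maximum score to equal a weight of 3
        System.out.println("IDF matching weight 3 today: " + 3 / (k1 + 1));
    }
}
```

With docFreq at 5% the IDF is about ln(20), i.e. close to 3. Without the (k1+1) 
factor such a term tops out near 3, the same ceiling as a saturation weight of 
3; with the factor, the matching term would instead need an IDF of 3/(k1 + 1).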



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
