Shayan Tabrizi created LUCENE-7480:
--------------------------------------
Summary: Wrong Formula in LMDirichletSimilarity
Key: LUCENE-7480
URL: https://issues.apache.org/jira/browse/LUCENE-7480
Project: Lucene - Core
Issue Type: Bug
Reporter: Shayan Tabrizi
It seems that the "score" method of LMDirichletSimilarity is only called for a
query term if that term occurs in the document. Otherwise, at line 389 of
BooleanWeight (Lucene 6.2.0), subScorer is null, so the clause is not added to
the optional list and is never scored.
However, the original LM formula
(http://www.stat.uchicago.edu/~lafferty/pdf/smooth-tois.pdf, formula 6)
contains the term "n log a_d", where n is the number of query terms. Therefore,
a "log a_d" must be added to the final score even for query terms that are not
present in the document.
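
For reference, the formula cited above can be written as follows. This is a
transcription from the linked paper as I understand it, not from the Lucene
source; the notation (c(w;d) for the count of w in d, p_s for the smoothed
probability of seen words, p(w|C) for the collection model, mu for the
Dirichlet parameter) follows the paper and is an assumption relative to this
report.

```latex
\log p(q \mid d)
  = \sum_{i:\, c(q_i;d) > 0} \log \frac{p_s(q_i \mid d)}{a_d \, p(q_i \mid C)}
  + n \log a_d
  + \sum_{i=1}^{n} \log p(q_i \mid C)

% Dirichlet smoothing:
p_s(w \mid d) = \frac{c(w;d) + \mu \, p(w \mid C)}{|d| + \mu},
\qquad
a_d = \frac{\mu}{|d| + \mu}
```

The last sum does not depend on the document and can be dropped for ranking,
but "n log a_d" depends on the document length |d| and therefore cannot.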
But the implementation of LMDirichletSimilarity adds "log a_d" inside the
"score" method, so it is only added to the final score for the query terms
that are present in the document.
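
The difference can be illustrated with a small standalone sketch. This is not
Lucene code: the class and method names (DirichletLmSketch, fullScore,
matchedOnlyScore) are hypothetical, and the per-term expression
log(1 + tf / (mu * p(w|C))) + log a_d is my assumption of what the Dirichlet
score reduces to. It only shows that the two variants differ by one "log a_d"
per query term missing from the document.

```java
// Hypothetical sketch, independent of Lucene; all names are illustrative only.
final class DirichletLmSketch {

    // Matched-term part: log(1 + c(w;d) / (mu * p(w|C))).
    static double matchedTerm(double tf, double collectionProb, double mu) {
        return Math.log(1.0 + tf / (mu * collectionProb));
    }

    // log a_d with a_d = mu / (|d| + mu); always negative for |d| > 0.
    static double logAlpha(double docLen, double mu) {
        return Math.log(mu / (docLen + mu));
    }

    // Formula as cited in the report: matched terms plus n * log a_d,
    // where n is the number of query terms (document-independent sum dropped).
    static double fullScore(double[] tfs, double[] collectionProbs,
                            double docLen, double mu) {
        double score = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            if (tfs[i] > 0) {
                score += matchedTerm(tfs[i], collectionProbs[i], mu);
            }
        }
        return score + tfs.length * logAlpha(docLen, mu);
    }

    // Behaviour described in the report: log a_d is added only for
    // query terms that occur in the document.
    static double matchedOnlyScore(double[] tfs, double[] collectionProbs,
                                   double docLen, double mu) {
        double score = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            if (tfs[i] > 0) {
                score += matchedTerm(tfs[i], collectionProbs[i], mu)
                       + logAlpha(docLen, mu);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        double[] tfs = {2, 0};          // second query term is absent
        double[] probs = {0.01, 0.02};  // assumed collection probabilities
        double docLen = 100, mu = 2000;
        System.out.println("full:         " + fullScore(tfs, probs, docLen, mu));
        System.out.println("matched only: " + matchedOnlyScore(tfs, probs, docLen, mu));
    }
}
```

Because log a_d becomes more negative as the document gets longer, leaving it
out for unmatched terms favours long documents that match only some of the
query terms, relative to the cited formula.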
This can worsen retrieval results compared to the correct formula. I tried to
fix this myself, but because of the many "final" methods and classes, I was
not successful. Please check the problem and fix it if you agree, and please
also tell me how I can work around it before a new release is published.