[jira] [Commented] (LUCENE-7480) Wrong Formula in LMDirichletSimilarity

Michael McCandless (JIRA) Fri, 07 Oct 2016 07:44:53 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555263#comment-15555263
 ]


Michael McCandless commented on LUCENE-7480:
--------------------------------------------

I am not familiar with {{LMDirichletSimilarity}} in particular, but there are 
two phases in general for a similarity.

Phase 1 is done up front by checking the term statistics across the entire 
index, in {{LMSimilarity.fillBasicStats}}.

Phase 2 is done per-segment, which is the code you are pointing to in 
{{BooleanWeight}}: when {{subScorer}} is {{null}} that means the requested term 
(or sub-query) never appears at all in the current segment.  But this is not 
supposed to alter how scoring works, since Phase 1 should have computed stats 
for all terms in the query.

Maybe you can make a test case showing that the score is incorrect in Lucene's 
implementation vs the original formula?

> Wrong Formula in LMDirichletSimilarity
> --------------------------------------
>
>                 Key: LUCENE-7480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7480
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Shayan Tabrizi
>
> It seems that LMDirichletSimilarity only calculates "score" method if the 
> term occurs in the document. Otherwise, in line 389 of BooleanWeight (Lucene 
> 6.2.0) subScorer becomes null, and thus the clause is not added to the 
> optional list in order to be scored.
> However, in the original formula of LM 
> (http://www.stat.uchicago.edu/~lafferty/pdf/smooth-tois.pdf, formula 6), we 
> have "n log a_d" (n is the number of query terms). Therefore, even for the 
> query terms not present in the document a "log a_d" must be added to the 
> final score.
> But the implementation of LMDirichletSimilarity adds "log a_d" to the score 
> in the "score" method, and therefore it is only added to the final score for 
> the query terms present in the document.
> This can worsen the retrieval results compared to the correct formula. I 
> tried to correct this for myself but because of the plenty of "final" methods 
> and classes, I was not successful. Please, check the problem and solve it if 
> approved, and also please tell me how I can correct it before a new release 
> is published.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-7480) Wrong Formula in LMDirichletSimilarity

Reply via email to