[ 
https://issues.apache.org/jira/browse/LUCENE-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083080#comment-13083080
 ] 

David Mark Nemeskey commented on LUCENE-3357:
---------------------------------------------

Apparently the Dirichlet method returns a negative score whenever tf / docLen < 
corpusTf / corpusLen. Unfortunately, the negative score can be arbitrarily 
large in magnitude, so adding a constant to the score does not fix it. Negative 
scores would be harmless if every document were scored: the function is 
monotone, so documents whose tf is 0 would always rank below those that contain 
the word. But IR engines do not work that way.
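
To make the sign condition concrete, here is a minimal sketch. It assumes the
usual rank-equivalent form of the Dirichlet-smoothed score,
log(1 + tf / (mu * p)) + log(mu / (docLen + mu)), with p = corpusTf / corpusLen.
The class name, the method signature and the values of mu and p are
illustrative assumptions, not the code on the branch.

```java
final class DirichletSignSketch {
    /** Hypothetical stand-alone Dirichlet score; p is corpusTf / corpusLen. */
    static double score(double tf, double docLen, double p, double mu) {
        return Math.log(1.0 + tf / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    public static void main(String[] args) {
        double mu = 2000.0; // assumed smoothing parameter
        double p = 0.01;    // assumed corpusTf / corpusLen

        // tf / docLen = 0.02 > p, so the score is positive.
        System.out.println(score(2, 100, p, mu));
        // tf / docLen = 0.001 < p, so the score is negative.
        System.out.println(score(1, 1000, p, mu));
        // With tf = 0 the score keeps falling as the document gets longer.
        System.out.println(score(0, 1_000, p, mu));
        System.out.println(score(0, 1_000_000, p, mu));
    }
}
```

Under this form the score is negative exactly when tf / docLen < p, and for
tf = 0 it depends only on docLen and mu.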

Having said that, I believe that we could simulate such a system. I don't know 
exactly how the query architecture works, but I presume the clauses that don't 
match a document are assigned a zero value. Now instead of this zero, the 
Scorer (or whatever class does this) could ask for a default value from the 
Similarity. In this case LMDirichletSimilarity could return score(stats, 0, 
Integer.MAX_VALUE), which is somewhere around -12.
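
A rough sketch of that idea follows. Everything in it is hypothetical: the
class, the defaultScore hook, the map-based document and query, and mu = 2000
are assumptions for illustration only, and the score uses the usual
rank-equivalent Dirichlet form rather than the code on the branch. The exact
default value depends on mu.

```java
import java.util.Map;

final class DefaultScoreSketch {
    static final double MU = 2000.0; // assumed smoothing parameter

    /** Assumed Dirichlet score; p is corpusTf / corpusLen for the term. */
    static double score(double tf, double docLen, double p) {
        return Math.log(1.0 + tf / (MU * p)) + Math.log(MU / (docLen + MU));
    }

    /** Hypothetical hook: what a non-matching clause would contribute. */
    static double defaultScore(double p) {
        return score(0, Integer.MAX_VALUE, p);
    }

    /** Sums clause scores, using defaultScore for terms the document lacks. */
    static double scoreDocument(Map<String, Integer> docTf, int docLen,
                                Map<String, Double> queryTermProb) {
        double total = 0.0;
        for (Map.Entry<String, Double> clause : queryTermProb.entrySet()) {
            Integer tf = docTf.get(clause.getKey());
            total += (tf == null)
                ? defaultScore(clause.getValue())
                : score(tf, docLen, clause.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> query = Map.of("lucene", 0.01);
        // The matching document scores below zero but above the default.
        System.out.println(scoreDocument(Map.of("lucene", 1), 1000, query));
        System.out.println(scoreDocument(Map.of(), 1000, query));
    }
}
```

With this default, a document that contains the term still ranks above one that
does not, even when its own clause score is negative.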

If we don't do this, we have three options:
1. add score(stats, 0, Integer.MAX_VALUE) to the score
2. if (score < 0) return 0
3. add corpusTf / corpusLen * docLen to tf

All three ensure a non-negative score, but each has its own disadvantage:
1. adds a pretty big constant to the score, which may not play well with the 
other parts of the query
2. some documents that contain the term get the same 0 score as documents that 
don't (though I cannot say this is not in line with the LM approach)
3. this introduces a transformation that is difficult to characterize
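
The three options can be sketched side by side. This is only an illustration
under stated assumptions: the score uses the usual rank-equivalent Dirichlet
form with mu = 2000, all names are hypothetical, and option 1 is read here as
shifting the score up by the magnitude of score(stats, 0, Integer.MAX_VALUE),
since that value is itself negative.

```java
final class NonNegativeDirichletSketch {
    static final double MU = 2000.0; // assumed smoothing parameter

    /** Assumed Dirichlet score; p is corpusTf / corpusLen. */
    static double score(double tf, double docLen, double p) {
        return Math.log(1.0 + tf / (MU * p)) + Math.log(MU / (docLen + MU));
    }

    /** Option 1 (as read above): shift by the tf = 0, maximum-length score. */
    static double shifted(double tf, double docLen, double p) {
        return score(tf, docLen, p) - score(0, Integer.MAX_VALUE, p);
    }

    /** Option 2: clamp negative scores to zero. */
    static double clamped(double tf, double docLen, double p) {
        return Math.max(0.0, score(tf, docLen, p));
    }

    /** Option 3: add corpusTf / corpusLen * docLen to tf before scoring. */
    static double boostedTf(double tf, double docLen, double p) {
        return score(tf + p * docLen, docLen, p);
    }

    public static void main(String[] args) {
        double tf = 1, docLen = 1000, p = 0.01; // tf / docLen < p
        System.out.println("raw:     " + score(tf, docLen, p));
        System.out.println("shifted: " + shifted(tf, docLen, p));
        System.out.println("clamped: " + clamped(tf, docLen, p));
        System.out.println("boosted: " + boostedTf(tf, docLen, p));
    }
}
```

Under the assumed form, option 3 simplifies algebraically to
log(1 + tf / (p * (docLen + mu))), which is zero when tf = 0 and positive
otherwise.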

For the time being, I'll go with 2, but we have to discuss this.

> Unit and integration test cases for the new Similarities
> --------------------------------------------------------
>
>                 Key: LUCENE-3357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3357
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/query/scoring
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>            Priority: Minor
>              Labels: gsoc, gsoc2011, test
>             Fix For: flexscoring branch
>
>         Attachments: LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, 
> LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, LUCENE-3357.patch, 
> LUCENE-3357.patch, LUCENE-3357.patch
>
>
> Write test cases to test the new Similarities added in 
> [LUCENE-3220|https://issues.apache.org/jira/browse/LUCENE-3220]. Two types of 
> test cases will be created:
>  * unit tests, in which mock statistics are provided to the Similarities and 
> the score is validated against hand calculations;
>  * integration tests, in which a small collection is indexed and then 
> searched using the Similarities.
> Performance tests will be performed in a separate issue.
