[ 
https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-6894:
---------------------------------
    Description: 
The DocIdSetIterator.cost() method returns an estimate of the number of 
matching docs. Currently conjunctions use the minimum cost and disjunctions 
use the sum of the costs; both estimates are too high.

The probability of a match is estimated by dividing the available cost() by 
the number of docs in the segment.

The conjunction probability is then the product of the inputs, and the 
disjunction probability follows from De Morgan's rule:
"not (A and B)" is the same as "(not A) or (not B)"
with the probability for "not" computed as 1 minus the input probability.

The independence that is assumed is normally not there. However, the cost() 
results are only used to order the input DISIs/Scorers for optimization, and 
for that I expect this assumption to work nicely.

  was:
The DocIdSetIterator.cost() method returns an estimate of the number of 
matching docs. Currently conjunctions use the minimum cost and disjunctions 
use the sum of the costs; both estimates are too high.

The probability of a match is estimated by dividing the available cost() by 
the number of docs in the segment.

The conjunction probability is then the product of the inputs, and the 
disjunction probability follows from De Morgan's rule:
"not (A and B)" is the same as "(not A) or (not B)"
with the probability for "not" computed as 1 minus the input probability.

The independence that is assumed is normally not there. However, for cost() 
computations only an ordering of the input DISIs/Scorers is needed, and for 
that I expect this assumption to work nicely.


> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6894
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6894
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimate of the number of 
> matching docs. Currently conjunctions use the minimum cost and disjunctions 
> use the sum of the costs; both estimates are too high.
> The probability of a match is estimated by dividing the available cost() by 
> the number of docs in the segment.
> The conjunction probability is then the product of the inputs, and the 
> disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> The independence that is assumed is normally not there. However, the cost() 
> results are only used to order the input DISIs/Scorers for optimization, and 
> for that I expect this assumption to work nicely.
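
For illustration, the estimate described above could be sketched roughly as
follows. This is only a minimal sketch and is not taken from the attached
patch: the class IndependentCostEstimate, its method names, and the maxDoc
parameter (standing in for the number of docs in the segment) are all
hypothetical. It assumes each input's match probability is cost / maxDoc,
capped at 1.

```java
// Hypothetical helper class; none of these names exist in Lucene.
class IndependentCostEstimate {

    // Estimated probability that a doc in the segment matches one input.
    // Assumption: cost is an estimate of matching docs, maxDoc is the segment size.
    static double matchProbability(long cost, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) cost / maxDoc);
    }

    // Conjunction: product of the input probabilities (independence assumed).
    static long conjunctionCost(long[] costs, int maxDoc) {
        double pAll = 1.0;
        for (long cost : costs) {
            pAll *= matchProbability(cost, maxDoc);
        }
        return Math.round(pAll * maxDoc);
    }

    // Disjunction via De Morgan: P(A or B) = 1 - (1 - P(A)) * (1 - P(B)).
    static long disjunctionCost(long[] costs, int maxDoc) {
        double pNone = 1.0;
        for (long cost : costs) {
            pNone *= 1.0 - matchProbability(cost, maxDoc);
        }
        return Math.round((1.0 - pNone) * maxDoc);
    }

    public static void main(String[] args) {
        long[] costs = {100, 500};
        int maxDoc = 1000;
        System.out.println("conjunction: " + conjunctionCost(costs, maxDoc));
        System.out.println("disjunction: " + disjunctionCost(costs, maxDoc));
    }
}
```

With input costs of 100 and 500 in a segment of 1000 docs, the current
approach would give 100 for the conjunction (minimum) and 600 for the
disjunction (sum); under the independence assumption both estimates come out
lower.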



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

