[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Doron Cohen (Commented) (JIRA) Tue, 06 Mar 2012 00:09:27 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077 ]


Doron Cohen commented on LUCENE-3821:
-------------------------------------

Thanks Robert, okay, I'll continue with option 2 then.

In addition, perhaps we should try harder for a sloppy version of the current 
ExactPhraseScorer, for both performance and correctness reasons. 

In ExactPhraseScorer, count[posIndex] is incremented by 1 each time the 
conditions for a match (still) hold.

A sloppy version of this, with N terms and slop=S could increment differently:
{noformat}
1 + N*S        at posIndex
1 + N*S - 1    at posIndex-1 and posIndex+1
1 + N*S - 2    at posIndex-2 and posIndex+2
...
1 + N*S - S    at posIndex-S and posIndex+S
{noformat}

For S=0, this reduces to incrementing by 1, and only at posIndex, the same as 
the ExactPhraseScorer, which makes sense.
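
A minimal sketch of this increment scheme, written as a standalone helper. 
The class and method names are hypothetical and are not part of Lucene; the 
sketch also assumes that positions farther than S from the occurrence get no 
increment at all.

```java
// Hypothetical helper, not Lucene code: the proposed per-position increment.
final class SloppyIncrement {
    /**
     * Increment added at a position that is `distance` away from the
     * position where the term actually occurs, for a query with
     * `numTerms` terms (N) and slop `slop` (S).
     */
    static int increment(int numTerms, int slop, int distance) {
        int d = Math.abs(distance);
        if (d > slop) {
            return 0; // assumption: outside the slop window nothing is added
        }
        return 1 + numTerms * slop - d;
    }
}
```

With N=3 and S=2 this gives 7 at the occurrence itself, 6 one position away 
and 5 two positions away. With S=0 it gives 1 at the occurrence and nothing 
elsewhere.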

Also, the success criterion in ExactPhraseScorer, when checking term k, is, 
before adding 1 for term k:
* count[posIndex] == k-1
Or, after adding 1 for term k:
* count[posIndex] == k

In the sloppy version the criterion (after adding term k) would be:
* count[posIndex] >= k*(1+N*S)-S

Again, for S=0 this reduces to the ExactPhraseScorer logic:
* count[posIndex] >= k
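
The sloppy criterion can be written as a one-line function. This is only a 
sketch; the names below are hypothetical and not taken from Lucene.

```java
// Hypothetical helper, not Lucene code: the proposed match threshold.
final class SloppyThreshold {
    /**
     * Minimum value count[posIndex] must have after term k was added,
     * for a query with `numTerms` terms (N) and slop `slop` (S).
     */
    static int minCount(int k, int numTerms, int slop) {
        return k * (1 + numTerms * slop) - slop;
    }
}
```

For N=3 and S=2 the threshold after the third term is 3*7 - 2 = 19. For S=0 
the threshold after term k is simply k.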

Mike (and all), correctness-wise, what do you think?

If you are wondering why the increment at the actual position is (1 + N*S): it 
ensures that only posIndexes to which all terms contributed something can match.
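
This argument can be checked mechanically. The sketch below assumes that a 
single term never adds more than 1 + N*S at any one posIndex (overlapping 
windows of the same term are combined with max, as in the increment table 
further down). Under that assumption, k-1 contributing terms reach at most 
(k-1)*(1+N*S), which is always below the threshold k*(1+N*S)-S. All names are 
hypothetical.

```java
// Hypothetical check, not Lucene code.
final class MissingTermBound {
    /** Largest count reachable when only k-1 of the first k terms contributed. */
    static int maxWithOneTermMissing(int k, int numTerms, int slop) {
        return (k - 1) * (1 + numTerms * slop);
    }

    /** Proposed threshold after term k was added. */
    static int threshold(int k, int numTerms, int slop) {
        return k * (1 + numTerms * slop) - slop;
    }

    /** True if a posIndex could pass the check although one term is missing. */
    static boolean missingTermCanMatch(int k, int numTerms, int slop) {
        return maxWithOneTermMissing(k, numTerms, slop) >= threshold(k, numTerms, slop);
    }
}
```

The gap between the two values is 1 + N*S - S, which is positive for any 
N >= 1, so a posIndex with a missing term cannot pass.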

I drew an example with 5 terms and slop=2 and so far it seems correct.

I also tried 2 terms and slop=5. This seems correct as well, except that when 
there is a match, several posIndexes will contribute to the score of the same 
match. I think this is not too bad, since for these parameters the behavior 
would be the same in all documents. I would be especially forgiving of this if 
it gets us some of the performance benefits of the ExactPhraseScorer.
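
The effect of several posIndexes passing for one match can be reproduced with 
a small simulation. It is a sketch under simplifying assumptions that are not 
in the comment above: each term occurs exactly once, its position has already 
been shifted by the term's offset in the phrase (so an exact match has all 
shifted positions equal), and only the final check with k = N is applied. All 
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Lucene code.
final class SloppyCountDemo {
    /** Returns every posIndex whose accumulated count passes the final check. */
    static List<Integer> matchingPosIndexes(int[] shiftedPositions, int slop) {
        int numTerms = shiftedPositions.length;
        int maxIncr = 1 + numTerms * slop;
        int lo = Integer.MAX_VALUE;
        int hi = Integer.MIN_VALUE;
        for (int p : shiftedPositions) {
            lo = Math.min(lo, p - slop);
            hi = Math.max(hi, p + slop);
        }
        List<Integer> matches = new ArrayList<>();
        for (int pos = lo; pos <= hi; pos++) {
            int count = 0;
            for (int p : shiftedPositions) {
                int d = Math.abs(pos - p);
                if (d <= slop) {
                    count += maxIncr - d; // 1 + N*S - d
                }
            }
            if (count >= numTerms * maxIncr - slop) {
                matches.add(pos);
            }
        }
        return matches;
    }
}
```

With two terms at the same shifted position 20 and slop=5, the posIndexes 18 
through 22 all pass the check, so five posIndexes report the same match. With 
shifted positions 20 and 26 nothing passes, because the terms are 6 apart.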

If we agree on correctness, we need to understand how to implement it, and 
consider the performance effect. The tricky part is to increment at posIndex-n. 
Say there are 3 terms in the query and one of the terms is found at indexes 10, 
15, 18. Assume the slop is 2. Since N=3, the max increment is 
1 + N*S = 1 + 3*2 = 7.

So the increments for this term would be (pos, incr):
{noformat}
Pos   Increment
---   ---------
  8   5
  9   6
 10   7
 11   6
 12   5
 13   5
 14   6
 15   7
 16   6   = max(6,5)
 17   6   = max(5,6)
 18   7
 19   6
 20   5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we 
do not know yet about the contribution of posIndex 18, which is 6, and should 
be used instead of 5. So some look-ahead (or some fix-back) is required, which 
will complicate the code.
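
For illustration, the table above can be reproduced by a "fix-back" style 
sketch that buffers the increments of one term and keeps the larger value 
whenever two windows overlap. This is not a proposal for the real scorer, 
which would have to do this while streaming positions; the class and method 
names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene code.
final class TermIncrements {
    /** Maps each affected position to the increment this term adds there. */
    static Map<Integer, Integer> incrementsFor(int[] positions, int numTerms, int slop) {
        int maxIncr = 1 + numTerms * slop;
        Map<Integer, Integer> incr = new TreeMap<>();
        for (int p : positions) {
            for (int d = -slop; d <= slop; d++) {
                int value = maxIncr - Math.abs(d);
                // Fix-back: a later occurrence may raise a value that an
                // earlier occurrence already wrote for the same position.
                incr.merge(p + d, value, Integer::max);
            }
        }
        return incr;
    }
}
```

Calling incrementsFor(new int[] {10, 15, 18}, 3, 2) yields entries for 
positions 8 through 20. Position 17 first receives 5 from the occurrence at 
15 and is then raised to 6 by the occurrence at 18.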

If this seems promising, we should probably open a new issue for it; I just 
wanted to get some feedback first.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x
