[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Doron Cohen (Commented) (JIRA) Sun, 04 Mar 2012 05:00:26 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879
 ]


Doron Cohen commented on LUCENE-3821:
-------------------------------------

I think I understand the cause.

The current implementation assumes that once we have landed on the 
first candidate document, we can check whether there are repeating pps 
just by comparing the in-doc-positions of the terms. 

Tricky as it is, while this is true for plain PhrasePositions, it is not true 
for MultiPhrasePositions (MPPs). Assume two MPPs, (a m n) and (b x y), and a 
first candidate document that starts with "a b". The in-doc-positions of the 
two pps would be 0 and 1 respectively (for 'a' and 'b'), so we would not even 
detect that there are repetitions, let alone put them in the same 
group.
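
To make the failure mode concrete, here is a minimal sketch of a position-based repeat check next to a term-based one. It is a simplified model, not Lucene code: the class and method names are hypothetical, query offsets are ignored, and a PP is represented only by its set of terms. The second MPP is (b m y), a hypothetical variant chosen so that the two MPPs share the term m; that sharing is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration, not Lucene code. */
class FirstDocRepeatCheck {

  /** First in-doc position at which any term of the PP occurs, or -1 if none does. */
  static int firstPosition(Set<String> ppTerms, List<String> doc) {
    for (int i = 0; i < doc.size(); i++) {
      if (ppTerms.contains(doc.get(i))) {
        return i;
      }
    }
    return -1;
  }

  /** Position-based check: two PPs count as repeats only if they land on the same position. */
  static boolean repeatsByPosition(Set<String> p1, Set<String> p2, List<String> doc) {
    int pos = firstPosition(p1, doc);
    return pos >= 0 && pos == firstPosition(p2, doc);
  }

  /** Term-based check: two PPs count as repeats if they share at least one term. */
  static boolean repeatsByTerms(Set<String> p1, Set<String> p2) {
    return !Collections.disjoint(p1, p2);
  }

  static Set<String> terms(String... t) {
    return new HashSet<>(Arrays.asList(t));
  }

  public static void main(String[] args) {
    List<String> doc = Arrays.asList("a", "b", "m");

    // Plain PPs for the same term land on the same position: detected.
    System.out.println(repeatsByPosition(terms("a"), terms("a"), doc)); // true

    // MPPs sharing 'm' (assumed) land on positions 0 and 1: missed by position.
    Set<String> mpp1 = terms("a", "m", "n");
    Set<String> mpp2 = terms("b", "m", "y");
    System.out.println(repeatsByPosition(mpp1, mpp2, doc)); // false
    System.out.println(repeatsByTerms(mpp1, mpp2));         // true
  }
}
```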

MPPs conflict with the current patch in one more way: the patch now assumes 
that each repetition can be assigned to a repetition group. 

So assume these PPs (and query positions): 
0:a 1:b 3:a 4:b 7:c
There are clearly two repetition groups {0:a, 3:a} and {1:b, 4:b}, 
while 7:c is not a repetition.

But assume these PPs (and query positions): 
0:(a b) 1:(b x) 3:a 4:b 7:(c x)
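We end up with a single large repetition group, because 0:(a b) shares a term 
with each of 1:(b x), 3:a and 4:b, and 1:(b x) shares x with 7:(c x):
{0:(a b) 1:(b x) 3:a 4:b 7:(c x)}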
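
The two examples above can be reproduced with a small sketch that groups PPs by transitively shared terms using union-find. This is only one possible way to form such groups and is not taken from the patch; the class `RepetitionGroups` and its methods are hypothetical, and each PP is modeled as a plain set of terms. The printed numbers are indexes into the input list (0..4), not query positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper, not Lucene code: groups PPs that (transitively) share a term. */
class RepetitionGroups {

  /** Union-find lookup with path halving. */
  private static int find(int[] parent, int i) {
    while (parent[i] != i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  }

  /** pps.get(i) is the term set of the i-th PP; returns only groups with more than one PP. */
  static List<List<Integer>> group(List<Set<String>> pps) {
    int[] parent = new int[pps.size()];
    for (int i = 0; i < parent.length; i++) {
      parent[i] = i;
    }

    // Remember the first PP that used each term; later users are merged with it.
    Map<String, Integer> firstOwner = new HashMap<>();
    for (int i = 0; i < pps.size(); i++) {
      for (String term : pps.get(i)) {
        Integer owner = firstOwner.putIfAbsent(term, i);
        if (owner != null) {
          parent[find(parent, i)] = find(parent, owner);
        }
      }
    }

    // Collect members per root; TreeMap keeps the output order deterministic.
    Map<Integer, List<Integer>> byRoot = new TreeMap<>();
    for (int i = 0; i < pps.size(); i++) {
      byRoot.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
    }
    List<List<Integer>> groups = new ArrayList<>();
    for (List<Integer> g : byRoot.values()) {
      if (g.size() > 1) {
        groups.add(g); // singletons such as 7:c are not repetitions
      }
    }
    return groups;
  }

  static Set<String> terms(String... t) {
    return new HashSet<>(Arrays.asList(t));
  }

  public static void main(String[] args) {
    // 0:a 1:b 3:a 4:b 7:c
    System.out.println(group(Arrays.asList(
        terms("a"), terms("b"), terms("a"), terms("b"), terms("c"))));
    // prints [[0, 2], [1, 3]]

    // 0:(a b) 1:(b x) 3:a 4:b 7:(c x)
    System.out.println(group(Arrays.asList(
        terms("a", "b"), terms("b", "x"), terms("a"), terms("b"), terms("c", "x"))));
    // prints [[0, 1, 2, 3, 4]]
  }
}
```

With plain PPs the groups stay small and separate; as soon as multi-term PPs bridge different terms, the groups merge into one.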

I think that if the groups are created correctly at the first candidate 
document, the scorer logic would still work, since a collision is decided only 
when two pps are in the same in-doc-position. The only impact of MPPs would be 
a performance cost: since repetition groups are larger, it would take longer to 
check whether there are repetitions.
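
As a rough illustration of why larger groups cost more, the sketch below checks one repetition group for a collision, defined as two pps sitting on the same in-doc-position. It is a hypothetical helper, not the scorer's actual code: the group is modeled simply as an array of the members' current in-doc-positions, and the work done is proportional to the group size.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration, not Lucene code. */
class GroupCollision {

  /** Returns true if two members of one repetition group share an in-doc-position. */
  static boolean hasCollision(int[] inDocPositions) {
    Set<Integer> seen = new HashSet<>();
    for (int pos : inDocPositions) {
      if (!seen.add(pos)) {
        return true; // this position is already taken by another pp of the group
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasCollision(new int[] {0, 1, 3})); // false
    System.out.println(hasCollision(new int[] {0, 1, 1})); // true
  }
}
```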

Just need to figure out how to detect repetition groups without relying on 
in-(first-)doc-positions.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x

