[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Doron Cohen (Commented) (JIRA) Tue, 06 Mar 2012 13:56:23 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700
 ]


Doron Cohen commented on LUCENE-3821:
-------------------------------------

I'm afraid it won't solve the problem.

The complicity of SloppyPhraseScorer stems firstly from the slop.
That part is handled in the scorer for long time.

Two additional complications are repeating terms, and multi-term phrases.
Each one of these, separately, is handled as well.
Their combination however, is the cause for this discussion.

To prevent two repeating terms from landing on the same document position, we 
propagate the smaller of them (smaller in its phrase-position, which takes into 
account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document 
"a b", because both "a"'s (query terms) will land on the same document's "a". 
This is illegal and is prevented by such propagation. 

But when one of the repeating terms is a multi-term, it is not possible to know 
which of the repeating terms to propagate. This is the unsolved bug.

Now, back to current ExactPhraseScorer.
It does not have this problem with repeating terms.
But not because of the different algorithm - rather because of the different 
scenario.
It does not have this problem because exact phrase scoring does not have it.
In exact phrase scoring, a match is declared only when all PPs are in the same 
phrase position.
Recall that phrase position = doc-position - query-offset, it is visible that 
when two PPs with different query offset are in the same phrase-position, their 
doc-position cannot be the same, and therefore no special handling is needed 
for repeating terms in exact phrase scorers.

However, once we will add that slopy-decaying frequency, we will match in 
certain posIndex, different phrase-positions. This is because of the slop. So 
they might land on the same doc-position, and then we start again...

This is really too bad. Sorry for the lengthy post, hopefully this would help 
when someone wants to get into this.

Back to option 2.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Reply via email to