[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Robert Muir (Commented) (JIRA) Tue, 06 Mar 2012 16:24:24 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223856#comment-13223856 ]


Robert Muir commented on LUCENE-3821:
-------------------------------------

{quote}
To prevent two repeating terms from landing on the same document position, we 
propagate the smaller of them (smaller in its phrase-position, which takes into 
account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document 
"a b", because both "a"'s (query terms) will land on the same document's "a". 
This is illegal and is prevented by such propagation.

But when one of the repeating terms is a multi-term, it is not possible to know 
which of the repeating terms to propagate. This is the unsolved bug.
{quote}
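
The "a b a"~2 example in the quote can be reproduced with a small brute-force
model. This is only a sketch under stated assumptions, not the
SloppyPhraseScorer code: it assumes one term per document position, defines a
term's phrase-position as its doc-position minus its offset in the query, and
accepts a match when the spread of phrase-positions is within the slop. The
names sloppy_match and distinct_positions are hypothetical.

```python
from itertools import product


def sloppy_match(query, doc, slop, distinct_positions=True):
    """Brute-force check of a simplified sloppy-phrase definition.

    query and doc are lists of terms, one term per position.
    distinct_positions=False lets repeated query terms share one
    document position, which is the illegal case described above.
    """
    if not query:
        return False
    # Map each term to the document positions where it occurs.
    positions = {}
    for pos, term in enumerate(doc):
        positions.setdefault(term, []).append(pos)
    candidates = [positions.get(term, []) for term in query]
    # Try every assignment of query terms to document positions.
    for choice in product(*candidates):
        if distinct_positions and len(set(choice)) != len(choice):
            continue  # two query terms landed on the same doc position
        # Assumed definition: phrase-position = doc-position - query offset.
        phrase_pos = [p - offset for offset, p in enumerate(choice)]
        if max(phrase_pos) - min(phrase_pos) <= slop:
            return True
    return False


query = ["a", "b", "a"]
doc = ["a", "b"]
print(sloppy_match(query, doc, 2, distinct_positions=False))
print(sloppy_match(query, doc, 2, distinct_positions=True))
```

In this model the first call reports a match because both query "a"s pick the
single document "a"; the second call rejects it. The model says nothing about
the multi-term case, which is the part that remains unsolved.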

I don't really understand how SloppyPhraseScorer works right now, and I don't 
want to add confusion to the issue, but I can't help thinking
that this problem is a variant of LevenshteinAutomata... in fact that was the 
motivation for the new test: I just stole the testing
methodology from there and applied it to this!

It seems many things are the same, but with a few twists:
* fundamentally we are interleaving the streams from the subscorers into the 
'index automaton'
* the 'query automaton' is produced from the user-supplied terms
* our 'alphabet' is the terms, and holes from position increments are just an 
additional symbol.
* just like the LevenshteinAutomata case, repeats are problematic because they 
give different characteristic vectors
* stacked terms at the same position (index or query) just make the automata 
more complex (so they aren't just strings)
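
To make the 'alphabet' and 'characteristic vector' points concrete, here is a
rough sketch. It is not Lucene code: the HOLE symbol and both function names
are made up for illustration, tokens are assumed to be (term, position
increment) pairs, and stacked terms (increment 0) are rejected because they
would need an automaton rather than a plain symbol sequence.

```python
HOLE = "<hole>"  # hypothetical extra symbol standing for a position gap


def to_symbols(tokens):
    """Turn (term, position_increment) pairs into a flat symbol sequence."""
    symbols = []
    for term, increment in tokens:
        if increment < 1:
            # Stacked terms share a position, so a string cannot model them.
            raise ValueError("stacked terms are not handled by this sketch")
        symbols.extend([HOLE] * (increment - 1))  # one HOLE per skipped position
        symbols.append(term)
    return symbols


def characteristic_vector(symbol, window):
    """Bit vector marking where symbol occurs in window."""
    return [1 if s == symbol else 0 for s in window]


print(to_symbols([("quick", 1), ("fox", 2)]))
print(characteristic_vector("a", ["a", "b", "a"]))
print(characteristic_vector("b", ["a", "b", "a"]))
```

With no repeats, every term's vector has at most one bit set. A repeated term
such as "a" above sets several bits, which is where the extra state comes from.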

I'm not suggesting we try to re-use any of that code at all; I don't think it 
will work. But I wonder if we can re-use even
some of the math to redefine the problem more formally, for example to figure 
out what minimal state/lookahead we need...

                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, 
> LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, 
> LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case;
> Jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x


        
