[jira] [Updated] (LUCENE-3412) SloppyPhraseScorer returns non-deterministic results for queries with many repeats

Doron Cohen (JIRA) Wed, 07 Sep 2011 05:05:44 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Doron Cohen updated LUCENE-3412:
--------------------------------

    Attachment: LUCENE-3412.patch

Attached patch with fix to this bug.

The fix is rather simple: just process the PPs in offset order. That is, when 
avoiding conflicts (a conflict means that more than one query PP lands on 
the same doc TP), make sure to handle the PPs in a specific order: from first in 
query to last in query. 

This is crucial because the check for conflicts returns the PP with greater 
offset, and that one is advanced.
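
To make the ordering argument concrete, here is a minimal sketch of conflict resolution in offset order. It is not the code from the attached patch: the PP class, its fields, and resolveConflicts() are hypothetical stand-ins, and slop is ignored entirely. It only shows that when PPs are handled from first in query to last in query, and the later PP of a conflicting pair is the one advanced, the outcome no longer depends on the order in which the PPs arrive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OffsetOrderSketch {

    /** Hypothetical stand-in for a query-term cursor (a "PP"); not Lucene's class. */
    static final class PP {
        final int offset;          // index of this term within the query
        final int[] docPositions;  // sorted positions of the term in the document
        int idx = 0;               // cursor into docPositions

        PP(int offset, int[] docPositions) {
            this.offset = offset;
            this.docPositions = docPositions;
        }

        int tp() { return docPositions[idx]; }

        /** Moves to the next document position; false when exhausted. */
        boolean advance() { return ++idx < docPositions.length; }
    }

    /**
     * Handles PPs from first in query to last in query. When the current PP
     * lands on a doc position already held by an earlier PP, the current one
     * (the one with the greater offset) is advanced. Returns false if some PP
     * runs out of positions, i.e. the document cannot match.
     */
    static boolean resolveConflicts(List<PP> pps) {
        List<PP> ordered = new ArrayList<>(pps);
        ordered.sort(Comparator.comparingInt(pp -> pp.offset));
        for (int i = 0; i < ordered.size(); i++) {
            PP current = ordered.get(i);
            if (current.docPositions.length == 0) {
                return false;
            }
            int j = 0;
            while (j < i) {
                if (ordered.get(j).tp() == current.tp()) {
                    if (!current.advance()) {
                        return false;
                    }
                    j = 0; // re-check against all earlier PPs
                } else {
                    j++;
                }
            }
        }
        return true;
    }

    /** Builds one PP per query term, all pointing at the same doc positions. */
    static List<PP> repeats(int queryTerms, int[] docPositions) {
        List<PP> pps = new ArrayList<>();
        // Added in reverse on purpose: input order must not matter.
        for (int offset = queryTerms - 1; offset >= 0; offset--) {
            pps.add(new PP(offset, docPositions));
        }
        return pps;
    }

    public static void main(String[] args) {
        // Query "dog dog dog dog" against "dog dog dog": prints false
        System.out.println(resolveConflicts(repeats(4, new int[] {0, 1, 2})));
        // Query "dog dog dog dog" against "dog dog dog dog": prints true
        System.out.println(resolveConflicts(repeats(4, new int[] {0, 1, 2, 3})));
    }
}
```

In this simplified model the three-word document can never satisfy a four-term query, whatever order the PPs are supplied in, which is the deterministic behavior the issue expects.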

It was pretty quick to fix this, but took longer to justify the fix.

I added some explanations in the code so that next time justification would be 
faster :) and also renamed termPositionsDiffer() to termPositionsConflict() 
which more accurately describes the logic of that method.

Now I need to see whether this fix is also related to LUCENE-3215.

> SloppyPhraseScorer returns non-deterministic results for queries with many 
> repeats
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-3412
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3412
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 3.1, 3.2, 3.3, 4.0
>            Reporter: Michael Ryan
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3412.patch, LUCENE-3412.patch
>
>
> Proximity queries with many repeats (four or more, based on my testing) 
> return non-deterministic results. I run the same query multiple times with 
> the same data set and get different results.
> So far I've reproduced this with Solr 1.4.1, 3.1, 3.2, 3.3, and latest 4.0 
> trunk.
> Steps to reproduce (using the Solr example):
> 1) In solrconfig.xml, set queryResultCache size to 0.
> 2) Add some documents with text "dog dog dog" and "dog dog dog dog". 
> http://localhost:8983/solr/update?stream.body=%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E1%3C/field%3E%3Cfield%20name=%22text%22%3Edog%20dog%20dog%3C/field%3E%3C/doc%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E2%3C/field%3E%3Cfield%20name=%22text%22%3Edog%20dog%20dog%20dog%3C/field%3E%3C/doc%3E%3C/add%3E&commit=true
> 3) Do a "dog dog dog dog"~1 query. 
> http://localhost:8983/solr/select?q=%22dog%20dog%20dog%20dog%22~1
> 4) Repeat step 3 many times.
> Expected results: The document with id 2 should be returned.
> Actual results: The document with id 2 is always returned. The document with 
> id 1 is sometimes returned.
> Different proximity values show the same bug - "dog dog dog dog"~5, "dog dog 
> dog dog"~100, etc show the same behavior.
> So far I've traced it down to the "repeats" array in 
> SloppyPhraseScorer.initPhrasePositions() - depending on the order of the 
> elements in this array, the document may or may not match. I think the 
> HashSet may be to blame, but I'm not sure - that at least seems to be where 
> the non-determinism is coming from.
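
The reporter's suspicion about the HashSet can be illustrated with a small sketch. It is not Lucene code: the Repeat class and both helper methods are hypothetical. It assumes elements that do not override hashCode(), so HashSet iteration follows identity hash codes and may differ between runs, while a copy sorted by query offset is always the same.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RepeatsOrderSketch {

    /** Hypothetical element type; equals/hashCode are deliberately not overridden. */
    static final class Repeat {
        final int offset; // index of the term within the query

        Repeat(int offset) { this.offset = offset; }
    }

    /** Offsets in whatever order the set happens to iterate; not guaranteed stable. */
    static List<Integer> offsetsInSetOrder(Set<Repeat> repeats) {
        List<Integer> out = new ArrayList<>();
        for (Repeat r : repeats) {
            out.add(r.offset);
        }
        return out;
    }

    /** Offsets after sorting by query offset; the same on every run. */
    static List<Integer> offsetsInQueryOrder(Set<Repeat> repeats) {
        List<Repeat> sorted = new ArrayList<>(repeats);
        sorted.sort(Comparator.comparingInt(r -> r.offset));
        List<Integer> out = new ArrayList<>();
        for (Repeat r : sorted) {
            out.add(r.offset);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Repeat> repeats = new HashSet<>();
        for (int offset = 0; offset < 4; offset++) {
            repeats.add(new Repeat(offset));
        }
        // May print a different permutation from one JVM run to the next.
        System.out.println("set order:   " + offsetsInSetOrder(repeats));
        // Always prints the offsets in ascending order.
        System.out.println("query order: " + offsetsInQueryOrder(repeats));
    }
}
```

Any logic whose result depends on the first line's order can behave differently across runs, which would be consistent with the symptom described in the report.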

