[jira] [Comment Edited] (SOLR-8496) Facet search count numbers are falsified by older document versions when multi-select is used

Yonik Seeley (JIRA) Sat, 16 Jan 2016 11:07:58 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103376#comment-15103376
 ]


Yonik Seeley edited comment on SOLR-8496 at 1/16/16 7:06 PM:
-------------------------------------------------------------

bq. I'm concerned this bug may be hitting us in many different places besides 
facets, such as field collapsing, and exporting.

Indeed.  We may still be vulnerable, but not due to *this* bug in particular.

The underlying change was LUCENE-6553, and it may yet cause bugs (like this 
one) in other areas.
Deleted docs are now only screened out before hitting the Collector.  So any 
place that does something lower level, like Weight.scorer(), is vulnerable *if* 
it is used in a context that was expecting only live docs.
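
To make the failure mode concrete, here is a minimal sketch. It is a simplified 
model built on java.util.BitSet, not real Lucene or Solr code: the class 
LiveDocsDemo and the methods countRaw and countLive are hypothetical names 
invented for illustration. It assumes an update leaves the old copy of a 
document in the segment, marked as deleted, so a raw posting list still lists 
it.

```java
import java.util.BitSet;

public class LiveDocsDemo {
    // Counts every doc in a raw posting list, as a low-level iterator would see it.
    static int countRaw(int[] postings) {
        return postings.length;
    }

    // Counts only docs whose bit is set in liveDocs; null means "no deletions".
    static int countLive(int[] postings, BitSet liveDocs) {
        int n = 0;
        for (int doc : postings) {
            if (liveDocs == null || liveDocs.get(doc)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical segment with 6 doc slots.
        // Doc 2 was updated: its old copy is marked deleted, the new copy is doc 5.
        BitSet liveDocs = new BitSet(6);
        liveDocs.set(0, 6);
        liveDocs.clear(2);

        // The term still matches the stale doc 2 as well as its replacement.
        int[] postings = {0, 1, 2, 3, 5};

        System.out.println("raw=" + countRaw(postings));            // raw=5
        System.out.println("live=" + countLive(postings, liveDocs)); // live=4
    }
}
```

The off-by-one between the two counts mirrors the 4 versus 5 reported in this 
issue: the consumer that skips the liveDocs check counts the stale copy.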

This specific bug:
The DocSet returned from SolrIndexSearcher.getDocSet(List<Query>) could contain 
deleted documents (and that breaks our current invariant that DocSets never 
contain deleted docs).
LUCENE-6553 changed (among many others) this line: 
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L2473
We used to pass liveDocs at that point, but that method signature was removed.
So now, if all clauses to be intersected are uncached, then Weight.scorer() is 
used for all of them and the intersection can thus still contain deleted docs.  
If even one clause is a normal DocSet, we're good, since DocSets do reflect 
liveDocs.

So the fix was to detect the case where all clauses are uncached (i.e. will use 
Weight.scorer) and check liveDocs in that specific case.
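
The shape of that fix can be sketched as follows. This is a simplified model 
built on java.util.BitSet, not the attached patch or the real 
SolrIndexSearcher code: the names IntersectDemo, Clause, intersect and bits 
are hypothetical. It assumes that a "cached" clause already contains only live 
docs, while an "uncached" clause is a raw match set that may include deleted 
ones.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class IntersectDemo {
    /** One clause: either a cached set (live docs only) or raw matches (may include deleted docs). */
    static final class Clause {
        final BitSet docs;
        final boolean cached;

        Clause(BitSet docs, boolean cached) {
            this.docs = docs;
            this.cached = cached;
        }
    }

    /** Intersects all clauses, applying liveDocs only when no clause was cached. */
    static BitSet intersect(List<Clause> clauses, BitSet liveDocs) {
        BitSet result = null;
        boolean anyCached = false;
        for (Clause c : clauses) {
            if (result == null) {
                result = (BitSet) c.docs.clone();
            } else {
                result.and(c.docs);
            }
            anyCached |= c.cached;
        }
        if (result == null) {
            return new BitSet(); // no clauses: empty result in this model
        }
        // A cached clause already excludes deleted docs, so the intersection does too.
        // Only the all-uncached case needs the explicit liveDocs check.
        if (!anyCached && liveDocs != null) {
            result.and(liveDocs);
        }
        return result;
    }

    /** Convenience helper to build a BitSet from doc ids. */
    static BitSet bits(int... docs) {
        BitSet b = new BitSet();
        for (int d : docs) {
            b.set(d);
        }
        return b;
    }

    public static void main(String[] args) {
        BitSet liveDocs = bits(0, 1, 3, 4, 5); // doc 2 is deleted
        List<Clause> allUncached = Arrays.asList(
                new Clause(bits(0, 1, 2, 3, 5), false),
                new Clause(bits(1, 2, 3), false));
        System.out.println(intersect(allUncached, liveDocs)); // {1, 3}
    }
}
```

Without the final liveDocs step, the all-uncached intersection in this model 
would be {1, 2, 3}, carrying the deleted doc 2 into the result.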



> Facet search count numbers are falsified by older document versions when 
> multi-select is used
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8496
>                 URL: https://issues.apache.org/jira/browse/SOLR-8496
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.4
>         Environment: Linux 3.16.0-4-amd64 x86_64 Debian 8.2
> openjdk-7-jre-headless:amd64   version 7u91-2.6.3-1~deb8u1
> solr-5.4.0, extracted from official tar
> Default solr settings from install script:
> SOLR_HEAP="512m"
> GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution 
> -XX:+PrintGCApplicationStoppedTime"
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:TargetSurvivorRatio=90 \
> -XX:MaxTenuringThreshold=8 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
>            Reporter: Andreas Müller
>            Assignee: Yonik Seeley
>             Fix For: 5.5, Trunk
>
>         Attachments: SOLR-8496.patch
>
>
> Our setup is based on multiple cores. In one core we have a multi-valued 
> field with integer values and some other unimportant fields. We're using 
> multi-select faceting for this field.
> We're querying a test scenario with:
> {code}
> http://localhost:8983/solr/core-name/select?q=dummyask: (true) AND 
> manufacturer: false AND id: (15039 16882 10850 
> 20781)&fq={!tag=professions}professions: 
> (59)&fl=id&wt=json&indent=true&facet=true&facet.field={!ex=professions}professions
> {code}
> - Query: (numDocs:48545, maxDoc:48545)
> {code:xml}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <result name="response" numFound="4" start="0">
> <doc>
> <int name="id">10850</int>
> </doc>
> <doc>
> <int name="id">16882</int>
> </doc>
> <doc>
> <int name="id">15039</int>
> </doc>
> <doc>
> <int name="id">20781</int>
> </doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="professions">
> <int name="59">4</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> <lst name="facet_intervals"/>
> <lst name="facet_heatmaps"/>
> </lst>
> </response>
> {code}
> - Then we update one document and change some fields (numDocs:48545, 
> maxDoc:48546). *maxDoc has increased.*
> {code:xml}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">1</int>
> </lst>
> <result name="response" numFound="4" start="0">
> <doc>
> <int name="id">10850</int>
> </doc>
> <doc>
> <int name="id">16882</int>
> </doc>
> <doc>
> <int name="id">15039</int>
> </doc>
> <doc>
> <int name="id">20781</int>
> </doc>
> </result>
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">
> <lst name="professions">
> <int name="59">5</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> <lst name="facet_intervals"/>
> <lst name="facet_heatmaps"/>
> </lst>
> </response>
> {code}
> *The Problem:*
> In the first query, we're getting a facet count of 4, which is correct. After 
> updating one document, we're getting 5 as a result, which is not correct.



