[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

Steve Rowe (JIRA) Wed, 20 Jun 2018 15:57:12 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518698#comment-16518698
 ]


Steve Rowe commented on SOLR-12343:
-----------------------------------

Not sure if it relates to this bug -- please move/add if not -- but my Jenkins 
found a reproducing failure for {{TestCloudJSONFacetSKG.testBespoke()}}:

{noformat}
Checking out Revision 008bc74bebef96414f19118a267dbf982aba58b9 (refs/remotes/origin/master)
[...]
ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testBespoke -Dtests.seed=5D223D88BF5BF89 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=America/Asuncion -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.11s J0  | TestCloudJSONFacetSKG.testBespoke <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: Didn't check a single bucket???
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([5D223D88BF5BF89:E09A7E14375787E]:0)
   [junit4]    >        at org.apache.solr.cloud.TestCloudJSONFacetSKG.testBespoke(TestCloudJSONFacetSKG.java:219)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
[...]
   [junit4]   2> NOTE: test params are: codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST, chunkSize=4, maxDocsPerChunk=1, blockSize=332), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, chunkSize=4, blockSize=332)), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@4052d535), locale=el, timezone=Indian/Antananarivo
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_151 (64-bit)/cpus=16,threads=1,free=213710424,total=526909440
{noformat}

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*.
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * Assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but they are *not* returned at all by shard2, because these terms both have 
> very high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut-off" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are accumulated across all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during refinement 
> can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...
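
The termX/termY scenario in the description can be reproduced with a small standalone simulation. This is only a sketch of the described behaviour, not Solr's actual refinement code: the shard data, the helper names (shard_top, facet_count_asc) and the extra terms termA/termB/termC are all made up for illustration, and the two-phase logic is a simplification of what the description calls "simple refinement".

```python
from collections import defaultdict

# Hypothetical per-shard term counts (not taken from any real index).
SHARDS = {
    "shard1": {"termA": 1, "termX": 2, "termY": 3, "termB": 40, "termC": 40},
    "shard2": {"termA": 1, "termB": 50, "termC": 60, "termX": 100, "termY": 200},
}


def shard_top(counts, n):
    """Return the n lowest-count terms of one shard (a 'count asc' sort)."""
    return dict(sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n])


def facet_count_asc(shards, limit, overrequest):
    """Simplified two-phase 'count asc' facet with a naive re-sort at the end."""
    known = defaultdict(dict)  # term -> {shard name: count reported so far}

    # Phase 1: every shard reports its own top (limit + overrequest) terms.
    for name, counts in shards.items():
        for term, count in shard_top(counts, limit + overrequest).items():
            known[term][name] = count

    def total(term):
        return sum(known[term].values())

    def ranked():
        return sorted(known, key=lambda t: (total(t), t))

    # Phase 2: refine only the current top `limit` terms by asking the
    # shards that did not report them in phase 1.
    for term in ranked()[:limit]:
        for name, counts in shards.items():
            if name not in known[term]:
                known[term][name] = counts.get(term, 0)

    # The problematic step: re-sort ALL known buckets, refined or not.
    return [(t, total(t), len(known[t]) == len(shards)) for t in ranked()[:limit]]


# Each tuple is (term, returned count, fully refined?).
result = facet_count_asc(SHARDS, limit=2, overrequest=1)
print(result)
```

With this sample data, termX is refined and its known count rises from 2 to 102, so the final re-sort pushes it out of the top 2 and termY takes its place with only its shard1 count of 3, although its count across both shards is 203.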



