[jira] [Created] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

Hoss Man (JIRA) Thu, 10 May 2018 14:36:32 -0700

Hoss Man created SOLR-12343:
-------------------------------

             Summary: JSON Field Facet refinement can return incorrect 
counts/stats for sorted buckets
                 Key: SOLR-12343
                 URL: https://issues.apache.org/jira/browse/SOLR-12343
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Hoss Man



The way JSON Facet's simple refinement "re-sorts" buckets after refinement can 
cause _refined_ buckets to be "bumped out" of the topN based on the refined 
counts/stats depending on the sort - causing _unrefined_ buckets originally 
discounted in phase#2 to bubble up into the topN and be returned to clients 
*with inaccurate counts/stats*.

The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 
'count asc'}} facet:
 * assume shard1 returns termX & termY in phase#1 because they have very low 
shard1 counts
 ** but they are *not* returned at all by shard2, because both terms have very 
high shard2 counts.
 * Assume termX has a slightly lower shard1 count than termY, such that:
 ** termX "makes the cut" for the limit=N topN buckets
 ** termY does not make the cut, and is the "N+1" known bucket at the end of 
phase#1
 * termX then gets included in the phase#2 refinement request against shard2
 ** termX now has a much higher _known_ total count than termY
 ** the coordinator now sorts termX "worse" in the sorted list of buckets than 
termY
 ** which causes termY to bubble up into the topN
 * termY is ultimately included in the final result _with incomplete 
count/stat/sub-facet data_ instead of termX
 ** this is all independent of the possibility that termY may actually have a 
significantly higher total count than termX across the entire collection
 ** the key problem is that all/most of the other terms returned to the client 
have counts/stats that are the cumulation of all shards, but termY only has the 
contributions from shard1
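
The steps above can be reproduced with a toy model. This is only a sketch of the described behavior, not Solr's actual refinement code: the shard data, the extra terms (termL, termA, termB), the {{limit}} of 2, the per-shard limit of 3, and the helper names are all made up for illustration.

```python
# Hypothetical per-shard counts, chosen to match the termX/termY scenario.
SHARD1 = {"termL": 1, "termX": 2, "termY": 3, "termA": 10, "termB": 11}
SHARD2 = {"termL": 1, "termA": 10, "termB": 11, "termX": 90, "termY": 95}


def top_terms(counts, k):
    """Return the k (term, count) pairs that sort best under 'count asc'."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:k]


def simulate(shards, limit, shard_limit):
    """Toy two-phase facet: returns (term, known_count, fully_refined)."""
    merged = {}  # term -> total count known to the coordinator
    seen = {}    # term -> indexes of the shards that have contributed

    # Phase 1: every shard returns only its own best `shard_limit` buckets.
    for i, shard in enumerate(shards):
        for term, count in top_terms(shard, shard_limit):
            merged[term] = merged.get(term, 0) + count
            seen.setdefault(term, set()).add(i)

    # The coordinator picks the topN candidates from the phase-1 data.
    candidates = [term for term, _ in top_terms(merged, limit)]

    # Phase 2: refine only those candidates against the shards that
    # have not yet contributed to them.
    for term in candidates:
        for i, shard in enumerate(shards):
            if i not in seen[term]:
                merged[term] += shard.get(term, 0)
                seen[term].add(i)

    # Re-sort after refinement: a refined bucket can now fall out of the
    # topN and be replaced by a bucket that was never refined.
    return [
        (term, count, len(seen[term]) == len(shards))
        for term, count in top_terms(merged, limit)
    ]


print(simulate([SHARD1, SHARD2], limit=2, shard_limit=3))
# [('termL', 2, True), ('termY', 3, False)]
```

In this toy data termY is returned with a count of 3 even though its count across both shards is 98, while the refined termX (92) has been bumped out of the result.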

Important Notes:
 * This scenario can happen regardless of the amount of overrequest used. 
Additional overrequest just increases the number of "extra" terms needed in the 
index with "better" sort values than termX & termY in shard2
 * {{sort: 'count asc'}} is not just an exceptional/pathological case:
 ** any function sort where additional data provided by shards during refinement 
can cause a bucket to "sort worse" can also cause this problem.
 ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
asc|desc}} , etc...
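
One way to check whether a given sort is exposed is to compare a bucket's sort value from shard1 alone with its value after shard2's data is merged in. The sketch below is illustrative only: the {{sorts_worse}} helper and the sample values are hypothetical, and it assumes non-negative values for the {{sum}} case.

```python
from statistics import mean

# Stat functions applied to the raw values that fall into one bucket.
STATS = {"count": len, "sum": sum, "min": min, "max": max, "avg": mean}


def sorts_worse(stat, direction, shard1_values, shard2_values):
    """True if merging shard2's values makes the bucket sort worse.

    Under 'asc' a larger value sorts worse; under 'desc' a smaller one does.
    """
    fn = STATS[stat]
    partial = fn(shard1_values)                 # known after phase#1
    full = fn(shard1_values + shard2_values)    # known after refinement
    return full > partial if direction == "asc" else full < partial


shard1, shard2 = [5, 7], [1, 20, 30]
print(sorts_worse("count", "asc", shard1, shard2))      # True
print(sorts_worse("sum", "asc", shard1, shard2))        # True
print(sorts_worse("min", "desc", shard1, shard2))       # True
print(sorts_worse("avg", "asc", shard1, shard2))        # True
print(sorts_worse("avg", "desc", shard1, [1, 1, 1]))    # True
print(sorts_worse("count", "desc", shard1, shard2))     # False
print(sorts_worse("max", "desc", shard1, shard2))       # False
```

In this sketch {{count desc}} and {{max desc}} never sort worse after merging, because extra shard data can only raise those values.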



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

