[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-19 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549603#comment-16549603
 ] 

ASF subversion and git services commented on SOLR-12343:


Commit 3a5d4a25df310d2021fa947ea593cc9b3c93a386 in lucene-solr's branch 
refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a5d4a2 ]

SOLR-12343: Fixed a bug in JSON Faceting that could cause incorrect 
counts/stats when using non default sort options

This also adds a new configurable "overrefine" option
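
To illustrate, a request exercising the fix might look like the sketch below. This is only a hedged example: the collection, field names, and option values are illustrative assumptions (only the "overrefine" name itself comes from the commit message), so consult the ref guide for exact semantics and defaults:

{noformat}
curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&json.facet={
  cats : {
    type   : terms,
    field  : cat,
    limit  : 5,
    sort   : "min_price asc",   /* a non-default sort of the kind the fix addresses */
    refine : true,              /* enable distributed refinement */
    overrefine : 10,            /* the new configurable option (value chosen arbitrarily) */
    facet  : { min_price : "min(price)" }
  }
}'
{noformat}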


> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> 
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, 
> SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, 
> SOLR-12343.patch, SOLR-12343.patch, __incomplete_processEmpty_microfix.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*.
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but they are *not* returned at all by shard2, because these terms both 
> have very high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cutoff" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the accumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}}, {{min(price_i) desc}}, {{avg(price_i) 
> asc|desc}}, etc.
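
A minimal sketch of the kind of request the description above is talking about (collection and field names are illustrative; {{cat_s}}/{{price_i}} are just the example fields used elsewhere in this issue):

{noformat}
curl http://localhost:8983/solr/collection1/query -d 'q=*:*&json.facet={
  terms_by_count : {
    type   : terms,
    field  : cat_s,
    limit  : 5,              /* the "topN" returned to clients */
    sort   : "count asc",    /* the sort used in the walkthrough above */
    refine : true            /* phase#2 refinement is what re-sorts the buckets */
  }
}'
{noformat}

Refining termX raises its known count, which under {{count asc}} is exactly what pushes it behind the never-refined termY in the merged ordering.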






[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-19 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549602#comment-16549602
 ] 

ASF subversion and git services commented on SOLR-12343:


Commit a7fe950074a834edc070c265df1394181b268683 in lucene-solr's branch 
refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a7fe950 ]

SOLR-12343: Fixed a bug in JSON Faceting that could cause incorrect 
counts/stats when using non default sort options

This also adds a new configurable "overrefine" option

(cherry picked from commit 3a5d4a25df310d2021fa947ea593cc9b3c93a386)





[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-17 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546927#comment-16546927
 ] 

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
21s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | 
{color:green}  2m 10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}179m 55s{color} 
| {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}189m 14s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
|   | solr.cloud.api.collections.ShardSplitTest |
|   | solr.cloud.autoscaling.sim.TestGenericDistributedQueue |
|   | solr.handler.component.InfixSuggestersTest |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931845/SOLR-12343.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  validaterefguide  |
| uname | Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 
21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / d730c8b |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014 |
| Default Java | 1.8.0_172 |
| unit | 
https://builds.apache.org/job/PreCommit-SOLR-Build/144/artifact/out/patch-unit-solr_core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/144/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/144/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-16 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545604#comment-16545604
 ] 

Hoss Man commented on SOLR-12343:
-

{quote}But then, once all the refinement is done and we have a fully refined 
bucketX, it might now sort "lower" than an incomplete bucketY ... and 
{{isBucketComplete}} doesn't pay any attention to {{processEmpty:true}} ... so 
it sees that shardA does *not* have {{more:true}} and thinks (the incomplete) 
bucketY is ok to return.
{quote}

I haven't been able to come up with a better solution for this, and since 
processEmpty is a pretty special case, I think I'm just going to break it out 
into its own Jira and revise the patch so that the current assertion failures 
are confined to test methods that are \@AwaitsFix'ed on that issue -- that way 
we can move forward with the existing fix that likely impacts more people.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-11 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541000#comment-16541000
 ] 

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m  
7s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  2m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  2m 41s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} Validate source patterns {color} | 
{color:red}  2m 41s{color} | {color:red} Validate source patterns 
validate-source-patterns failed {color} |
| {color:red}-1{color} | {color:red} Validate ref guide {color} | {color:red}  
2m 41s{color} | {color:red} Validate source patterns validate-source-patterns 
failed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 19s{color} 
| {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}104m 44s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
|   | solr.cloud.api.collections.ShardSplitTest |
|   | solr.search.facet.TestJsonFacetRefinement |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12931226/SOLR-12343.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  validaterefguide  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP 
Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / fe180bb |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 8 2015 |
| Default Java | 1.8.0_172 |
| Validate source patterns | 
https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-validate-source-patterns-root.txt
 |
| Validate ref guide | 
https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-validate-source-patterns-root.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-unit-solr_core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/143/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/143/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-11 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540576#comment-16540576
 ] 

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 13m 
36s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 16m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green} 17m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green} 16m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green} 16m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | 
{color:green} 16m 20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}153m 34s{color} 
| {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}213m 21s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
solr.cloud.api.collections.TestCollectionsAPIViaSolrCloudCluster |
|   | solr.cloud.cdcr.CdcrBidirectionalTest |
|   | solr.cloud.autoscaling.sim.TestExecutePlanAction |
|   | solr.cloud.autoscaling.SearchRateTriggerIntegrationTest |
|   | solr.cloud.api.collections.ShardSplitTest |
|   | solr.cloud.autoscaling.sim.TestLargeCluster |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12930878/SOLR-12343.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  validaterefguide  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP 
Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / fe180bb |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 8 2015 |
| Default Java | 1.8.0_172 |
| unit | 
https://builds.apache.org/job/PreCommit-SOLR-Build/142/artifact/out/patch-unit-solr_core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/142/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/142/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-11 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540330#comment-16540330
 ] 

Hoss Man commented on SOLR-12343:
-

[~ysee...@gmail.com] - I've been testing this out with the SKG (relatedness()) 
function -- where I initially discovered this bug -- and trying to remove the 
workarounds for this that are currently in TestCloudJSONFacetSKG (grep for 
SOLR-12343), but I'm seeing some failures that I think I've traced back to a 
mistake in isBucketComplete() that _only_ affects facets using 
{{processEmpty:true}} ...

{panel}
In {{getRefinement()}} you've got {{returnedAllBuckets}} taking into 
consideration {{processEmpty:true}} -- so that even if a shardA doesn't say it 
has {{more:true}} we will still send it candidate bucketX for refinement if we 
didn't explicitly see bucketX on shardA.  So far so good.

But then, once all the refinement is done and we have a fully refined bucketX, 
it might now sort "lower" than an incomplete bucketY ... and 
{{isBucketComplete}} doesn't pay any attention to {{processEmpty:true}} ... so 
it sees that shardA does *not* have {{more:true}} and thinks (the incomplete) 
bucketY is ok to return.
{panel}

...I'll work up an isolated test case.
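
For concreteness, a hedged sketch of the kind of facet this affects -- a terms facet sorted on {{relatedness()}} with {{processEmpty:true}} and refinement enabled (the field name and the {{$fore}}/{{$back}} parameter references are illustrative placeholders, assuming foreground/background queries are passed as request params):

{noformat}
{
  skg_terms : {
    type   : terms,
    field  : foo_s,                      /* illustrative field name */
    limit  : 5,
    sort   : "skg desc",
    refine : true,
    processEmpty : true,                 /* the option isBucketComplete() ignores */
    facet  : { skg : "relatedness($fore,$back)" }
  }
}
{noformat}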




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-09 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537904#comment-16537904
 ] 

Yonik Seeley commented on SOLR-12343:
-

Looks good, thanks for tracking that down!




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-09 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537306#comment-16537306
 ] 

Hoss Man commented on SOLR-12343:
-

Ok ... fresh eyes and I see the problem.

When {{final int overreq = 0}} we don't add any "filler" docs, which means that 
when the nested facet test happens, shardC0 and shardC1 disagree about the "top 
term" for the parent facet on the {{all_ss}} field -- shardC0 only knows about 
{{z_all}}, while shardC1 has a tie between {{z_all}} and {{some}}, and {{some}} 
wins the tie due to index order -- so when that parent facet uses 
{{overrequest:0}} the initial merge logic doesn't have any contributions from 
shardC1 for the chosen {{all_ss:z_all}} bucket ... so it only knows to ask to 
refine the top 3 child buckets it does know about (from shardC0): "A,B,C".  If 
the parent facet uses any overrequest larger than 0, then it would get the 
{{all_ss:z_all}} bucket from shardC1 as well, and would have some child buckets 
to consider, letting it know that C is a bad candidate and that it should be 
refining X instead.

On the flip side, when {{final int overreq = 1}} (or anything higher), the 
addition of even a few filler docs is enough to skew the {{all_ss}} term stats 
on shardC1, such that it *also* thinks {{z_all}} is the top term, so 
regardless of the amount of overrequest on the top facet, the phase #1 merge 
has buckets from both shards for the child facet to consider.



I remember that when I was writing this test and included the {{some}} terms, 
the entire point was to stress the case where the 2 shards disagree about the 
"top" term from the parent facet -- but apparently when adding the filler 
docs/terms randomization I broke that, so it's not always true; it only 
happens when there are no filler docs.  But it also seems like an unfair test, 
because when they do disagree, there's no reason for the merge logic to think X 
is a worthwhile term to refine; what matters is that in this case, C is 
accurately refined.

I'm working up a test fix...
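
For reference, the nested facet used by this test, reformatted for readability from the request that shows up in the logs of the 2018-07-08 comment below:

{noformat}
{
  all : {
    type : terms, field : all_ss, limit : 1, refine : true, overrequest : 0,
    facet : {
      cat_count : { type : terms, field : cat_s, limit : 3, overrequest : 0,
                    refine : true, sort : 'count asc' },
      cat_price : { type : terms, field : cat_s, limit : 3, overrequest : 0,
                    refine : true, sort : 'sum_p asc',
                    facet : { sum_p : 'sum(price_i)' } }
    }
  }
}
{noformat}

With {{overrequest:0}} at both levels, only shardC0 contributes to the {{z_all}} bucket in phase #1, so the child facets have no shardC1 buckets to weigh when picking refinement candidates.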




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-08 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536494#comment-16536494
 ] 

Hoss Man commented on SOLR-12343:
-

Found one – it seems to be specific to the situation where {{overrequest==0}} 
and the facet is nested under another facet?

Playing with the values of {{top_over}} and {{top_refine}}, it doesn't seem to 
matter if the parent facet is refined, but the key is whether the top facet also 
uses {{overrequest:0}} (fails) or {{overrequest:999}} (passes).

 
{noformat}
   [junit4]   2> 9990 INFO  (qtp1276305453-48) [x:collection1] 
o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
params={df=text=false&_facet_={}=id=score=1048580=0=true=127.0.0.1:47372/solr/collection1=0=2=*:*={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'sum_p+asc',+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}=1531102182236=true=javabin}
 hits=9 status=0 QTime=17
   [junit4]   2> 9994 INFO  (qtp1276305453-49) [x:collection1] 
o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
params={df=text=false&_facet_={"refine":{"all":{"_p":[["z_all",{"cat_count":{"_l":["A","B","C"]},"cat_price":{"_l":["A","B","C"]}}]]}}}=2097152=127.0.0.1:47372/solr/collection1=0=2=*:*={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'sum_p+asc',+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}=1531102182236=true=false=javabin}
 hits=9 status=0 QTime=1
   [junit4]   2> 9996 INFO  (qtp1503674478-65) [x:collection1] 
o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
params={shards=127.0.0.1:54950/solr/collection1,127.0.0.1:47372/solr/collection1,127.0.0.1:52833/solr/collection1=debugQuery=true=*:*={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'sum_p+asc',+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}=true=0=json=2.2}
 hits=19 status=0 QTime=25
   [junit4]   2> 9997 ERROR 
(TEST-TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN-seed#[775BF43EF8268D50])
 [] o.a.s.SolrTestCaseHS query failed JSON validation. error=mismatch: 
'X'!='C' @ facets/all/buckets/[0]/cat_count/buckets/[2]/val
   [junit4]   2>  expected =facets=={ count: 19,all:{ buckets:[   { val:z_all, 
count: 19,cat_count:{ buckets:[  {val:A,count:1},   
  {val:B,count:1}, {val:X,count:4},] },cat_price:{ 
buckets:[  {val:A,count:1,sum_p:1.0}, 
{val:B,count:1,sum_p:1.0}, {val:X,count:4,sum_p:4.0},] }} ] 
} }
   [junit4]   2>  response = {
   [junit4]   2>   "responseHeader":{
   [junit4]   2> "status":0,
   [junit4]   2> "QTime":25},
   [junit4]   2>   "response":{"numFound":19,"start":0,"maxScore":1.0,"docs":[]
   [junit4]   2>   },
   [junit4]   2>   "facets":{
   [junit4]   2> "count":19,
   [junit4]   2> "all":{
   [junit4]   2>   "buckets":[{
   [junit4]   2>   "val":"z_all",
   [junit4]   2>   "count":19,
   [junit4]   2>   "cat_price":{
   [junit4]   2> "buckets":[{
   [junit4]   2> "val":"A",
   [junit4]   2> "count":1,
   [junit4]   2> "sum_p":1.0},
   [junit4]   2>   {
   [junit4]   2> "val":"B",
   [junit4]   2> "count":1,
   [junit4]   2> "sum_p":1.0},
   [junit4]   2>   {
   [junit4]   2> "val":"C",
   [junit4]   2> "count":6,
   [junit4]   2> "sum_p":6.0}]},
   [junit4]   2>   "cat_count":{
   [junit4]   2> "buckets":[{
   [junit4]   2> "val":"A",
   [junit4]   2> "count":1},
   [junit4]   2>   {
   [junit4]   2> "val":"B",
   [junit4]   2> "count":1},
   [junit4]   2>   {
   [junit4]   2> "val":"C",
   [junit4]   2> "count":6}]}}]}}}
   [junit4]   2> 
   [junit4]   2> 1 INFO  
(TEST-TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN-seed#[775BF43EF8268D50])
 [] o.a.s.SolrTestCaseJ4 ###Ending 
testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN

[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-08 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536481#comment-16536481
 ] 

Hoss Man commented on SOLR-12343:
-

Which assertion?  Stacktrace?  Reproduce line? ... does the seed actually 
reproduce?

There's virtually no randomization in the test at all, except for the number of 
filler terms / overrequest.

If you're seeing a seed that reproduces, it makes me wonder if there is an 
edge case / off-by-one error based on the number of buckets ... if the seed 
doesn't reproduce (reliably) then it makes me wonder if it's an edge case that 
has to do with which order the shards respond in (ie: how the merger 
initializes the data structures that get merged).




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-08 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536095#comment-16536095
 ] 

Yonik Seeley commented on SOLR-12343:
-

I'm occasionally getting a failure in 
testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN. 
I haven't tried digging into it yet though.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-08 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536065#comment-16536065
 ] 

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
59s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | 
{color:green}  1m 49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 33s{color} 
| {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 73m  9s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
|   | solr.cloud.api.collections.ShardSplitTest |
|   | solr.cloud.ForceLeaderTest |
|   | solr.cloud.api.collections.TestCollectionsAPIViaSolrCloudCluster |
|   | solr.cloud.autoscaling.sim.TestLargeCluster |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12930572/SOLR-12343.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  validaterefguide  |
| uname | Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 
21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / b7d14c5 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014 |
| Default Java | 1.8.0_172 |
| unit | 
https://builds.apache.org/job/PreCommit-SOLR-Build/140/artifact/out/patch-unit-solr_core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/140/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/140/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-05 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534377#comment-16534377
 ] 

Yonik Seeley commented on SOLR-12343:
-

bq. it will stop returning the facet range "other" buckets completely since 
currently no code refines them at all

Hmmm, so the patch I attached seems like it would only remove incomplete 
buckets in field facets under "other" buckets (i.e. if they don't actually need 
refining to be complete, they won't be removed by the current patch).  But this 
could still be worse in some cases (missing vs. incomplete when refinement is 
requested), so I agree this can wait until SOLR-12516 is done.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-05 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533943#comment-16533943
 ] 

Hoss Man commented on SOLR-12343:
-

yonik: hold up ... i put this on the backburner because of SOLR-12516 (which 
i'm currently actively working on)

Fixing SOLR-12343 before SOLR-12516 will make SOLR-12516 a lot worse in the 
common case (it will stop returning the facet range "other" buckets completely 
since currently no code refines them at all)




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530657#comment-16530657
 ] 

Yonik Seeley commented on SOLR-12343:
-

I think some of what I just worked on for SOLR-12326 is related to (or can be 
used by) this issue.
FacetRequestSortedMerger now has a "BitSet shardHasMoreBuckets" to help deal 
with the fact that complete buckets do not need participation from every shard. 
 That info in conjunction with Context.sawShard should be enough to tell if a 
bucket is already "complete".
For every bucket that isn't complete, we can either refine it, or drop it.
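
To make that completeness test concrete, here is a minimal, hypothetical sketch 
(plain {{java.util.BitSet}}, not the actual {{FacetRequestSortedMerger}} code; 
the names {{contributed}} and {{shardHasMoreBuckets}} are stand-ins): a bucket 
is only incomplete if some shard has not contributed to it *and* that shard may 
still have buckets it did not return.

{code:java}
import java.util.BitSet;

// Hypothetical illustration of the "is this bucket already complete?" test.
public class BucketCompletenessSketch {

  // contributed: shards that merged data into this bucket
  // shardHasMoreBuckets: shards that hit their limit and may hold unreturned buckets
  static boolean isComplete(BitSet contributed, BitSet shardHasMoreBuckets, int numShards) {
    for (int shard = 0; shard < numShards; shard++) {
      // a missing shard only matters if it might still have data for this bucket
      if (!contributed.get(shard) && shardHasMoreBuckets.get(shard)) {
        return false; // not complete: refine it, or drop it
      }
    }
    return true;
  }

  public static void main(String[] args) {
    BitSet contributed = new BitSet();
    contributed.set(0);            // only shard0 returned this bucket
    BitSet hasMore = new BitSet();
    hasMore.set(1);                // shard1 returned a full page, so it may have more buckets
    System.out.println(isComplete(contributed, hasMore, 2)); // false -> needs refinement
  }
}
{code}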





[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-06-21 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519673#comment-16519673
 ] 

Hoss Man commented on SOLR-12343:
-

{quote}Not sure if it relates to this bug...
{quote}
No, that's an unrelated stupid test mistake that i've fixed locally and am 
currently hammering on – but thanks for pointing it out!




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-06-20 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518698#comment-16518698
 ] 

Steve Rowe commented on SOLR-12343:
---

Not sure if it relates to this bug -- please move/add if not -- but my Jenkins 
found a reproducing failure for {{TestCloudJSONFacetSKG.testBespoke()}}:

{noformat}
Checking out Revision 008bc74bebef96414f19118a267dbf982aba58b9 
(refs/remotes/origin/master)
[...]
ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testBespoke 
-Dtests.seed=5D223D88BF5BF89 -Dtests.slow=true -Dtests.locale=bg-BG 
-Dtests.timezone=America/Asuncion -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.11s J0  | TestCloudJSONFacetSKG.testBespoke <<<
   [junit4]> Throwable #1: java.lang.AssertionError: Didn't check a single 
bucket???
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([5D223D88BF5BF89:E09A7E14375787E]:0)
   [junit4]>at 
org.apache.solr.cloud.TestCloudJSONFacetSKG.testBespoke(TestCloudJSONFacetSKG.java:219)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
[...]
   [junit4]   2> NOTE: test params are: 
codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST,
 chunkSize=4, maxDocsPerChunk=1, blockSize=332), 
termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, 
chunkSize=4, blockSize=332)), 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@4052d535),
 locale=el, timezone=Indian/Antananarivo
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 
1.8.0_151 (64-bit)/cpus=16,threads=1,free=213710424,total=526909440
{noformat}




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-06-18 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516506#comment-16516506
 ] 

Hoss Man commented on SOLR-12343:
-


Updated patch with more tests and some code tweaks based on a few things the 
new tests caught.

Still outstanding is the question of the new BitSets I added...

{quote}
* buckets now keep track of how many shards contributed to them ...
** there's a nocommit in here about the possibility of re-using the 
{{Context.sawShard}} BitSet instead – but i couldn't wrap my head around an 
efficient way to do it so i punted
* ...buckets are excluded if a bucket doesn't have contributions from as many 
shards as the FacetField...
** again, i needed a new BitSet at the FacetField level to count the shards 
– because Context.numShards may include shards that never return any results 
for the facet (ie: empty shard) so they never merge any data at all
{quote}

I _think_ it should be possible to re-implement the 
{{FacetBucket.getNumShardsMerged()}} method (i added) using 
{{Context.sawShard}} by using {{sawShard.get(bucketNum * numShards, bucketNum * 
numShards + numShards)}} to take a "slice" of the BitSet just for the current 
bucket and then look at its cardinality.  The added cost of taking the slice 
only for buckets being considered in sorted order is probably a better 
trade-off than the overhead of creating a new BitSet for every FacetBucket even 
if they are never considered for the response.
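
As a purely illustrative sketch of that slice-and-count idea, using a plain 
{{java.util.BitSet}} with one bit per (bucket, shard) pair rather than the real 
{{Context}}/{{FacetBucket}} classes:

{code:java}
import java.util.BitSet;

// Toy sketch of counting how many shards contributed to a bucket via a "slice"
// of a flat sawShard BitSet (bit index = bucketNum * numShards + shardNum).
public class SawShardSliceSketch {

  static int numShardsMerged(BitSet sawShard, int bucketNum, int numShards) {
    // BitSet.get(from, to) returns a new BitSet containing just that range of bits
    return sawShard.get(bucketNum * numShards, bucketNum * numShards + numShards)
                   .cardinality();
  }

  public static void main(String[] args) {
    int numShards = 4;
    BitSet sawShard = new BitSet();
    sawShard.set(0 * numShards + 0);   // bucket 0 seen by shard 0
    sawShard.set(0 * numShards + 2);   // bucket 0 seen by shard 2
    sawShard.set(1 * numShards + 3);   // bucket 1 seen by shard 3 only

    System.out.println(numShardsMerged(sawShard, 0, numShards)); // 2
    System.out.println(numShardsMerged(sawShard, 1, numShards)); // 1
  }
}
{code}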

But I still don't see any way to efficiently figure out the "shards that 
participated" info needed at the {{FacetField}} level using the existing 
{{sawShard}} BitSet -- particularly with the changes I had to make to account 
for the case where a shard has docs participating in a facet, but not matching 
any buckets (see 
{{testSortedSubFacetRefinementWhenParentOnlyReturnedByOneShard}} ).  
Fortunately that's just one new BitSet per FacetField instance (not per bucket).



I'll look at refactoring {{FacetBucket.getNumShardsMerged()}} to use 
{{Context.sawShard}} soon.




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-05-24 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490020#comment-16490020
 ] 

Hoss Man commented on SOLR-12343:
-

{quote}there is a new {{overrefine:N}} option which works similar to 
overrequest – but instead of determining how many "extra" terms to request in 
phase#1, it determines how many "extra" buckets should be in 
{{numBucketsToCheck}} for refinement in phase #2 ...
{quote}
It occurs to me now that adding this option should also provide a "solution" 
for SOLR-11733 ... people who are concerned about refining long tail terms can 
set {{overrefine}} really high.
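
For illustration, a hedged SolrJ sketch of what that might look like from a 
client (the collection name, field name and the value {{100}} are placeholders; 
the exact {{overrefine}} semantics and defaults are whatever the committed 
patch defines):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class OverrefineExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);
      // a sort that can get "worse" as refinement adds data, plus a large
      // overrefine so long-tail buckets are also refined before the final sort
      q.add("json.facet",
            "{ categories : { type:terms, field:cat_s, limit:10, sort:'count asc',"
          + "                 refine:true, overrefine:100 } }");
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getResponse().get("facets"));
    }
  }
}
{code}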




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-05-23 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487839#comment-16487839
 ] 

Hoss Man commented on SOLR-12343:
-

{quote}... I think it should just be considered a bug.
{quote}
That's pretty much my feeling, but I wasn't sure.
{quote}Truncating the list of buckets to N before the refinement phase would 
fix the bug, but it would also throw away complete buckets that could make it 
into the top N after refinement.
{quote}
oh right ... yeah, i was forgetting about buckets that got data from all shards 
in phase #1.
{quote}Exactly which buckets we chose to refine (and exactly how many) can 
remain an implementation detail. ...
{quote}
right ... it can be heuristically determined, and very conservative in cases 
where we know it doesn't matter – but i still think there should be an explicit 
option...

I worked up a patch similar to the straw man i outlined above – except that i 
didn't add the {{refine:required}} variant since we're in agreement that this 
is a bug.

In the new patch:
 * buckets now keep track of how many shards contributed to them
 ** I did this with a quick and dirty BitSet instead of an {{int 
numShardsContributing}} counter since we have to handle the possibility that 
{{mergeBuckets()}} will get called more than once for a single shard when we 
have partial refinement of sub-facets
 ** there's a nocommit in here about the possibility of re-using the 
{{Context.sawShard}} BitSet instead – but i couldn't wrap my head around an 
efficient way to do it so i punted
 * during the final "pruning" in {{FacetFieldMerger.getMergedResult()}} buckets 
are excluded if a bucket doesn't have contributions from as many shards as the 
FacetField
 ** again, i needed a new BitSet at the FacetField level to count the shards 
– because {{Context.numShards}} may include shards that never return *any* 
results for the facet (ie: empty shard) so they never merge any data at all
 * there is a new {{overrefine:N}} option which works similar to overrequest – 
but instead of determining how many "extra" terms to request in phase#1, it 
determines how many "extra" buckets should be in {{numBucketsToCheck}} for 
refinement in phase #2 (but if some buckets are already fully populated going 
into phase #2, then the actual number "refined" in phase #2 can be lower than 
limit+overrefine)
 ** the default heuristic currently pays attention to the sort – since (IIUC) 
{{count desc}} and {{index asc|desc}} should never need any "over refinement" 
unless {{mincount > 1}}
 ** if we have a non-trivial sort, and the user specified an explicit 
{{overrequest:N}} then the default heuristic for {{overrefine}} uses the same 
value {{N}}
 *** because i'm assuming if people have explicitly requested {{sort:SPECIAL, 
refine:true, overrequest:N}} then they care about the accuracy of the terms 
to some degree N, and the bigger N is the more we should care about 
over-refinement as well.
 ** if neither {{overrequest}} nor {{overrefine}} is explicitly set, then we 
use the same {{limit * 1.1 + 4}} type heuristic as {{overrequest}} (see the 
sketch after this list)
 ** there's another nocommit here though: if we're using a heuristic, should we 
be scaling the derived {{numBucketsToCheck}} based on {{mincount}} ? ... if 
{{mincount=M > 1}} should we be doing something like {{numBucketsToCheck *= M}} 
??
 *** although, thinking about it now – this kind of mincount based factor would 
probably make more sense in the {{overrequest}} heuristic? maybe for 
{{overrefine}} we should look at how many buckets were already fully populated 
in phase#1 _AND_ meet the mincount, and use the difference between that 
number and the limit to decide a scaling factor?
 *** either way: can probably TODO this for a future enhancement.
 * Testing wise...
 ** These changes fix the problems in previous test patch
 ** I've also added some more tests, but there are nocommits to add a lot more 
including verification of nested facets
 ** I didn't want to go too deep down the testing rabbit hole until i was sure 
we wanted to go this route.
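
Here is a rough, stand-alone sketch of the default selection logic described in 
the list above (plain Java, not the real {{FacetField}}/merger code; treat the 
exact conditions and constants as a paraphrase of the bullets, not the 
committed behavior):

{code:java}
// Toy sketch of the default overrefine heuristic outlined above.
public class OverrefineHeuristicSketch {

  // explicitOverrequest < 0 means the user did not set overrequest explicitly
  static int defaultOverrefine(String sort, long mincount, int limit, int explicitOverrequest) {
    boolean trivialSort = "count desc".equals(sort)
                       || "index asc".equals(sort) || "index desc".equals(sort);
    if (trivialSort && mincount <= 1) {
      return 0;                         // refinement can't make these sorts "worse"
    }
    if (explicitOverrequest >= 0) {
      return explicitOverrequest;       // mirror the accuracy the user asked for
    }
    return (int) (limit * 1.1) + 4;     // same shape as the overrequest default
  }

  public static void main(String[] args) {
    System.out.println(defaultOverrefine("count desc", 1, 10, -1));       // 0
    System.out.println(defaultOverrefine("count asc", 1, 10, 20));        // 20
    System.out.println(defaultOverrefine("sum(price_i) asc", 1, 10, -1)); // 15
  }
}
{code}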

what do you think?


[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-05-18 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481366#comment-16481366
 ] 

Yonik Seeley commented on SOLR-12343:
-

I think the most important thing here is that individual buckets should have 
correct stats.  The behavior uncovered here was not intentional and isn't 
useful, so I think it should just be considered a bug.

Truncating the list of buckets to N before the refinement phase would fix the 
bug, but it would also throw away complete buckets that could make it into the 
top N after refinement.  One could tweak to only throw away incomplete buckets 
after the top N, but that still leaves the filtering complications you brought 
up. In the long term, perhaps a cursorMark approach would work better in 
conjunction with filtering? Although it does feel like paging facets is a less 
important feature in general.

 Exactly which buckets we choose to refine (and exactly how many) can remain an 
implementation detail. The essence of the simple refinement algorithm is:
 1) collect top buckets from each shard
 2) refine some subset of those buckets (refinement == ensure every shard that 
can contribute to that bucket has)
 3) return only refined buckets
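
A toy restatement of those three steps in plain Java (hypothetical terms and 
counts, not the real merger classes), mainly to underline step 3 – only refined 
buckets make it into the response:

{code:java}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy walk-through of the three steps above.
public class SimpleRefinementSketch {
  public static void main(String[] args) {
    // 1) collect top buckets from each shard (term -> merged count so far)
    Map<String, Long> known = new LinkedHashMap<>();
    known.put("termX", 105L);  // refined: every shard that has termX contributed
    known.put("termY", 5L);    // unrefined: only one shard contributed

    // 2) refine some subset of those buckets (here, only termX was refined)
    Set<String> refined = Collections.singleton("termX");

    // 3) return only refined buckets
    known.keySet().retainAll(refined);
    System.out.println(known);  // {termX=105}
  }
}
{code}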

 




[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-05-10 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471335#comment-16471335
 ] 

Hoss Man commented on SOLR-12343:
-

My initial thinking was that {{FacetRequestSortedMerger.sortBuckets()}} should 
go ahead and truncate the list of buckets based on the {{limit+offset}} as the 
very last thing it does – for the "pre-refinement" call to sortBuckets() this 
wouldn't change anything about the buckets selected for refinement, and for the 
"post-refinement" call to sortBuckets() it would only change the order of the 
buckets already refined – bug goes away. There's even a comment on the 
pre-refinement call that says {{// todo: make sure this filters buckets as 
well}} which seemed to be directly on point.

Except... looking at the post-refinement use of sortBuckets() in 
FacetFieldMerger, I realize that the {{mincount}} type "filtering" (which is 
probably what that {{// todo}} actually referred to) needs to be applied 
*after* the buckets are sorted, but before pruning down based on the 
offset+limit.

With something like {{count desc}} it wouldn't matter if we "pre-truncate" the 
list, because if any of the refined buckets don't have a count>mincount, then 
there's no chance any of the un-refined buckets will satisfy that mincount 
either ... but for things like {{index asc|desc}} or sorting by functions: it 
definitely matters in order to ensure we return the full "limit" # of buckets.
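
A minimal sketch of that ordering constraint (toy {{Bucket}} type, not the real 
{{FacetFieldMerger}} code): the mincount filter has to run before the final 
offset+limit pruning, otherwise a pre-truncated list can come up short for 
non-count sorts.

{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Toy illustration of "sort, then mincount-filter, then prune to offset+limit".
public class SortFilterPruneSketch {

  static final class Bucket {
    final String term;
    final long count;
    Bucket(String term, long count) { this.term = term; this.count = count; }
    @Override public String toString() { return term + "=" + count; }
  }

  static List<Bucket> finalBuckets(List<Bucket> merged, Comparator<Bucket> sort,
                                   long mincount, int offset, int limit) {
    return merged.stream()
        .sorted(sort)                       // sort by the requested sort
        .filter(b -> b.count >= mincount)   // apply mincount filtering...
        .skip(offset)                       // ...and only then prune on offset+limit
        .limit(limit)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Bucket> merged = Arrays.asList(
        new Bucket("a", 0), new Bucket("b", 7), new Bucket("c", 3));
    // index asc, mincount=1, limit=2: "a" is filtered out, so "b" and "c" both fit
    System.out.println(finalBuckets(merged, Comparator.comparing(b -> b.term), 1, 0, 2));
  }
}
{code}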

Although, I guess a key question i have is: if the user has explicitly 
requested refinement, then is there really any value in returning the full 
"limit" # of buckets if some of those buckets aren't refined?

That really seems like the crux of this bug: to me, it seems like when 
refinement is requested we should *NEVER* return an unrefined bucket (ie: a 
bucket that is lying about its count/stats) ... but I can imagine other folks 
might feel differently.

Anyone have strong opinions?

 

For now, I'll assume the current behavior is considered desirable by some, and 
brainstorm potential enhancements to make it optional...

Perhaps we should add a new {{refine:required}} variant? If the user says 
refinement is required, then {{sortBuckets()}} could pre-truncate.

Or maybe better still:
 * we add an {{int numShardsContributing = 1}} to {{FacetBucket}} that gets 
incremented every time a shard is merged in.
 * Add the new {{refine:required}} option but implemented differently...
 ** {{sortBuckets()}} doesn't change – leave all the un-refined buckets in 
{{sortedBuckets}} all the time
 ** Consumers of {{sortedBuckets}} (like {{FacetFieldMerger.getMergedResult()}} 
) are responsible for checking the type of refinement:
 *** if it was {{required}} , then filter the buckets on 
{{numShardsContributing}} just like the existing filtering on mincount in the 
same loop
 * *Additionally:* add a new {{overrefine:N}} option that can be used in 
conjunction with, or independently from {{refine:required}}
 ** Default to '0' for back compat
 ** used during refinement phase similar to how "overrequest" is used during 
the initial request
 *** ie: {{FacetRequestSortedMerger}} would add it to the limit when computing 
{{numBucketsToCheck}}

This way, clients that are willing to "pay extra" during refinement can request 
that additional terms get refined – which can be useful for non-trivial sorts 
to ensure that the "best" buckets really are returned.  Independently clients 
can indicate if they are unwilling to accept un-refined buckets in the response 
because they care about accuracy, or would rather have as many buckets (up to 
limit) returned as possible, even if they couldn't be refined.

 

What do folks think?

[~yo...@apache.org] do you see any problems with this approach? or have 
alternative suggestions?

 

 


[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-05-10 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471156#comment-16471156
 ] 

Hoss Man commented on SOLR-12343:
-

Ultimately what seems to be at issue here is a discrepancy between how Yonik 
designed the "simple" facet algorithm, and how it's implemented – but it's only 
problematic in these "additional information from refinement can make sort 
values 'worse'" type situations.

As Yonik noted in SOLR-11733 regarding the design of {{refine:simple|true}} ...
{quote}[compared to facet.field] ...the refinement algorithm being different 
(and for a single-level facet field, simpler).
 It can be explained as:
 1) find buckets to return as if you weren't doing refinement
 2) for those buckets, make sure all shards have contributed to the statistics
 i.e. simple refinement doesn't change the buckets you get back.
{quote}
But in actuality, adding {{refine:true}} _can_ change the buckets you get back. 
In my example above, if {{refine:false}} was used, termX would have ultimately 
been returned (with an unrefined count) – but because of refinement it's not 
returned, and termY is returned in its place.

I've attached a simple test patch demonstrating the problem but I haven't yet 
dug into the code to figure out the best fix.

I _suspect_ what's needed (to stick to the intent of {{refine:simple}} ) is 
that after the coordinator picks buckets that need refining, it should prune 
down the list of "all known" (size {{limit=N + overrequest=R}}) buckets to just 
the "buckets to return" (size {{limit=N}}) so that once the refinement values 
come in, the _set_ of buckets doesn't change, even if the _order_ of the buckets 
does.
