[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549603#comment-16549603 ]

ASF subversion and git services commented on SOLR-12343:

Commit 3a5d4a25df310d2021fa947ea593cc9b3c93a386 in lucene-solr's branch refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a5d4a2 ]

SOLR-12343: Fixed a bug in JSON Faceting that could cause incorrect counts/stats when using non default sort options

This also adds a new configurable "overrefine" option

> JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Yonik Seeley
> Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, __incomplete_processEmpty_microfix.patch
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause _refined_ buckets to be "bumped out" of the topN based on the refined counts/stats depending on the sort - causing _unrefined_ buckets originally discounted in phase#2 to bubble up into the topN and be returned to clients *with inaccurate counts/stats*
>
> The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 'count asc'}} facet:
> * assume shard1 returns termX & termY in phase#1 because they have very low shard1 counts
> ** but *not* returned at all by shard2, because these terms both have very high shard2 counts.
> * Assume termX has a slightly lower shard1 count than termY, such that:
> ** termX "makes the cut" for the limit=N topN buckets
> ** termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
> * termX then gets included in the phase#2 refinement request against shard2
> ** termX now has a much higher _known_ total count than termY
> ** the coordinator now sorts termX "worse" in the sorted list of buckets than termY
> ** which causes termY to bubble up into the topN
> * termY is ultimately included in the final result _with incomplete count/stat/sub-facet data_ instead of termX
> ** this is all independent of the possibility that termY may actually have a significantly higher total count than termX across the entire collection
> ** the key problem is that all/most of the other terms returned to the client have counts/stats that are the cumulation of all shards, but termY only has the contributions from shard1
>
> Important Notes:
> * This scenario can happen regardless of the amount of overrequest used. Additional overrequest just increases the number of "extra" terms needed in the index with "better" sort values than termX & termY in shard2
> * {{sort: 'count asc'}} is not just an exceptional/pathological case:
> ** any function sort where additional data provided by shards during refinement can cause a bucket to "sort worse" can also cause this problem.
> ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) asc|desc}} , etc...

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
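The termX/termY scenario in the quoted issue description can be reproduced with a small simulation. This is only a sketch of the described behavior, not Solr's code: the shard contents, the extra terms (termA, termB, termC), the counts, and the function names `phase1`, `naive_refine`, and `exact` are all invented for illustration.

```python
# Hypothetical per-shard term counts (not taken from the issue).
shard1 = {"termA": 1, "termX": 2, "termY": 3, "termB": 50, "termC": 60}
shard2 = {"termA": 1, "termB": 5, "termC": 6, "termX": 90, "termY": 95}


def phase1(shard, n):
    """Return the n lowest-count buckets of one shard (sort: 'count asc')."""
    return dict(sorted(shard.items(), key=lambda kv: (kv[1], kv[0]))[:n])


def naive_refine(shards, limit, overrequest):
    """Merge, refine only the current topN, then re-sort ALL known buckets."""
    responses = [phase1(s, limit + overrequest) for s in shards]
    merged = {}
    for resp in responses:
        for term, count in resp.items():
            merged[term] = merged.get(term, 0) + count
    order = sorted(merged, key=lambda t: (merged[t], t))
    # Phase #2: ask shards that did not return a topN bucket for its count.
    for term in order[:limit]:
        for shard, resp in zip(shards, responses):
            if term not in resp:
                merged[term] += shard.get(term, 0)
    # The problematic step: unrefined buckets compete with refined ones.
    order = sorted(merged, key=lambda t: (merged[t], t))
    return [(t, merged[t]) for t in order[:limit]]


def exact(shards, limit):
    """Ground truth: sum every term over every shard, then sort."""
    total = {}
    for shard in shards:
        for term, count in shard.items():
            total[term] = total.get(term, 0) + count
    return sorted(total.items(), key=lambda kv: (kv[1], kv[0]))[:limit]


buggy = naive_refine([shard1, shard2], limit=2, overrequest=1)
truth = exact([shard1, shard2], limit=2)
print(buggy)
print(truth)
```

With these made-up numbers, termX is refined to 92 and drops out of the top 2, and termY takes its place with a count of 3 that reflects only shard1, while its collection-wide count is 98.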
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549602#comment-16549602 ]

ASF subversion and git services commented on SOLR-12343:

Commit a7fe950074a834edc070c265df1394181b268683 in lucene-solr's branch refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a7fe950 ]

SOLR-12343: Fixed a bug in JSON Faceting that could cause incorrect counts/stats when using non default sort options

This also adds a new configurable "overrefine" option

(cherry picked from commit 3a5d4a25df310d2021fa947ea593cc9b3c93a386)
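The commit message mentions a new configurable "overrefine" option. As a rough sketch of how a client might set it on a terms facet, the snippet below builds a JSON Facet request body. The facet name `categories`, the field `cat_s`, and all numeric values are hypothetical, and the assumption here is that `overrefine` sits next to `refine` and `overrequest` in the facet definition; check the ref guide of the release that contains this commit before relying on it.

```python
import json

# Hypothetical facet definition; only the option names come from Solr's
# JSON Facet API, every value below is made up for illustration.
facet_request = {
    "query": "*:*",
    "facet": {
        "categories": {
            "type": "terms",
            "field": "cat_s",        # hypothetical field name
            "sort": "count asc",     # a non-default sort, as in the issue
            "limit": 5,
            "refine": True,          # enable the phase#2 refinement
            "overrequest": 10,       # extra buckets requested per shard
            "overrefine": 10,        # extra buckets considered for refinement
        }
    },
}

body = json.dumps(facet_request, indent=2)
print(body)
```

The resulting string would be sent as the request body of a search against a distributed collection; how it is sent (HTTP client, SolrJ, curl) is left out on purpose.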
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546927#comment-16546927 ]

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 21s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}179m 55s{color} | {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}189m 14s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
| | solr.cloud.api.collections.ShardSplitTest |
| | solr.cloud.autoscaling.sim.TestGenericDistributedQueue |
| | solr.handler.component.InfixSuggestersTest |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931845/SOLR-12343.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide |
| uname | Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / d730c8b |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-SOLR-Build/144/artifact/out/patch-unit-solr_core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/144/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/144/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545604#comment-16545604 ]

Hoss Man commented on SOLR-12343:
-

{quote}but then, once all the refinement is done, and we have a fully refined bucketX it might now sort "lower" than an incomplete bucketY ... and {{isBucketComplete}} doesn't pay any attention to {{processEmpty:true}} ... so it sees that shardA does *not* have {{more:true}} and thinks (the incomplete) bucketY is ok to return.
{quote}

I haven't been able to come up with a better solution for this, and since processEmpty is a pretty special case, I think i'm just going to break it out into its own Jira, and revise the patch so that the current assertion failures are confined to test methods that are \@AwaitsFix'ed on that issue -- that way we can move forward with the existing fix that likely impacts more people.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541000#comment-16541000 ]

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 7s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 2m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 2m 41s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} Validate source patterns {color} | {color:red} 2m 41s{color} | {color:red} Validate source patterns validate-source-patterns failed {color} |
| {color:red}-1{color} | {color:red} Validate ref guide {color} | {color:red} 2m 41s{color} | {color:red} Validate source patterns validate-source-patterns failed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 19s{color} | {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}104m 44s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
| | solr.cloud.api.collections.ShardSplitTest |
| | solr.search.facet.TestJsonFacetRefinement |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931226/SOLR-12343.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / fe180bb |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 8 2015 |
| Default Java | 1.8.0_172 |
| Validate source patterns | https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-validate-source-patterns-root.txt |
| Validate ref guide | https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-validate-source-patterns-root.txt |
| unit | https://builds.apache.org/job/PreCommit-SOLR-Build/143/artifact/out/patch-unit-solr_core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/143/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/143/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540576#comment-16540576 ]

Lucene/Solr QA commented on SOLR-12343:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 13m 36s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 16m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 17m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green} 16m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green} 16m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | {color:green} 16m 20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}153m 34s{color} | {color:red} core in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}213m 21s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | solr.cloud.api.collections.TestCollectionsAPIViaSolrCloudCluster |
| | solr.cloud.cdcr.CdcrBidirectionalTest |
| | solr.cloud.autoscaling.sim.TestExecutePlanAction |
| | solr.cloud.autoscaling.SearchRateTriggerIntegrationTest |
| | solr.cloud.api.collections.ShardSplitTest |
| | solr.cloud.autoscaling.sim.TestLargeCluster |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930878/SOLR-12343.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / fe180bb |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 8 2015 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-SOLR-Build/142/artifact/out/patch-unit-solr_core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/142/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/142/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540330#comment-16540330 ]

Hoss Man commented on SOLR-12343:
-

[~ysee...@gmail.com] - i've been testing this out with the SKG (relatedness()) function -- where i initially discovered the bug -- and trying to remove the workarounds for this that are currently in TestCloudJSONFacetSKG (grep for SOLR-12343) but i'm seeing some failures that I think i've traced back to a mistake in isBucketComplete() that _only_ affects facets using {{processEmpty:true}} ...

{panel}
in {{getRefinement()}} you've got {{returnedAllBuckets}} taking into consideration {{processEmpty:true}} -- so that even if a shardA doesn't say it has {{more:true}} we will still send it candidate bucketX for refinement if we didn't explicitly {{saw}} bucketX on shardA.

so far so good.

but then, once all the refinement is done, and we have a fully refined bucketX it might now sort "lower" than an incomplete bucketY ... and {{isBucketComplete}} doesn't pay any attention to {{processEmpty:true}} ... so it sees that shardA does *not* have {{more:true}} and thinks (the incomplete) bucketY is ok to return.
{panel}

...I'll work up an isolated test case
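The completeness check discussed in this comment can be sketched as follows. This is not the Solr implementation: `ShardResponse`, `is_complete_ignoring_process_empty`, and `is_complete` are hypothetical names, and the rule encoded here is only a reading of the comment, namely that with {{processEmpty:true}} a shard without {{more:true}} can still have something to contribute for a bucket it did not return.

```python
from dataclasses import dataclass, field


@dataclass
class ShardResponse:
    """Hypothetical summary of one shard's phase#1 facet response."""
    more: bool                              # shard reported more:true
    saw: set = field(default_factory=set)   # buckets the shard returned


def is_complete_ignoring_process_empty(bucket, shards):
    # Behavior described as the mistake: a shard without more:true is
    # assumed to have nothing to add for a bucket it did not return.
    return all(bucket in s.saw or not s.more for s in shards)


def is_complete(bucket, shards, process_empty):
    # Sketch of a stricter rule: with processEmpty:true, only shards that
    # actually returned the bucket count as having contributed to it.
    return all(
        bucket in s.saw or (not s.more and not process_empty)
        for s in shards
    )


shard_a = ShardResponse(more=False, saw={"bucketX"})
shard_b = ShardResponse(more=False, saw={"bucketX", "bucketY"})
shards = [shard_a, shard_b]

print(is_complete_ignoring_process_empty("bucketY", shards))
print(is_complete("bucketY", shards, process_empty=True))
print(is_complete("bucketY", shards, process_empty=False))
```

In this toy setup bucketY was never returned by shardA, so the first check accepts it while the stricter check with `process_empty=True` rejects it.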
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537904#comment-16537904 ]

Yonik Seeley commented on SOLR-12343:
-

Looks good, thanks for tracking that down!
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537306#comment-16537306 ]
Hoss Man commented on SOLR-12343:

Ok ... fresh eyes and i see the problem.

When {{final int overreq = 0}} we don't add any "filler" docs, which means that when the nested facet test happens, shardC0 and shardC1 disagree about the "top term" for the parent facet on the {{all_ss}} field -- shardC0 only knows about {{z_all}} while shardC1 has a tie between {{z_all}} and {{some}}, and {{some}} wins the tie due to index order -- so when that parent facet uses {{overrequest:0}} the initial merge logic doesn't have any contributions from shardC1 for the chosen {{all_ss:z_all}} bucket ... so it only knows to ask to refine the top3 child buckets it does know about (from shardC0): "A,B,C".

If the parent facet uses any overrequest larger than 0, then it would get the {{all_ss:z_all}} bucket from shardC1 as well, and have some child buckets to consider to know that C is a bad candidate, and it should be refining X instead.

On the flip side, when {{final int overreq = 1}} (or anything higher) the addition of even a few filler docs is enough to skew the {{all_ss}} term stats on shardC1, such that it *also* thinks {{z_all}} is the top term, so regardless of the amount of overrequest on the top facet, the phase #1 merge has buckets from both shards for the child facet to consider.

I remember when i was writing this test and included the {{some}} terms, the entire point was to stress the case where the 2 shards disagree about the "top" term from the parent facet -- but apparently when adding the filler docs/terms randomization i broke that so that it's not always true, it only happens when there are no filler docs.

But it also seems like an unfair test, because when they do disagree, there's no reason for the merge logic to think X is a worthwhile term to refine. What matters is that in this case, C is accurately refined.

I'm working up a test fix...
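The phase #1 effect described in the comment above (the parent facet's {{overrequest}} deciding which child buckets the coordinator even knows about) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not Solr's merge code: the class and method names are hypothetical, the parent {{limit}} is fixed at 1, the child facets use no overrequest, and the per-shard counts are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the phase #1 merge; hypothetical names, not Solr's classes. */
final class NestedOverrequestSketch {

  /** One parent bucket as a single shard reports it. */
  static final class ParentBucket {
    final String val;
    final long count;
    final Map<String, Long> children; // child term -> count on this shard

    ParentBucket(String val, long count, Map<String, Long> children) {
      this.val = val;
      this.count = count;
      this.children = children;
    }
  }

  /** Small helper to build a term -> count map from alternating arguments. */
  static Map<String, Long> counts(Object... kv) {
    Map<String, Long> m = new LinkedHashMap<>();
    for (int i = 0; i < kv.length; i += 2) {
      m.put((String) kv[i], ((Number) kv[i + 1]).longValue());
    }
    return m;
  }

  /** First n child terms under sort 'count asc', ties broken by term. */
  static List<String> topChildrenAsc(Map<String, Long> counts, int n) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> {
      int c = Long.compare(a.getValue(), b.getValue());
      return c != 0 ? c : a.getKey().compareTo(b.getKey());
    });
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, Long> e : entries) {
      if (out.size() == n) break;
      out.add(e.getKey());
    }
    return out;
  }

  /** Child terms the coordinator would pick to refine after phase #1. */
  static List<String> refinementCandidates(
      List<List<ParentBucket>> shards, int parentOverrequest, int childLimit) {
    Map<String, Long> parentCounts = new HashMap<>();
    Map<String, Map<String, Long>> childCounts = new HashMap<>();
    for (List<ParentBucket> shard : shards) {
      // Each shard sends its top (limit + overrequest) parent buckets; limit is 1 here.
      int sent = Math.min(shard.size(), 1 + parentOverrequest);
      for (ParentBucket p : shard.subList(0, sent)) {
        parentCounts.merge(p.val, p.count, Long::sum);
        Map<String, Long> merged = childCounts.computeIfAbsent(p.val, k -> new HashMap<>());
        for (String child : topChildrenAsc(p.children, childLimit)) {
          merged.merge(child, p.children.get(child), Long::sum);
        }
      }
    }
    String top = null; // parent bucket with the highest merged count
    for (Map.Entry<String, Long> e : parentCounts.entrySet()) {
      if (top == null || e.getValue() > parentCounts.get(top)) top = e.getKey();
    }
    if (top == null) return new ArrayList<>();
    return topChildrenAsc(childCounts.get(top), childLimit);
  }

  /** Invented data: shardC1 lists 'some' ahead of 'z_all' because of a tie. */
  static List<String> demo(int parentOverrequest) {
    List<ParentBucket> shardC0 = Arrays.asList(
        new ParentBucket("z_all", 10, counts("A", 1, "B", 1, "C", 1, "X", 3)));
    List<ParentBucket> shardC1 = Arrays.asList(
        new ParentBucket("some", 6, counts("C", 5, "X", 1)),
        new ParentBucket("z_all", 6, counts("C", 5, "X", 1)));
    return refinementCandidates(Arrays.asList(shardC0, shardC1), parentOverrequest, 3);
  }

  public static void main(String[] args) {
    System.out.println(demo(0)); // [A, B, C]: only shardC0 contributed to z_all
    System.out.println(demo(1)); // [A, B, X]: shardC1's z_all bucket shows C is a poor candidate
  }
}
```

With no parent overrequest the coordinator never sees shardC1's view of {{z_all}}, so it has no reason to consider X; one extra parent bucket from shardC1 is enough to change the candidate list in this toy data.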
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536494#comment-16536494 ]
Hoss Man commented on SOLR-12343:

Found one -- it seems to be specific to the situation where {{overrequest==0}}, and the facet is nested under another facet?

Playing with the values of {{top_over}} and {{top_refine}}, it doesn't seem to matter if the parent facet is refined, but the key is whether the top facet also uses {{overrequest:0}} (fails) or {{overrequest:999}} (passes)

{noformat}
[junit4] 2> 9990 INFO (qtp1276305453-48) [x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/select params={df=text=false&_facet_={}=id=score=1048580=0=true=127.0.0.1:47372/solr/collection1=0=2=*:*={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'sum_p+asc',+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}=1531102182236=true=javabin} hits=9 status=0 QTime=17
[junit4] 2> 9994 INFO (qtp1276305453-49) [x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/select params={df=text=false&_facet_={"refine":{"all":{"_p":[["z_all",{"cat_count":{"_l":["A","B","C"]},"cat_price":{"_l":["A","B","C"]}}]]}}}=2097152=127.0.0.1:47372/solr/collection1=0=2=*:*={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'sum_p+asc',+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}=1531102182236=true=false=javabin} hits=9 status=0 QTime=1
[junit4] 2> 9996 INFO (qtp1503674478-65) [x:collection1] o.a.s.c.S.Request [collection1] webapp=/solr path=/select params={shards=127.0.0.1:54950/solr/collection1,127.0.0.1:47372/solr/collection1,127.0.0.1:52833/solr/collection1=debugQuery=true=*:*={+all:{+type:terms,+field:all_ss,+limit:1,+refine:true,+overrequest:0+++,+facet:{+++cat_count:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'count+asc'+},+++cat_price:{+type:terms,+field:cat_s,+limit:3,+overrequest:0+++,+refine:true,+sort:'sum_p+asc',+facet:+{+sum_p:+'sum(price_i)'+}+}}+}+}=true=0=json=2.2} hits=19 status=0 QTime=25
[junit4] 2> 9997 ERROR (TEST-TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN-seed#[775BF43EF8268D50]) [] o.a.s.SolrTestCaseHS query failed JSON validation. error=mismatch: 'X'!='C' @ facets/all/buckets/[0]/cat_count/buckets/[2]/val
[junit4] 2> expected =facets=={ count: 19,all:{ buckets:[ { val:z_all, count: 19,cat_count:{ buckets:[ {val:A,count:1}, {val:B,count:1}, {val:X,count:4},] },cat_price:{ buckets:[ {val:A,count:1,sum_p:1.0}, {val:B,count:1,sum_p:1.0}, {val:X,count:4,sum_p:4.0},] }} ] } }
[junit4] 2> response = {
[junit4] 2> "responseHeader":{
[junit4] 2> "status":0,
[junit4] 2> "QTime":25},
[junit4] 2> "response":{"numFound":19,"start":0,"maxScore":1.0,"docs":[]
[junit4] 2> },
[junit4] 2> "facets":{
[junit4] 2> "count":19,
[junit4] 2> "all":{
[junit4] 2> "buckets":[{
[junit4] 2> "val":"z_all",
[junit4] 2> "count":19,
[junit4] 2> "cat_price":{
[junit4] 2> "buckets":[{
[junit4] 2> "val":"A",
[junit4] 2> "count":1,
[junit4] 2> "sum_p":1.0},
[junit4] 2> {
[junit4] 2> "val":"B",
[junit4] 2> "count":1,
[junit4] 2> "sum_p":1.0},
[junit4] 2> {
[junit4] 2> "val":"C",
[junit4] 2> "count":6,
[junit4] 2> "sum_p":6.0}]},
[junit4] 2> "cat_count":{
[junit4] 2> "buckets":[{
[junit4] 2> "val":"A",
[junit4] 2> "count":1},
[junit4] 2> {
[junit4] 2> "val":"B",
[junit4] 2> "count":1},
[junit4] 2> {
[junit4] 2> "val":"C",
[junit4] 2> "count":6}]}}]}}}
[junit4] 2>
[junit4] 2> 1 INFO (TEST-TestJsonFacetRefinement.testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN-seed#[775BF43EF8268D50]) [] o.a.s.SolrTestCaseJ4 ###Ending testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536481#comment-16536481 ]
Hoss Man commented on SOLR-12343:

which assertion? stacktrace? reproduce line? .. does the seed actually reproduce?

There's virtually no randomization in the test at all, except for the number of filler terms/overrequest.

If you're seeing a seed that reproduces, it makes me wonder if there is an edge case / off by one error based on the number of buckets ... if the seed doesn't reproduce (reliably) then it makes me wonder if it's an edge case that has to do with which order the shards respond (ie: how the merger initializes the datastructs that get merged)
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536095#comment-16536095 ]
Yonik Seeley commented on SOLR-12343:

I'm occasionally getting a failure in testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN. I haven't tried digging into it yet though.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536065#comment-16536065 ]
Lucene/Solr QA commented on SOLR-12343:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | compile | 1m 59s | master passed |
|| || || || Patch Compile Tests ||
| +1 | compile | 1m 49s | the patch passed |
| +1 | javac | 1m 49s | the patch passed |
| +1 | Release audit (RAT) | 1m 55s | the patch passed |
| +1 | Check forbidden APIs | 1m 49s | the patch passed |
| +1 | Validate source patterns | 1m 49s | the patch passed |
| +1 | Validate ref guide | 1m 49s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 65m 33s | core in the patch failed. |
| | | 73m 9s | |

|| Reason || Tests ||
| Failed junit tests | solr.cloud.autoscaling.IndexSizeTriggerTest |
| | solr.cloud.api.collections.ShardSplitTest |
| | solr.cloud.ForceLeaderTest |
| | solr.cloud.api.collections.TestCollectionsAPIViaSolrCloudCluster |
| | solr.cloud.autoscaling.sim.TestLargeCluster |

|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12343 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930572/SOLR-12343.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide |
| uname | Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / b7d14c5 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014 |
| Default Java | 1.8.0_172 |
| unit | https://builds.apache.org/job/PreCommit-SOLR-Build/140/artifact/out/patch-unit-solr_core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/140/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/140/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534377#comment-16534377 ]
Yonik Seeley commented on SOLR-12343:

bq. it will stop returning the facet range "other" buckets completely since currently no code refines them at all

Hmmm, so the patch I attached seems like it would only remove incomplete buckets in field facets under "other" buckets (i.e. if they don't actually need refining to be complete, they won't be removed by the current patch). But this could still be worse in some cases (missing vs incomplete when refinement is requested), so I agree this can wait until SOLR-12516 is done.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533943#comment-16533943 ]
Hoss Man commented on SOLR-12343:

yonik: hold up ... i put this on the backburner because of SOLR-12516 (which i'm currently actively working on).

Fixing SOLR-12343 before SOLR-12516 will make SOLR-12516 a lot worse in the common case (it will stop returning the facet range "other" buckets completely since currently no code refines them at all)
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530657#comment-16530657 ]
Yonik Seeley commented on SOLR-12343:

I think some of what I just worked on for SOLR-12326 is related to (or can be used by) this issue.

FacetRequestSortedMerger now has a "BitSet shardHasMoreBuckets" to help deal with the fact that complete buckets do not need participation from every shard. That info in conjunction with Context.sawShard should be enough to tell if a bucket is already "complete". For every bucket that isn't complete, we can either refine it, or drop it.
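The completeness idea in the comment above can be sketched as follows. This is a hedged illustration only, not the code in FacetRequestSortedMerger: the class, fields, and methods are hypothetical stand-ins, and the rule it assumes is that a bucket is "complete" when every shard that still had more buckets to offer has already contributed to it. The demo data replays the termX/termY scenario from the issue description with invented counts and a {{count asc}} sort, using only the "drop it" option for incomplete buckets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy completeness check; hypothetical names, not Solr's FacetRequestSortedMerger. */
final class BucketCompletenessSketch {

  /** A merged bucket plus the set of shards that contributed to it. */
  static final class MergedBucket {
    final String val;
    final long count;                      // sum of the counts received so far
    final BitSet sawShard = new BitSet();  // bit i set: shard i contributed

    MergedBucket(String val, long count, int... contributingShards) {
      this.val = val;
      this.count = count;
      for (int s : contributingShards) sawShard.set(s);
    }
  }

  /**
   * Assumed rule: a shard that returned all of its buckets cannot be hiding
   * data, so only shards with more buckets AND no contribution are a problem.
   */
  static boolean isComplete(MergedBucket b, BitSet shardHasMoreBuckets) {
    BitSet unaccounted = (BitSet) shardHasMoreBuckets.clone();
    unaccounted.andNot(b.sawShard);
    return unaccounted.isEmpty();
  }

  /** Sort 'count asc' and return the first 'limit' values, optionally dropping incomplete buckets. */
  static List<String> topN(List<MergedBucket> buckets, BitSet shardHasMoreBuckets,
                           int limit, boolean dropIncomplete) {
    List<MergedBucket> sorted = new ArrayList<>(buckets);
    sorted.sort((a, b) -> Long.compare(a.count, b.count));
    List<String> out = new ArrayList<>();
    for (MergedBucket b : sorted) {
      if (out.size() == limit) break;
      if (dropIncomplete && !isComplete(b, shardHasMoreBuckets)) continue;
      out.add(b.val);
    }
    return out;
  }

  /** Invented data: shard 0 plays "shard1", shard 1 plays "shard2". */
  static List<String> demo(boolean dropIncomplete) {
    BitSet more = new BitSet();
    more.set(0);
    more.set(1); // both shards truncated their bucket lists
    List<MergedBucket> buckets = Arrays.asList(
        new MergedBucket("termX", 51, 0, 1), // refined: both shards contributed
        new MergedBucket("termY", 2, 0));    // never refined: shard 1 is missing
    return topN(buckets, more, 1, dropIncomplete);
  }

  public static void main(String[] args) {
    System.out.println(demo(false)); // [termY]: incomplete bucket bubbles into the topN
    System.out.println(demo(true));  // [termX]: incomplete bucket is dropped instead
  }
}
```

Dropping is the cheaper of the two options named in the comment; refining the incomplete bucket instead would cost another round trip to the missing shards but could keep a bucket that genuinely belongs in the topN.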
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519673#comment-16519673 ]
Hoss Man commented on SOLR-12343:

{quote}Not sure if it relates to this bug... {quote}

No, that's an unrelated stupid test mistake that i've fixed locally and am currently hammering on -- but thanks for pointing it out!
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518698#comment-16518698 ]

Steve Rowe commented on SOLR-12343:

Not sure if it relates to this bug -- please move/add if not -- but my Jenkins found a reproducing failure for {{TestCloudJSONFacetSKG.testBespoke()}}:

{noformat}
Checking out Revision 008bc74bebef96414f19118a267dbf982aba58b9 (refs/remotes/origin/master)
[...]
ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testBespoke -Dtests.seed=5D223D88BF5BF89 -Dtests.slow=true -Dtests.locale=bg-BG -Dtests.timezone=America/Asuncion -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.11s J0 | TestCloudJSONFacetSKG.testBespoke <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: Didn't check a single bucket???
   [junit4]    >   at __randomizedtesting.SeedInfo.seed([5D223D88BF5BF89:E09A7E14375787E]:0)
   [junit4]    >   at org.apache.solr.cloud.TestCloudJSONFacetSKG.testBespoke(TestCloudJSONFacetSKG.java:219)
   [junit4]    >   at java.lang.Thread.run(Thread.java:748)
[...]
   [junit4]   2> NOTE: test params are: codec=FastCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=FAST, chunkSize=4, maxDocsPerChunk=1, blockSize=332), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=FAST, chunkSize=4, blockSize=332)), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@4052d535), locale=el, timezone=Indian/Antananarivo
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_151 (64-bit)/cpus=16,threads=1,free=213710424,total=526909440
{noformat}
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516506#comment-16516506 ]

Hoss Man commented on SOLR-12343:

Updated patch with more tests and some code tweaks based on a few things the new tests caught.

Still outstanding is the question of the new BitSets I added...

{quote}
* buckets now keep track of how many shards contributed to them ...
** there's a nocommit in here about the possibility of re-using the {{Context.sawShard}} BitSet instead – but i couldn't wrap my head around an efficient way to do it so i punted
* ...buckets are excluded if a bucket doesn't have contributions from as many shards as the FacetField...
** again, i needed a new BitSet at the FacetField level to count the shards – because Context.numShards may include shards that never return any results for the facet (ie: an empty shard), so they never merge any data at all
{quote}

I _think_ it should be possible to re-implement the {{FacetBucket.getNumShardsMerged()}} method (i added) using {{Context.sawShard}} by using {{sawShard.get(bucketNum * numShards, bucketNum * numShards + numShards)}} to take a "slice" of the BitSet just for the current bucket and then look at its cardinality. The added cost of taking the slice only for buckets being considered in sorted order is probably a better trade-off than the overhead of creating a new BitSet for every FacetBucket even if they are never considered for the response.

But I still don't see any way to efficiently figure out the "shards that participated" info needed at the {{FacetField}} level using the existing {{sawShard}} BitSet -- particularly with the changes I had to make to account for the case where a shard has docs participating in a facet, but not matching any buckets (see {{testSortedSubFacetRefinementWhenParentOnlyReturnedByOneShard}}). Fortunately that's just one new BitSet per FacetField instance (not per bucket).

I'll look at refactoring {{FacetBucket.getNumShardsMerged()}} to use {{Context.sawShard}} soon.
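The "slice" idea from the comment above can be sketched with plain {{java.util.BitSet}}. This is only an illustration: it assumes the bit for a (bucket, shard) pair lives at index {{bucketNum * numShards + shardNum}}, which is what the expression in the comment implies, and the class and method names are hypothetical rather than anything in Solr.

```java
import java.util.BitSet;

final class ShardSliceSketch {
    /** Hypothetical helper: how many shards have contributed to one bucket. */
    static int numShardsMerged(BitSet sawShard, int bucketNum, int numShards) {
        int from = bucketNum * numShards;
        // BitSet.get(from, to) copies the bits in [from, to) into a new BitSet,
        // so the cost is paid only for buckets we actually inspect.
        return sawShard.get(from, from + numShards).cardinality();
    }

    public static void main(String[] args) {
        int numShards = 3;
        BitSet sawShard = new BitSet();
        sawShard.set(0 * numShards + 0); // bucket 0 seen on shard 0
        sawShard.set(0 * numShards + 2); // bucket 0 seen on shard 2
        sawShard.set(1 * numShards + 1); // bucket 1 seen on shard 1
        System.out.println(numShardsMerged(sawShard, 0, numShards)); // 2
        System.out.println(numShardsMerged(sawShard, 1, numShards)); // 1
    }
}
```

The slice allocates a small temporary BitSet per call, which matches the trade-off described: a per-lookup cost instead of a permanent BitSet on every bucket.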
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490020#comment-16490020 ]

Hoss Man commented on SOLR-12343:

{quote}there is a new {{overrefine:N}} option which works similar to overrequest – but instead of determining how many "extra" terms to request in phase#1, it determines how many "extra" buckets should be in {{numBucketsToCheck}} for refinement in phase #2 ...{quote}

It occurs to me now that adding this option should also provide a "solution" for SOLR-11733 ... people who are concerned about refining long tail terms can set {{overrefine}} really high.
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487839#comment-16487839 ]

Hoss Man commented on SOLR-12343:

{quote}... I think it should just be considered a bug.{quote}

That's pretty much my feeling, but I wasn't sure.

{quote}Truncating the list of buckets to N before the refinement phase would fix the bug, but it would also throw away complete buckets that could make it into the top N after refinement.{quote}

oh right ... yeah, i was forgetting about buckets that got data from all shards in phase #1.

{quote}Exactly which buckets we chose to refine (and exactly how many) can remain an implementation detail. ...{quote}

right ... it can be heuristically determined, and very conservative in cases where we know it doesn't matter – but i still think there should be an explicit option...

I worked up a patch similar to the straw man i outlined above – except that i didn't add the {{refine:required}} variant since we're in agreement that this is a bug.

In the new patch:
* buckets now keep track of how many shards contributed to them
** I did this with a quick and dirty BitSet instead of an {{int numShardsContributing}} counter since we have to handle the possibility that {{mergeBuckets()}} will get called more than once for a single shard when we have partial refinement of sub-facets
** there's a nocommit in here about the possibility of re-using the {{Context.sawShard}} BitSet instead – but i couldn't wrap my head around an efficient way to do it so i punted
* during the final "pruning" in {{FacetFieldMerger.getMergedResult()}} buckets are excluded if a bucket doesn't have contributions from as many shards as the FacetField
** again, i needed a new BitSet at the FacetField level to count the shards – because {{Context.numShards}} may include shards that never return *any* results for the facet (ie: an empty shard), so they never merge any data at all
* there is a new {{overrefine:N}} option which works similar to overrequest – but instead of determining how many "extra" terms to request in phase#1, it determines how many "extra" buckets should be in {{numBucketsToCheck}} for refinement in phase #2 (but if some buckets are already fully populated in phase #1, then the actual number "refined" in phase#2 can be lower than limit+overrefine)
** the default heuristic currently pays attention to the sort – since (IIUC) {{count desc}} and {{index asc|desc}} should never need any "over refinement" unless {{mincount > 1}}
** if we have a non-trivial sort, and the user specified an explicit {{overrequest:N}} then the default heuristic for {{overrefine}} uses the same value {{N}}
*** because i'm assuming if people have explicitly requested {{sort:SPECIAL, refine:true, overrequest:N}} then they care about the accuracy of the terms to some degree N, and the bigger N is the more we should care about over-refinement as well.
** if neither {{overrequest}} nor {{overrefine}} is explicitly set, then we use the same {{limit * 1.1 + 4}} type heuristic as {{overrequest}}
** there's another nocommit here though: if we're using a heuristic, should we be scaling the derived {{numBucketsToCheck}} based on {{mincount}} ? ... if {{mincount=M > 1}} should we be doing something like {{numBucketsToCheck *= M}} ??
*** although, thinking about it now – this kind of mincount based factor would probably make more sense in the {{overrequest}} heuristic? maybe for {{overrefine}} we should look at how many buckets were already fully populated in phase#1 _AND_ meet the mincount, and use the difference between that number and the limit to decide a scaling factor?
*** either way: can probably TODO this for a future enhancement.
* Testing wise...
** These changes fix the problems in the previous test patch
** I've also added some more tests, but there are nocommits to add a lot more, including verification of nested facets
** I didn't want to go too deep down the testing rabbit hole until i was sure we wanted to go this route.

what do you think?
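The default {{overrefine}} rules described in this comment can be sketched roughly as follows. This is not the patch's code: the method name, the "negative means not set" convention, and the {{sortCanWorsen}} flag (standing in for "non-trivial sort, or {{mincount > 1}}") are all assumptions made for illustration, and offset handling is ignored.

```java
final class OverrefineSketch {
    /**
     * Hypothetical helper: how many buckets to consider for refinement in phase #2.
     * overrequest / overrefine use a negative value to mean "not explicitly set".
     */
    static int numBucketsToCheck(int limit, int overrequest, int overrefine, boolean sortCanWorsen) {
        if (overrefine >= 0) {
            return limit + overrefine;      // an explicit overrefine:N always wins
        }
        if (!sortCanWorsen) {
            return limit;                   // e.g. count desc / index sort with mincount <= 1
        }
        if (overrequest >= 0) {
            return limit + overrequest;     // reuse the explicit overrequest:N as overrefine
        }
        return (int) (limit * 1.1 + 4);     // same style of heuristic as overrequest
    }

    public static void main(String[] args) {
        System.out.println(numBucketsToCheck(10, -1, 5, true));   // 15
        System.out.println(numBucketsToCheck(10, -1, -1, false)); // 10
        System.out.println(numBucketsToCheck(10, 20, -1, true));  // 30
        System.out.println(numBucketsToCheck(10, -1, -1, true));  // 15
    }
}
```

Whether the heuristic should also scale with {{mincount}} is exactly the open nocommit question, so the sketch leaves it out.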
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481366#comment-16481366 ]

Yonik Seeley commented on SOLR-12343:

I think the most important thing here is that individual buckets should have correct stats. The behavior uncovered here was not intentional and isn't useful, so I think it should just be considered a bug.

Truncating the list of buckets to N before the refinement phase would fix the bug, but it would also throw away complete buckets that could make it into the top N after refinement. One could tweak to only throw away incomplete buckets after the top N, but that still leaves the filtering complications you brought up. In the long term, perhaps a cursorMark approach would work better in conjunction with filtering? Although it does feel like paging facets is a less important feature in general.

Exactly which buckets we choose to refine (and exactly how many) can remain an implementation detail. The essence of the simple refinement algorithm is:
1) collect top buckets from each shard
2) refine some subset of those buckets (refinement == ensure every shard that can contribute to that bucket has)
3) return only refined buckets
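Step 3 of the algorithm outlined above ("return only refined buckets") can be sketched as a pruning pass over the merged buckets. This is a simplified illustration, not Solr's implementation: the {{Bucket}} class, the {{shardsMerged}} counter and the fixed {{count asc}} sort are all hypothetical, and "every shard that can contribute" is approximated by a plain count of participating shards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RefinedOnlySketch {
    /** Hypothetical merged bucket: term, known count, and how many shards reported it. */
    static final class Bucket {
        final String term;
        final long count;
        final int shardsMerged;

        Bucket(String term, long count, int shardsMerged) {
            this.term = term;
            this.count = count;
            this.shardsMerged = shardsMerged;
        }
    }

    /** Sort by count asc, then keep up to `limit` buckets that every participating shard reported. */
    static List<String> topRefined(List<Bucket> merged, int shardsParticipating, int limit) {
        List<Bucket> sorted = new ArrayList<>(merged);
        sorted.sort(Comparator.comparingLong((Bucket b) -> b.count));
        List<String> out = new ArrayList<>();
        for (Bucket b : sorted) {
            if (out.size() >= limit) break;
            if (b.shardsMerged < shardsParticipating) continue; // unrefined: never returned
            out.add(b.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Bucket> merged = new ArrayList<>();
        merged.add(new Bucket("termX", 101, 2)); // refined against both shards
        merged.add(new Bucket("termY", 2, 1));   // only shard1 has contributed
        System.out.println(topRefined(merged, 2, 1)); // [termX]
    }
}
```

Skipping unrefined buckets inside the loop, rather than truncating first, lets a later fully populated bucket take the slot.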
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471335#comment-16471335 ]

Hoss Man commented on SOLR-12343:

My initial thinking was that {{FacetRequestSortedMerger.sortBuckets()}} should go ahead and truncate the list of buckets based on the {{limit+offset}} as the very last thing it does – for the "pre-refinement" call to sortBuckets() this wouldn't change anything about the buckets selected for refinement, and for the "post-refinement" call to sortBuckets() it would only change the order of the buckets already refined – bug goes away. There's even a comment on the pre-refinement call that says {{// todo: make sure this filters buckets as well}} which seemed to be directly on point.

Except... looking at the post-refinement use of sortBuckets() in FacetFieldMerger, I realize that the {{mincount}} type "filtering" (which is probably what that {{// todo}} actually referred to) needs to be applied *after* the buckets are sorted, but before pruning down based on the offset+limit. With something like {{count desc}} it wouldn't matter if we "pre-truncate" the list, because if any of the refined buckets don't have a count >= mincount, then there's no chance any of the un-refined buckets will satisfy that mincount either ... but for things like {{index asc|desc}} or sorting by functions: it definitely matters in order to ensure we return the full "limit" # of buckets.

Although, I guess a key question i have is: if the user has explicitly requested refinement, then is there really any value in returning the full "limit" # of buckets if some of those buckets aren't refined? That really seems like the crux of this bug: to me, it seems like when refinement is requested we should *NEVER* return an unrefined bucket (ie: a bucket that is lying about its count/stats) ... but I can imagine other folks might feel differently. Anyone have strong opinions?

For now, I'll assume the current behavior is considered desirable by some, and brainstorm potential enhancements to make it optional...

Perhaps we should add a new {{refine:required}} variant? If the user says refinement is required, then {{sortBuckets()}} could pre-truncate.

Or maybe better still:
* we add an {{int numShardsContributing = 1}} to {{FacetBucket}} that gets incremented every time a shard is merged in.
* Add the new {{refine:required}} option but implemented differently...
** {{sortBuckets()}} doesn't change – leave all the un-refined buckets in {{sortedBuckets}} all the time
** Consumers of {{sortedBuckets}} (like {{FacetFieldMerger.getMergedResult()}}) are responsible for checking the type of refinement:
*** if it was {{required}}, then filter the buckets on {{numShardsContributing}} just like the existing filtering on mincount in the same loop
* *Additionally:* add a new {{overrefine:N}} option that can be used in conjunction with, or independently from, {{refine:required}}
** Default to '0' for back compat
** used during the refinement phase similar to how "overrequest" is used during the initial request
*** ie: {{FacetRequestSortedMerger}} would add it to the limit when computing {{numBucketsToCheck}}

This way, clients that are willing to "pay extra" during refinement can request that additional terms get refined – which can be useful for non-trivial sorts to ensure that the "best" buckets really are returned. Independently, clients can indicate if they are unwilling to accept un-refined buckets in the response because they care about accuracy, or would rather have as many buckets (up to limit) returned as possible, even if they couldn't be refined.

What do folks think? [~yo...@apache.org] do you see any problems with this approach? or have alternative suggestions?
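The point about applying the {{mincount}} filter after sorting but before pruning to the limit can be seen with a tiny {{index asc}} example. This is a sketch with hypothetical names, using a {{TreeMap}} as a stand-in for buckets already in index order; it is not the merger code itself, and it treats {{mincount}} as "count of at least this number".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MincountOrderSketch {
    /** "Pre-truncate": cut to `limit` first, then drop buckets below mincount. */
    static List<String> truncateThenFilter(TreeMap<String, Integer> byIndex, int limit, int mincount) {
        List<String> out = new ArrayList<>();
        int seen = 0;
        for (Map.Entry<String, Integer> e : byIndex.entrySet()) {
            if (seen++ >= limit) break;
            if (e.getValue() >= mincount) out.add(e.getKey());
        }
        return out;
    }

    /** Drop buckets below mincount first, then cut to `limit`. */
    static List<String> filterThenTruncate(TreeMap<String, Integer> byIndex, int limit, int mincount) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : byIndex.entrySet()) {
            if (out.size() >= limit) break;
            if (e.getValue() >= mincount) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> byIndex = new TreeMap<>();
        byIndex.put("a", 1);
        byIndex.put("b", 3);
        byIndex.put("c", 5);
        System.out.println(truncateThenFilter(byIndex, 2, 2)); // [b]
        System.out.println(filterThenTruncate(byIndex, 2, 2)); // [b, c]
    }
}
```

With limit=2 and mincount=2, pre-truncating yields only one bucket even though two qualify, which is the "full limit # of buckets" concern raised above.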
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471156#comment-16471156 ]

Hoss Man commented on SOLR-12343:

Ultimately what seems to be at issue here is a discrepancy between how Yonik designed the "simple" facet algorithm, and how it's implemented – but it's only problematic in these "additional information from refinement can make sort values 'worse'" type situations.

As Yonik noted in SOLR-11733 regarding the design of {{refine:simple|true}} ...

{quote}[compared to facet.field] ...the refinement algorithm being different (and for a single-level facet field, simpler). It can be explained as:
1) find buckets to return as if you weren't doing refinement
2) for those buckets, make sure all shards have contributed to the statistics
i.e. simple refinement doesn't change the buckets you get back.{quote}

But in actuality, adding {{refine:true}} _can_ change the buckets you get back. In my example above, if {{refine:false}} was used, termX would have ultimately been returned (with an unrefined count) – but because of refinement it's not returned, and termY is returned in its place.

I've attached a simple test patch demonstrating the problem but I haven't yet dug into the code to figure out the best fix.

I _suspect_ what's needed (to stick to the intent of {{refine:simple}}) is that after the coordinator picks the buckets that need to be refined, it should prune down the list of "all known" (size {{limit=N + overrequest=R}}) buckets to just the "buckets to return" (size {{limit=N}}) so that once the refinement values come in the _set_ of buckets doesn't change, even if the _order_ of the buckets does.
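The termX/termY scenario can be reproduced with a toy two-shard simulation. This is purely illustrative and assumes, as in the issue description, that only shard1 reports the two terms in phase#1; the class and method names are made up, and per-shard counts are plain maps rather than anything resembling Solr's merger classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RefinementBugSketch {
    /** Top-`limit` terms under `count asc` when ALL known buckets are re-sorted after refinement. */
    static List<String> buggyTopN(Map<String, Integer> shard1, Map<String, Integer> shard2, int limit) {
        // phase#1: only shard1 reported these terms, so the known counts are shard1's counts
        Map<String, Integer> known = new LinkedHashMap<>(shard1);
        List<String> sorted = sortAsc(known);
        // phase#2: refine only the current topN against shard2
        for (String term : sorted.subList(0, Math.min(limit, sorted.size()))) {
            known.merge(term, shard2.getOrDefault(term, 0), Integer::sum);
        }
        // the problem: unrefined buckets take part in the re-sort and can bubble up
        List<String> resorted = sortAsc(known);
        return new ArrayList<>(resorted.subList(0, Math.min(limit, resorted.size())));
    }

    static List<String> sortAsc(Map<String, Integer> counts) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> Integer.compare(counts.get(a), counts.get(b)));
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> shard1 = new LinkedHashMap<>();
        shard1.put("termX", 1);
        shard1.put("termY", 2);
        Map<String, Integer> shard2 = new LinkedHashMap<>();
        shard2.put("termX", 100);
        shard2.put("termY", 100);
        // termX is refined to 101, so the unrefined termY (known count 2) wins the re-sort
        System.out.println(buggyTopN(shard1, shard2, 1)); // [termY]
    }
}
```

Here the true totals are 101 for termX and 102 for termY, so the response ends up containing the wrong term with a count that only reflects shard1.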