[ https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712293#comment-16712293 ]
Jason Gerlowski edited comment on SOLR-13045 at 12/7/18 3:21 AM:
-----------------------------------------------------------------

Looking at {{testCreateCollectionAddReplica}} first.

I'm still in the early stages of looking into this, but I see some signs that this is a sim-framework issue rather than a production problem. I'm not super familiar with the sim-framework though, so I'll give some detail here in case anyone with more context can correct me and save me from a potential red herring.

*TL;DR* I believe this is a test-framework bug related to how the {{SimClusterStateProvider}} caches clusterstate values.

The test starts by creating a collection using a specific policy. Maybe 1 time in 10 it'll fail in a {{CloudTestUtils.waitForState}} call. On these failures, the {{waitForState}} call fails because the collection (supposedly) doesn't have a leader:

{code}
last coll state: DocCollection(testCreateCollectionAddReplica//clusterstate.json/5)={
  "replicationFactor":"1",
  "pullReplicas":"0",
  "router":{"name":"compositeId"},
  "maxShardsPerNode":"1",
  "autoAddReplicas":"false",
  "nrtReplicas":"1",
  "tlogReplicas":"0",
  "autoCreated":"true",
  "policy":"c1",
  "shards":{"shard1":{
      "replicas":{"core_node1":{
          "core":"testCreateCollectionAddReplica_shard1_replica_n1",
          "SEARCHER.searcher.maxDoc":0,
          "SEARCHER.searcher.deletedDocs":0,
          "INDEX.sizeInBytes":10240,
          "node_name":"127.0.0.1:10068_solr",
          "state":"active",
          "type":"NRT",
          "INDEX.sizeInGB":9.5367431640625E-6,
          "SEARCHER.searcher.numDocs":0}},
      "range":"80000000-7fffffff",
      "state":"active"}}}
{code}

But other statements in the logs indicate that this collection *does* have a leader.
We get this series of messages right as the test ends:

{code}
14445 INFO (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimClusterStateProvider ** creating new collection states, currentVersion=6
14446 INFO (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimClusterStateProvider JEGERLOW: Saving clusterstate
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimClusterStateProvider ** saved cluster state version 6
14446 INFO (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimSolrCloudTestCase #######################################
############ CLUSTER STATE ############
#######################################
## Live nodes: 2
## Empty nodes: 1
## Dead nodes: 0
## Collections:
## * testCreateCollectionAddReplica
## shardsTotal 1
## shardsState {active=1}
## shardsWithoutLeader 0
{code}

One thing that stands out to me is the different clusterstate versions in play here. The log snippets above show information from {{/clusterstate.json/5}} and {{/clusterstate.json/6}} respectively. I looked into {{SimClusterStateProvider}} and noticed that it caches the cluster state locally (see [here|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2086]) and warns readers that the cache must be explicitly cleared before new changes become visible. With this caching temporarily disabled, the test failure disappeared (or at least, I couldn't trigger it in 2000 runs).
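As a rough illustration of how a cache like this can hand out a stale view, here is a minimal single-threaded sketch. Every name in it ({{StaleClusterStateSketch}}, {{CachingProvider}}, {{Snapshot}}, {{electLeaderWithoutInvalidating}}) is hypothetical and made up for this example; it is not the actual {{SimClusterStateProvider}} code, and the version numbers 5 and 6 only mirror the ones seen in the logs.

```java
/**
 * Hypothetical model of a provider that caches cluster state.
 * NOT the real SimClusterStateProvider; names and fields are assumptions.
 */
class StaleClusterStateSketch {

  /** Immutable snapshot: a state version plus whether shard1 has a leader. */
  static final class Snapshot {
    final int version;
    final boolean hasLeader;

    Snapshot(int version, boolean hasLeader) {
      this.version = version;
      this.hasLeader = hasLeader;
    }
  }

  static final class CachingProvider {
    // "Live" state that mutations update directly.
    private int liveVersion = 5;
    private boolean liveHasLeader = false;

    // Cached view handed to readers; null means "rebuild on next read".
    private Snapshot cache;

    /** Returns the cached snapshot, rebuilding it only if the cache was cleared. */
    Snapshot getClusterState() {
      if (cache == null) {
        cache = new Snapshot(liveVersion, liveHasLeader);
      }
      return cache;
    }

    /** Explicit invalidation that every mutating codepath is expected to call. */
    void clearCache() {
      cache = null;
    }

    /** Buggy mutation: changes the live state but forgets to clear the cache. */
    void electLeaderWithoutInvalidating() {
      liveHasLeader = true;
      liveVersion++;
    }
  }

  public static void main(String[] args) {
    CachingProvider provider = new CachingProvider();
    provider.getClusterState();                 // populates the cache at version 5
    provider.electLeaderWithoutInvalidating();  // live state is now version 6

    Snapshot stale = provider.getClusterState();
    System.out.println("stale read: version=" + stale.version + " hasLeader=" + stale.hasLeader);

    provider.clearCache();
    Snapshot fresh = provider.getClusterState();
    System.out.println("fresh read: version=" + fresh.version + " hasLeader=" + fresh.hasLeader);
  }
}
```

In this model a reader that polls {{getClusterState()}} keeps seeing version 5 with no leader until something calls {{clearCache()}}, even though the live state has already moved to version 6 with a leader. That is the same shape as the mismatch between the two log snippets, though the real cause may well differ.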
I suspect that the test failure is caused by either (1) some codepath not properly clearing/resetting this clusterstate cache, or (2) a subtler synchronization bug in how this cache is locked down.

> Harden TestSimPolicyCloud
> -------------------------
>
> Key: SOLR-13045
> URL: https://issues.apache.org/jira/browse/SOLR-13045
> Project: Solr
> Issue Type: Test
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling
> Affects Versions: master (8.0)
> Reporter: Jason Gerlowski
> Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially
> {{testCreateCollectionAddReplica}}, have some flaky behavior, even after
> Mark's recent test-fix commit. This JIRA covers looking into and (hopefully)
> fixing this test failure.