[ 
https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712293#comment-16712293
 ] 

Jason Gerlowski edited comment on SOLR-13045 at 12/7/18 3:21 AM:
-----------------------------------------------------------------

Looking at {{testCreateCollectionAddReplica}} first.  I'm still in the early 
stages of looking into this, but I think I see some things pointing to this 
being a sim-framework issue rather than a production problem.  I'm not 
very familiar with the sim-framework though, so I'll try to give some detail 
here in case anyone with more context can correct me and save me from a 
potential red herring.

*TL;DR* I believe this to be a test-framework bug related to how the 
SimClusterStateProvider caches clusterstate values.

The test starts by creating a collection using a specific policy.  Roughly 1 
run in 10 fails in a {{CloudTestUtils.waitForState}} call.  On these failures, 
the {{waitForState}} call fails because the collection (supposedly) doesn't 
have a leader:
{code}
last coll state: DocCollection(testCreateCollectionAddReplica//clusterstate.json/5)={
  "replicationFactor":"1",
  "pullReplicas":"0",
  "router":{"name":"compositeId"},
  "maxShardsPerNode":"1",
  "autoAddReplicas":"false",
  "nrtReplicas":"1",
  "tlogReplicas":"0",
  "autoCreated":"true",
  "policy":"c1",
  "shards":{"shard1":{
      "replicas":{"core_node1":{
          "core":"testCreateCollectionAddReplica_shard1_replica_n1",
          "SEARCHER.searcher.maxDoc":0,
          "SEARCHER.searcher.deletedDocs":0,
          "INDEX.sizeInBytes":10240,
          "node_name":"127.0.0.1:10068_solr",
          "state":"active",
          "type":"NRT",
          "INDEX.sizeInGB":9.5367431640625E-6,
          "SEARCHER.searcher.numDocs":0}},
      "range":"80000000-7fffffff",
      "state":"active"}}}
{code}

But other statements in the logs indicate that this collection *does* have a 
leader.  We get this series of messages right as the test ends:
{code}
14445 INFO  (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimClusterStateProvider ** creating new collection states, currentVersion=6
14446 INFO  (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimClusterStateProvider JEGERLOW: Saving clusterstate
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimClusterStateProvider ** saved cluster state version 6
14446 INFO  (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimSolrCloudTestCase #######################################
############ CLUSTER STATE ############
#######################################
## Live nodes:          2
## Empty nodes: 1
## Dead nodes:          0
## Collections:
##  * testCreateCollectionAddReplica
##    shardsTotal       1
##    shardsState       {active=1}
##      shardsWithoutLeader     0
{code}

One thing that stands out to me is that different clusterstate versions are in 
play here.  The two log snippets above show information from 
{{/clusterstate.json/5}} and {{/clusterstate.json/6}} respectively.

I looked into {{SimClusterStateProvider}} and noticed that it caches the 
cluster state locally (see 
[here|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2086])
 and warns readers that the cache must be explicitly cleared before new changes 
become visible.  With this caching temporarily disabled, the test failure 
disappeared (or at least, I couldn't trigger it in 2000 runs).  I suspect 
that the test failure is caused by either (1) some codepath not properly 
clearing/resetting this clusterstate cache, or (2) a subtler synchronization 
bug in how this cache is locked down.
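
To make hypothesis (1) concrete, here is a minimal sketch of the failure mode I have in mind.  It is not the sim-framework code: the class and method names below are hypothetical and none of them come from {{SimClusterStateProvider}}.  It only assumes a provider that hands readers a cached snapshot until the cache is explicitly cleared, plus one write path that forgets to clear it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a state provider with a local cache.
// None of these names are taken from SimClusterStateProvider.
class CachedStateSketch {
    private final Map<String, String> leaders = new HashMap<>(); // shard -> leader replica
    private int version = 0;             // bumped on every write
    private Map<String, String> cached;  // null means "rebuild on next read"
    private int cachedVersion = -1;      // version the cached snapshot was built from

    // Readers reuse the cached snapshot until someone clears it.
    synchronized Map<String, String> getState() {
        if (cached == null) {
            cached = new HashMap<>(leaders);
            cachedVersion = version;
        }
        return cached;
    }

    synchronized int getVersion() { return version; }

    synchronized int getCachedVersion() { return cachedVersion; }

    // Correct write path: change the state, then clear the cache.
    synchronized void setLeader(String shard, String replica) {
        leaders.put(shard, replica);
        version++;
        cached = null;
    }

    // Buggy write path (assumed for illustration): the state changes,
    // but the cache is never cleared, so readers keep the old snapshot.
    synchronized void setLeaderWithoutInvalidating(String shard, String replica) {
        leaders.put(shard, replica);
        version++;
    }

    public static void main(String[] args) {
        CachedStateSketch s = new CachedStateSketch();
        s.getState(); // snapshot built at version 0, no leader yet
        s.setLeaderWithoutInvalidating("shard1", "core_node1");
        System.out.println("stale read sees leader: " + s.getState().containsKey("shard1"));
        System.out.println("cached version " + s.getCachedVersion()
            + ", current version " + s.getVersion());
        s.setLeader("shard1", "core_node1");
        System.out.println("fresh read sees leader: " + s.getState().containsKey("shard1"));
    }
}
```

In this sketch a reader polling {{getState()}} after the buggy write keeps seeing a leaderless snapshot built from an older version, even though the underlying state already has a leader at a newer version.  That is the same shape as the version 5 vs. version 6 mismatch in the logs above, though the sketch says nothing about which real codepath (if any) skips the cache reset.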


> Harden TestSimPolicyCloud
> -------------------------
>
>                 Key: SOLR-13045
>                 URL: https://issues.apache.org/jira/browse/SOLR-13045
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially 
> {{testCreateCollectionAddReplica}}, have some flaky behavior, even after 
> Mark's recent test-fix commit.  This JIRA covers looking into and (hopefully) 
> fixing this test failure.


