[ 
https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725884#comment-16725884
 ] 

Jason Gerlowski commented on SOLR-13045:
----------------------------------------

One of the remaining failures in TestSimPolicyCloud occurs in 
{{testCreateCollectionAddShardUsingPolicy}} when the initial collection 
creation (and subsequent shard creation) seems to violate a policy which 
specifies that all replicas should be created on the same node.  After looking 
closer, it looks like this comes down to a race condition between two 
threads attempting to set the autoscaling.json "ZK" node.

Two different threads touch the autoscaling config node in this test: the 
OverseerTriggerThread tries to set the default nodeAdded trigger, and the test 
code tries to set a policy that the test relies on.  These threads rely on 
optimistic concurrency versioning to ensure that updates don't clobber one 
another.  But SimDistribStateManager has a bug which prevents this from always 
working correctly.  The initial node version in the sim framework is -1, 
which is also the flag used to indicate "I don't care about concurrency, just 
overwrite the node".  (For comparison, ZkDistribStateManager has node versions 
start at 0.)  So a writer that passes the initial version as its expected 
version gets an unconditional overwrite instead of a version check.  Depending 
on timing, this lets the default nodeAdded trigger clobber the policy that our 
test relies on, causing the test to fail.
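
To make the -1 edge case concrete, here is a minimal sketch of the interleaving 
as I understand it.  VersionedNode and staleWriteClobbers are made-up names for 
illustration only; this is not the actual SimDistribStateManager code, just a 
toy node with the same "-1 means skip the check" convention.

```java
/** Hypothetical stand-in for a versioned config node (not a Solr class). */
class VersionedNode {
    /** Sentinel meaning "skip the version check and overwrite". */
    static final int ANY_VERSION = -1;

    private int version;
    private String data = "";

    VersionedNode(int initialVersion) {
        this.version = initialVersion;
    }

    /** Conditional write: returns false if expectedVersion is stale. */
    synchronized boolean setData(String newData, int expectedVersion) {
        if (expectedVersion != ANY_VERSION && expectedVersion != version) {
            return false; // optimistic concurrency check rejected the write
        }
        data = newData;
        version++;
        return true;
    }

    synchronized String getData() {
        return data;
    }

    synchronized int getVersion() {
        return version;
    }

    /**
     * Replays the race sequentially: both writers read the version first,
     * then the test writes its policy, then the trigger thread writes using
     * its (now stale) version.  Returns true if the policy was clobbered.
     */
    static boolean staleWriteClobbers(int initialVersion) {
        VersionedNode node = new VersionedNode(initialVersion);
        int seenByTest = node.getVersion();
        int seenByTrigger = node.getVersion();

        node.setData("policy", seenByTest);               // test code wins the race
        node.setData("nodeAdded trigger", seenByTrigger); // stale write from the other thread

        return !"policy".equals(node.getData());
    }
}
```

With an initial version of -1 the stale second write is indistinguishable from 
an "overwrite regardless" request, so it goes through; with an initial version 
of 0 the same write is rejected and the policy survives.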

So one fix that'll make this test (and probably others in the sim framework) 
more reliable is to ensure that SimDistribStateManager's node-versioning lines 
up better with ZkDistribStateManager's.  Or at least that it avoids this -1 
edge case.  I've been testing variations of a patch to accomplish this, and 
will upload my results shortly.

> Harden TestSimPolicyCloud
> -------------------------
>
>                 Key: SOLR-13045
>                 URL: https://issues.apache.org/jira/browse/SOLR-13045
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Assignee: Jason Gerlowski
>            Priority: Major
>         Attachments: SOLR-13045.patch, SOLR-13045.patch, jenkins.log.txt.gz
>
>
> Several tests in TestSimPolicyCloud, but especially 
> {{testCreateCollectionAddReplica}}, have some flaky behavior, even after 
> Mark's recent test-fix commit.  This JIRA covers looking into and (hopefully) 
> fixing this test failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
