[ 
https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713023#comment-16713023
 ] 

Jason Gerlowski commented on SOLR-13045:
----------------------------------------

I believe I found the race condition causing these failures. It looks like an 
issue between the {{waitForState}} polling, which occurs in the main test 
thread, and the leader-election execution, which occurs in a {{Future}} 
submitted to {{SimCloudManager}}'s ExecutorService.

The {{waitForState}} thread repeatedly asks for the cluster state. Each request 
looks a bit like this:
 * [Return cached value, if any. Otherwise continue|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2090]
 * [Grab lock|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2093]
 * [Clear cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2094]
 * [Build Map to store in cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2126]
 * [Set cache with Map|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2141]
 * [Release lock|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2144]

The leader election Future looks a bit like this:
 * [Give a ReplicaInfo "leader=true"|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L756]
 * [Clear cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L766]

Note that the leader election Future does this without acquiring the lock. Now 
imagine the following interleaving of these two threads:
 * [Thread-Test] Grab lock
 * [Thread-Test] Clear cache
 * [Thread-Test] Build Map to store in cache
 * [Thread-LeaderElection] Give ReplicaInfo "leader=true"
 * [Thread-LeaderElection] Clear cache
 * [Thread-Test] Set cache with Map

At the end of this interleaving the cache holds a Map that was built before the 
"leader=true" change, and nothing will ever clear it. Every later 
{{waitForState}} poll returns that stale value, so the polling goes on to fail.
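
The interleaving above can be replayed deterministically in a small sketch. The 
class, field, and method names here are hypothetical stand-ins rather than the 
actual {{SimClusterStateProvider}} code, and the latches exist only to force the 
ordering described in the list:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the race; not the real simulator classes.
public class StaleCacheRace {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile boolean leader = false;
    private volatile Map<String, Boolean> cache = null;

    // Latches used only to force the problematic interleaving.
    private final CountDownLatch mapBuilt = new CountDownLatch(1);
    private final CountDownLatch electionDone = new CountDownLatch(1);

    // Mirrors the polling steps: cached value, lock, clear, build, set, unlock.
    Map<String, Boolean> getClusterState() throws InterruptedException {
        Map<String, Boolean> cached = cache;
        if (cached != null) {
            return cached;
        }
        lock.lock();
        try {
            cache = null;
            Map<String, Boolean> built = new HashMap<>();
            built.put("leader", leader);   // snapshot taken before the election
            mapBuilt.countDown();          // let the election run now
            electionDone.await();          // election sets leader and clears cache
            cache = built;                 // stale snapshot overwrites the cleared cache
            return built;
        } finally {
            lock.unlock();
        }
    }

    // Mirrors the leader election Future: no lock is taken.
    void electLeader() throws InterruptedException {
        mapBuilt.await();
        leader = true;
        cache = null;
        electionDone.countDown();
    }

    // Returns what a later poll sees for "leader".
    public static boolean run() throws InterruptedException {
        StaleCacheRace sim = new StaleCacheRace();
        Thread election = new Thread(() -> {
            try {
                sim.electLeader();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        election.start();
        sim.getClusterState();
        election.join();
        return sim.getClusterState().get("leader");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("leader visible to poller: " + run()); // prints false
    }
}
```

In this model the second poll returns the cached Map, so it never observes the 
election result even though the election has completed.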

We should be able to fix this by having the leader election code use the same 
Lock used elsewhere. I've actually got this change staged locally and am 
running tests on it currently. If all looks well I should have this uploaded 
soon. One thing I'll be curious to see is whether this affects any of the other 
TestSim* failures we've seen recently. If we're lucky we may get 2 (or more) 
birds with this one stone.
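
A minimal sketch of that idea, assuming the election simply takes the same lock 
around its two steps. The names are again hypothetical, and the actual patch may 
be structured differently:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the fix; not the real simulator classes.
public class LockedLeaderElection {
    private final ReentrantLock lock = new ReentrantLock();
    private boolean leader = false;                      // guarded by lock
    private volatile Map<String, Boolean> cache = null;

    // Used only to start the election while the first poll holds the lock.
    private final CountDownLatch mapBuilt = new CountDownLatch(1);

    Map<String, Boolean> getClusterState() {
        Map<String, Boolean> cached = cache;
        if (cached != null) {
            return cached;
        }
        lock.lock();
        try {
            cache = null;
            Map<String, Boolean> built = new HashMap<>();
            built.put("leader", leader);
            if (mapBuilt.getCount() > 0) {
                mapBuilt.countDown();
                // Wait until the election thread is blocked on the lock,
                // which shows it can no longer slip in mid-rebuild.
                while (!lock.hasQueuedThreads()) {
                    Thread.yield();
                }
            }
            cache = built;
            return built;
        } finally {
            lock.unlock();
        }
    }

    // The election now updates state and clears the cache under the lock.
    void electLeader() throws InterruptedException {
        mapBuilt.await();
        lock.lock();
        try {
            leader = true;
            cache = null;
        } finally {
            lock.unlock();
        }
    }

    // Returns what a later poll sees for "leader".
    public static boolean run() throws InterruptedException {
        LockedLeaderElection sim = new LockedLeaderElection();
        Thread election = new Thread(() -> {
            try {
                sim.electLeader();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        election.start();
        sim.getClusterState();   // may still return the pre-election snapshot
        election.join();
        return sim.getClusterState().get("leader");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("leader visible to poller: " + run()); // prints true
    }
}
```

Because the election has to wait for the rebuild to finish, its cache clear 
always lands after the stale Map is stored, and the next poll rebuilds from the 
updated state.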

> Harden TestSimPolicyCloud
> -------------------------
>
>                 Key: SOLR-13045
>                 URL: https://issues.apache.org/jira/browse/SOLR-13045
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Assignee: Jason Gerlowski
>            Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially 
> {{testCreateCollectionAddReplica}}, have some flaky behavior, even after 
> Mark's recent test-fix commit.  This JIRA covers looking into and (hopefully) 
> fixing this test failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
