[ https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713023#comment-16713023 ]
Jason Gerlowski commented on SOLR-13045:
----------------------------------------

I believe I found the race condition causing these failures. It looks like an issue between the {{waitForState}} polling, which occurs in the main test thread, and the leader-election execution, which occurs in a {{Future}} submitted to {{SimCloudManager}}'s ExecutorService.

The {{waitForState}} thread repeatedly asks for the cluster state, which looks a bit like this:
* [return cached value, if any. Otherwise continue|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2090]
* [Grab lock|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2093]
* [Clear cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2094]
* [Build Map to store in cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2126]
* [Set cache with Map|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2141]
* [Release lock|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2144]

The Leader Election Future looks a bit like this:
* [Give a ReplicaInfo "leader=true"|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L756]
* [Clear cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L766]

Note that the leader election Future does this without acquiring the lock.

Now imagine the following interleaving of these two threads:
* [Thread-Test] Grab lock
* [Thread-Test] Clear cache
* [Thread-Test] Build Map to store in cache
* [Thread-LeaderElection] Give ReplicaInfo "leader=true"
* [Thread-LeaderElection] Clear cache
* [Thread-Test] Set cache with Map

At the end of this interleaving the cache has a value that's missing the latest "leader=true" changes, and nothing will ever clear it. So the {{waitForState}} polling will go on to fail.

We should be able to fix this by having the leader election code use the same Lock used elsewhere. I've actually got this change staged locally and am running tests on it currently. If all looks well I should have this uploaded soon.

One thing I'll be curious to see is whether this affects any of the other TestSim* failures we've seen recently. If we're lucky we may get 2 (or more) birds with this one stone.

> Harden TestSimPolicyCloud
> -------------------------
>
>                 Key: SOLR-13045
>                 URL: https://issues.apache.org/jira/browse/SOLR-13045
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: AutoScaling
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Assignee: Jason Gerlowski
>            Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially
> {{testCreateCollectionAddReplica}}, have some flaky behavior, even after
> Mark's recent test-fix commit. This JIRA covers looking into and (hopefully)
> fixing this test failure.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
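
For anyone who wants to see the interleaving from the comment above in isolation, here is a minimal, self-contained sketch. It is not the {{SimClusterStateProvider}} code: the class {{StaleCacheSketch}}, its fields, the {{racePoint}} hook and the helper {{observedLeaderAfterRace}} are all hypothetical names made up for illustration. The hook forces the election thread to start between "Build Map" and "Set cache", so the outcome does not depend on timing. It only illustrates the proposed direction (the election taking the same lock); the actual patch may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the cache/lock pattern described above; not Solr code.
final class StaleCacheSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile boolean leader = false;             // stands in for ReplicaInfo "leader"
    private volatile Map<String, Boolean> cache = null;  // stands in for the cached cluster state

    // Reader path, following the steps listed for the waitForState thread.
    // racePoint is an illustrative hook that runs between "Build Map" and "Set cache".
    Map<String, Boolean> getClusterState(Runnable racePoint) {
        Map<String, Boolean> cached = cache;
        if (cached != null) {
            return cached;                    // return cached value, if any
        }
        lock.lock();                          // grab lock
        try {
            cache = null;                     // clear cache
            Map<String, Boolean> built = new HashMap<>();
            built.put("leader", leader);      // build Map to store in cache
            racePoint.run();                  // the other thread gets scheduled here
            cache = built;                    // set cache with Map
            return built;
        } finally {
            lock.unlock();                    // release lock
        }
    }

    // Election path as described: mutate state and clear the cache without the lock.
    void electLeaderUnlocked() {
        leader = true;
        cache = null;
    }

    // Sketch of the proposed direction: same two steps, but under the shared lock.
    void electLeaderLocked() {
        lock.lock();
        try {
            leader = true;
            cache = null;
        } finally {
            lock.unlock();
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Replays the interleaving and returns the "leader" value a later poll observes.
    public static boolean observedLeaderAfterRace(boolean useLock) {
        StaleCacheSketch s = new StaleCacheSketch();
        Runnable election = useLock ? s::electLeaderLocked : s::electLeaderUnlocked;
        Thread[] holder = new Thread[1];
        s.getClusterState(() -> {
            holder[0] = new Thread(election);
            holder[0].start();
            if (!useLock) {
                join(holder[0]);              // unlocked election completes inside the window
            }
            // With the lock, the election thread cannot proceed until the reader unlocks.
        });
        join(holder[0]);
        return s.getClusterState(() -> { }).get("leader");
    }

    public static void main(String[] args) {
        System.out.println("unlocked election, later poll sees leader: " + observedLeaderAfterRace(false));
        System.out.println("locked election, later poll sees leader:   " + observedLeaderAfterRace(true));
    }
}
```

In the unlocked run the reader overwrites the election's cache clear with a Map built before "leader=true", so every later poll returns the stale value. In the locked run the election cannot enter its critical section until the reader has set the cache and released the lock, so its cache clear lands last and the next poll rebuilds the state.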