[ https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713023#comment-16713023 ]
Jason Gerlowski commented on SOLR-13045:
----------------------------------------
I believe I found the race condition causing these failures. It looks like an
issue between the {{waitForState}} polling, which occurs in the main test
thread, and the leader-election execution, which occurs in a {{Future}}
submitted to {{SimCloudManager}}'s ExecutorService.
The {{waitForState}} thread repeatedly asks for the cluster state. Each request
goes through steps that look a bit like this:
* [Return cached value, if any; otherwise
continue|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2090]
* [Grab
lock|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2093]
* [Clear
cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2094]
* [Build Map to store in
cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2126]
* [Set cache with
Map|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2141]
* [Release
lock|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2144]
The Leader Election Future looks a bit like this:
* [Give a ReplicaInfo
"leader=true"|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L756]
* [Clear
cache|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L766]
Note that the leader election Future does this without acquiring the lock. Now
imagine the following interleaving of these two threads:
* [Thread-Test] Grab lock
* [Thread-Test] Clear cache
* [Thread-Test] Build Map to store in cache
* [Thread-LeaderElection] Give ReplicaInfo "leader=true"
* [Thread-LeaderElection] Clear cache
* [Thread-Test] Set cache with Map
At the end of this interleaving the cache holds a Map that was built before the
"leader=true" change, and nothing will ever clear it. So the {{waitForState}}
polling keeps seeing the stale state and eventually fails.
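The interleaving above can be replayed deterministically in a small standalone sketch. This is not the {{SimClusterStateProvider}} code: the class, its fields, and the method names are hypothetical, the cluster state is reduced to a single per-replica leader flag, and the two latches exist only to force the ordering described in the list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-in for the cached cluster state.
// None of these names come from SimClusterStateProvider.
final class StaleCacheSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, Boolean> leaderFlags = new HashMap<>(); // source of truth
    private volatile Map<String, Boolean> cache;                      // null = nothing cached

    // Latches exist only to force the problematic interleaving deterministically.
    private final CountDownLatch snapshotBuilt = new CountDownLatch(1);
    private final CountDownLatch electionDone = new CountDownLatch(1);

    StaleCacheSketch() {
        leaderFlags.put("replica1", false);
    }

    // Mirrors the polling steps: cached value, lock, clear, build, set, unlock.
    Map<String, Boolean> getClusterState() {
        Map<String, Boolean> cached = cache;
        if (cached != null) {
            return cached;
        }
        lock.lock();
        try {
            cache = null;
            Map<String, Boolean> snapshot = new HashMap<>(leaderFlags);
            snapshotBuilt.countDown();   // let the election run now...
            await(electionDone);         // ...and finish before the cache is set
            cache = snapshot;            // stale: built before leader=true
            return snapshot;
        } finally {
            lock.unlock();
        }
    }

    // Mirrors the leader-election Future: mutate, then clear the cache, no lock.
    void electLeaderWithoutLock() {
        await(snapshotBuilt);
        leaderFlags.put("replica1", true);
        cache = null;
        electionDone.countDown();
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // True when the real state says leader=true but the cache still says false.
    static boolean cacheIsStaleAfterElection() {
        StaleCacheSketch s = new StaleCacheSketch();
        Thread election = new Thread(s::electLeaderWithoutLock);
        election.start();
        s.getClusterState();
        try {
            election.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        boolean actual = s.leaderFlags.get("replica1");
        boolean cachedView = s.getClusterState().get("replica1");
        return actual && !cachedView;
    }

    public static void main(String[] args) {
        System.out.println("cache stale after election: " + cacheIsStaleAfterElection());
    }
}
```

In this sketch the second {{getClusterState()}} call returns the cached Map without ever reaching the lock, which is why the stale value survives.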
We should be able to fix this by having the leader election code use the same
Lock used elsewhere. I've actually got this change staged locally and am
running tests on it currently. If all looks well I should have this uploaded
soon. One thing I'll be curious to see is whether this affects any of the other
TestSim* failures we've seen recently. If we're lucky we may get 2 (or more)
birds with this one stone.
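As a rough illustration of the proposed fix, here is a standalone sketch in which the election path takes the same lock as the cache rebuild. It is not the staged patch: the class and method names are hypothetical, the lock is assumed to be a {{ReentrantLock}}, and the cluster state is reduced to one leader flag per replica.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model; names are illustrative only.
final class LockedElectionSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, Boolean> leaderFlags = new HashMap<>(); // guarded by lock
    private volatile Map<String, Boolean> cache;                      // null = nothing cached

    // Only used to start the election while a rebuild is in progress.
    private final CountDownLatch snapshotBuilt = new CountDownLatch(1);

    LockedElectionSketch() {
        leaderFlags.put("replica1", false);
    }

    // Same rebuild steps as before: cached value, lock, clear, build, set, unlock.
    Map<String, Boolean> getClusterState() {
        Map<String, Boolean> cached = cache;
        if (cached != null) {
            return cached;
        }
        lock.lock();
        try {
            cache = null;
            Map<String, Boolean> snapshot = new HashMap<>(leaderFlags);
            snapshotBuilt.countDown(); // the election may try to run from here on
            cache = snapshot;
            return snapshot;
        } finally {
            lock.unlock();
        }
    }

    // The election now mutates state and clears the cache under the shared lock,
    // so it cannot land between "build Map" and "set cache" of a rebuild.
    void electLeaderWithLock() {
        try {
            snapshotBuilt.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        lock.lock(); // blocks while a rebuild holds the lock
        try {
            leaderFlags.put("replica1", true);
            cache = null;
        } finally {
            lock.unlock();
        }
    }

    // True when a read issued after the election observes leader=true.
    static boolean readerSeesLeaderAfterElection() {
        LockedElectionSketch s = new LockedElectionSketch();
        Thread election = new Thread(s::electLeaderWithLock);
        election.start();
        s.getClusterState(); // first rebuild, races with the election
        try {
            election.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.getClusterState().get("replica1");
    }

    public static void main(String[] args) {
        System.out.println("reader sees leader: " + readerSeesLeaderAfterElection());
    }
}
```

Because the cache clear happens under the lock, it is ordered either before a rebuild starts or after that rebuild has stored its Map, so the next read rebuilds from the updated state.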
> Harden TestSimPolicyCloud
> -------------------------
>
> Key: SOLR-13045
> URL: https://issues.apache.org/jira/browse/SOLR-13045
> Project: Solr
> Issue Type: Test
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling
> Affects Versions: master (8.0)
> Reporter: Jason Gerlowski
> Assignee: Jason Gerlowski
> Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially
> {{testCreateCollectionAddReplica}}, have some flaky behavior, even after
> Mark's recent test-fix commit. This JIRA covers looking into and (hopefully)
> fixing this test failure.