Yohann Callea created SOLR-17331:
------------------------------------
Summary:
MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky
Key: SOLR-17331
URL: https://issues.apache.org/jira/browse/SOLR-17331
Project: Solr
Issue Type: Test
Security Level: Public (Default Security Level. Issues are Public)
Components: SolrCloud
Reporter: Yohann Callea
The test *_MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget_*
sometimes fails (< 3% failure rate) on its last assertion, as shown by the
[trend history of test
failures|http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget].
This test spins up a 5-node cluster and creates a collection with 3 shards and
a replication factor of 2.
It then vacates 2 randomly chosen nodes using the Migrate Replicas command and,
once the migration completes, expects the vacated nodes to be assigned no
replicas and the 6 replicas to be evenly spread across the 3 non-vacated nodes
(i.e., 2 replicas positioned on each node).
However, this last assertion sometimes fails because the replicas are not
always evenly spread over the 3 non-vacated nodes.
{code:java}
The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas
after the migration expected:<2> but was:<1> {code}
A closer look at a failing run shows that this test is inherently expected to
fail under some circumstances, given how the Migrate Replicas command
operates.
When migrating replicas, the new positions of the replicas to be moved are
calculated sequentially and, for each successive move, the position is
decided according to the logic implemented by the currently configured replica
placement plugin.
We can therefore end up in the following situation.
h2. Failing scenario
Let's assume the following initial state, after the collection creation.
Note that this test always uses the default replica placement strategy, which
is Simple as of today.
{code:java}
        | NODE_0  | NODE_1  | NODE_2  | NODE_3  | NODE_4  |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |         |         |    X    |         |
SHARD_2 |         |    X    |         |    X    |         |
SHARD_3 |         |         |    X    |         |    X    | {code}
The test now runs the migrate command to vacate *_NODE_3_* and {*}_NODE_4_{*}.
It therefore needs to perform 3 replica movements to empty these two
nodes.
h4. Move 1
We are moving the replica of *_SHARD_1_* positioned on {*}_NODE_3_{*}.
_*NODE_0*_ is not an eligible destination for this replica as this node is
already assigned a replica of {*}_SHARD_1_{*}, and both *_NODE_1_* and
_*NODE_2*_ can be chosen as they host the same number of replicas.
*_NODE_1_* is arbitrarily chosen amongst the two best candidate destination
nodes.
{code:java}
        | NODE_0  | NODE_1  | NODE_2  | NODE_3  | NODE_4  |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |    X    |         |         |         |
SHARD_2 |         |    X    |         |    X    |         |
SHARD_3 |         |         |    X    |         |    X    | {code}
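The tie observed in Move 1 can be reproduced with a small model of a single placement decision. This is only a sketch of the rule as described in this issue (a node is ineligible if it already hosts a replica of the shard, and the best candidates are the eligible nodes hosting the fewest replicas). It is not the actual Solr placement plugin code, and the class and method names ({{PlacementSketch}}, {{bestCandidates}}) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of one placement decision; not Solr's implementation.
final class PlacementSketch {

  /**
   * Returns every node tied for "best destination" for a replica of the given shard:
   * nodes already hosting that shard are skipped, and among the rest only those
   * hosting the fewest replicas are kept.
   */
  static List<String> bestCandidates(
      String shard, List<String> targetNodes, Map<String, Set<String>> shardsByNode) {
    int fewest = Integer.MAX_VALUE;
    List<String> best = new ArrayList<>();
    for (String node : targetNodes) {
      Set<String> hosted = shardsByNode.getOrDefault(node, Set.of());
      if (hosted.contains(shard)) {
        continue; // node already has a replica of this shard
      }
      if (hosted.size() < fewest) {
        fewest = hosted.size(); // strictly better node found, drop previous ties
        best.clear();
      }
      if (hosted.size() == fewest) {
        best.add(node);
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Non-vacated nodes in the initial state of the scenario.
    Map<String, Set<String>> state = Map.of(
        "NODE_0", Set.of("SHARD_1"),
        "NODE_1", Set.of("SHARD_2"),
        "NODE_2", Set.of("SHARD_3"));
    List<String> targets = List.of("NODE_0", "NODE_1", "NODE_2");
    // Move 1: the SHARD_1 replica leaving NODE_3.
    System.out.println(bestCandidates("SHARD_1", targets, state)); // [NODE_1, NODE_2]
  }
}
```

Whichever of the tied nodes is picked here determines which candidates are available for the following moves.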
h4. Move 2
We are moving the replica of *_SHARD_2_* positioned on {*}_NODE_3_{*}.
_*NODE_1*_ is not an eligible destination for this replica as this node is
already assigned a replica of {*}_SHARD_2_{*}, and both *_NODE_0_* and
_*NODE_2*_ can be chosen as they host the same number of replicas.
*_NODE_0_* is arbitrarily chosen amongst the two best candidate destination
nodes.
{code:java}
        | NODE_0  | NODE_1  | NODE_2  | NODE_3  | NODE_4  |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |    X    |         |         |         |
SHARD_2 |    X    |    X    |         |         |         |
SHARD_3 |         |         |    X    |         |    X    |{code}
h4. Move 3
We are moving the replica of *_SHARD_3_* positioned on {*}_NODE_4_{*}.
_*NODE_2*_ is not an eligible destination for this replica as this node is
already assigned a replica of {*}_SHARD_3_{*}, and both *_NODE_0_* and
_*NODE_1*_ can be chosen as they host the same number of replicas.
*_NODE_1_* is arbitrarily chosen amongst the two best candidate destination
nodes.
{code:java}
        | NODE_0  | NODE_1  | NODE_2  | NODE_3  | NODE_4  |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |    X    |         |         |         |
SHARD_2 |    X    |    X    |         |         |         |
SHARD_3 |         |    X    |    X    |         |         |{code}
The test will then fail as the replicas are not evenly positioned across the
non-vacated nodes (*_NODE_1_* hosts 3 replicas and *_NODE_2_* only 1), even
though this is arguably the expected outcome given the current Simple
placement strategy implementation.
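To see which final layouts are reachable from this initial state, the three moves can be enumerated over every possible tie-break. The sketch below is a simplified model under stated assumptions, not Solr code: it applies the rule as described in this issue (skip nodes already hosting the shard, then pick among the nodes with the fewest replicas), it processes the moves in the order used above, and the class name {{MigrationOutcomes}} is hypothetical. It says nothing about how often each tie-break is actually taken in a real run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the scenario; not Solr's implementation.
final class MigrationOutcomes {

  static final List<String> TARGETS = List.of("NODE_0", "NODE_1", "NODE_2");
  // Shards whose replicas leave the vacated nodes, in the order of Moves 1 to 3.
  static final List<String> MOVES = List.of("SHARD_1", "SHARD_2", "SHARD_3");

  /** Shards hosted by each non-vacated node right after collection creation. */
  static Map<String, Set<String>> initialState() {
    Map<String, Set<String>> state = new LinkedHashMap<>();
    state.put("NODE_0", new LinkedHashSet<>(Set.of("SHARD_1")));
    state.put("NODE_1", new LinkedHashSet<>(Set.of("SHARD_2")));
    state.put("NODE_2", new LinkedHashSet<>(Set.of("SHARD_3")));
    return state;
  }

  /** Explores every tie-break choice and records the final replica count per node. */
  static void explore(int move, Map<String, Set<String>> state, List<List<Integer>> outcomes) {
    if (move == MOVES.size()) {
      List<Integer> counts = new ArrayList<>();
      for (String node : TARGETS) {
        counts.add(state.get(node).size());
      }
      outcomes.add(counts);
      return;
    }
    String shard = MOVES.get(move);
    // Fewest replicas hosted by any eligible node for this shard.
    int fewest = Integer.MAX_VALUE;
    for (String node : TARGETS) {
      Set<String> hosted = state.get(node);
      if (!hosted.contains(shard)) {
        fewest = Math.min(fewest, hosted.size());
      }
    }
    // Branch on each node tied for best destination, then backtrack.
    for (String node : TARGETS) {
      Set<String> hosted = state.get(node);
      if (hosted.contains(shard) || hosted.size() != fewest) {
        continue;
      }
      hosted.add(shard);
      explore(move + 1, state, outcomes);
      hosted.remove(shard);
    }
  }

  static List<List<Integer>> allOutcomes() {
    List<List<Integer>> outcomes = new ArrayList<>();
    explore(0, initialState(), outcomes);
    return outcomes;
  }

  public static void main(String[] args) {
    for (List<Integer> counts : allOutcomes()) {
      boolean even = counts.stream().allMatch(c -> c == 2);
      System.out.println(counts + (even ? " even" : " uneven"));
    }
  }
}
```

Under these assumptions the model reaches four sequences of choices, two of which end with an uneven spread; the layout shown after Move 3 corresponds to the counts {{[2, 3, 1]}}.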