[
https://issues.apache.org/jira/browse/SOLR-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243764#comment-16243764
]
Shalin Shekhar Mangar commented on SOLR-11472:
----------------------------------------------
Here's the sequence of events:
{code}
core_node3 is leader for .system collection
Test starts a new node at port 50071
Node Added Trigger fires and a plan is computed.
action=MOVEREPLICA&collection=.system&targetNode=127.0.0.1:50071_solr&replica=core_node3
is processed first and core_node8 is added on port 50071
but before it recovers fully, the leader node core_node3 is unloaded
core_node6 becomes the leader and asks core_node8 to recover
action=MOVEREPLICA&collection=.system&targetNode=127.0.0.1:50071_solr&replica=core_node6
now core_node6 is to be moved and core_node10 is added on port 50071
but before it can recover, core_node6 is also unloaded
system_shard1_replica_n2 on port 49937 becomes the leader and asks
core_node8 and core_node10 to sync with it
but before they can recover the test stops node 49937.
The NodeLostTrigger fires and tries to create a new replica
But leader election cannot happen because no nodes have any data and/or
none of them were active before.
{code}
The crux of the issue is that MOVEREPLICA unloaded the leader before the newly
added replica became active. Andrzej has already fixed this problem in
SOLR-11448. The leader election issue seen in these logs is a known problem
in SolrCloud; Mark Miller created SOLR-7065 to address the gridlock of leader
election in such cases.
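To make the ordering problem concrete, here is a toy replay of the sequence above. It is only a sketch of the failure mode, not Solr code and not necessarily how SOLR-11448 fixes it. The `Shard` model, `move_replica`, the `wait_for_active` flag, and the node labels `unknown_a` and `unknown_b` are all hypothetical; the replica names and ports come from the log.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    node: str
    active: bool  # True once the replica has recovered and holds data


@dataclass
class Shard:
    """Toy model of one shard; not a Solr API."""
    replicas: dict = field(default_factory=dict)

    def add(self, name, node, active=False):
        self.replicas[name] = Replica(name, node, active)

    def remove(self, name):
        del self.replicas[name]

    def stop_node(self, node):
        # Drop every replica hosted on the stopped node.
        for name in [n for n, r in self.replicas.items() if r.node == node]:
            del self.replicas[name]

    def recover_all(self):
        # Assumption: recovery succeeds only if some active replica exists to sync from.
        if any(r.active for r in self.replicas.values()):
            for r in self.replicas.values():
                r.active = True

    def can_elect_leader(self):
        # Assumption: only a replica that was active can become leader.
        return any(r.active for r in self.replicas.values())


def move_replica(shard, source, new_name, target_node, wait_for_active):
    """Add the new replica, optionally wait for recovery, then unload the source."""
    shard.add(new_name, target_node, active=False)
    if wait_for_active:
        shard.recover_all()
    shard.remove(source)


def replay(wait_for_active):
    shard = Shard()
    # Hosts of core_node3 and core_node6 are not given in the log; labels are placeholders.
    shard.add("core_node3", "unknown_a", active=True)
    shard.add("core_node6", "unknown_b", active=True)
    shard.add("system_shard1_replica_n2", "49937", active=True)
    move_replica(shard, "core_node3", "core_node8", "50071", wait_for_active)
    move_replica(shard, "core_node6", "core_node10", "50071", wait_for_active)
    shard.stop_node("49937")
    return shard.can_elect_leader()
```

In this model, `replay(wait_for_active=False)` ends with only never-active replicas on port 50071, so no leader can be elected, while `replay(wait_for_active=True)` leaves electable replicas behind.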
I'll audit Jenkins again to see if this test has failed since SOLR-11448 was
committed. If not, then I'll close this issue.
> Leader election bug
> -------------------
>
> Key: SOLR-11472
> URL: https://issues.apache.org/jira/browse/SOLR-11472
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Affects Versions: 7.1, master (8.0)
> Reporter: Andrzej Bialecki
> Assignee: Shalin Shekhar Mangar
> Attachments:
> Console_output_of_AutoscalingHistoryHandlerTest_failure.txt
>
>
> SOLR-11407 uncovered a bug in leader election, where the same failing node is
> retried indefinitely.