[
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716934#comment-14716934
]
Shalin Shekhar Mangar commented on SOLR-7569:
---------------------------------------------
Thanks Ishan! A few comments:
# nit - RecoverShardTest has an unused notLeader1 variable
# Shouldn't the "Wait for a long time for a steady state" piece of code be
*before* the proxies for the two replicas are reopened? The LIR state will
surely be set at indexing time and only if the proxy is closed. Also if you
move that wait before the proxy is reopened then you are sure to have the LIR
state as 'down'.
# The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' should be done
after force-refreshing the cluster state of the cloudClient; otherwise spurious
failures can happen.
# nit - Why is sendDoc overridden in RecoverShardTest? The minRf is the same; only
the max retries has been increased and the wait between retries decreased.
# The OCMH.recoverShard() isn't unsetting the leader properly. It should be as
simple as:
{code}
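// Enqueue a LEADER message that names only the shard and collection (no leader core),
// so that the Overseer unsets the shard's current leader.
ZkNodeProps m = new ZkNodeProps(
    Overseer.QUEUE_OPERATION, OverseerAction.LEADER.toLower(),
    ZkStateReader.SHARD_ID_PROP, shardId,
    ZkStateReader.COLLECTION_PROP, collection);
Overseer.getInQueue(zkClient).offer(Utils.toJSON(m));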
{code}
# Can you please write a test to ensure that this API works with the 'async'
parameter?
I think some simple scenarios are not being taken care of. This command only
helps if an LIR node exists, but we can do a bit more:
# Leader is live but 'down' -> mark it 'active'
# Leader itself is in LIR -> delete the LIR node
# Leader is not live:
## Replicas are live but 'down' or 'recovering' -> mark them 'active'
## Replicas are live but in LIR -> delete the LIR nodes
Can you please add some tests exercising each of the above scenarios?
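The scenarios above can be sketched as a small decision function. This is only an illustration under stated assumptions: ReplicaView, plan, and the action strings are hypothetical names, not Solr classes or APIs, and the sketch assumes that cases 1 and 2 describe a leader that is still live. A real implementation would read state from ZooKeeper and publish through the Overseer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ForceLeaderPlan {

    /** Hypothetical, simplified view of a replica; this is NOT a Solr class. */
    static final class ReplicaView {
        final String name;
        final boolean live;   // the hosting node is under /live_nodes
        final String state;   // "active", "down" or "recovering"
        final boolean inLir;  // a leader-initiated-recovery node exists for it

        ReplicaView(String name, boolean live, String state, boolean inLir) {
            this.name = name;
            this.live = live;
            this.state = state;
            this.inLir = inLir;
        }
    }

    /** Returns the repair actions for one shard, in the order of the scenarios listed above. */
    static List<String> plan(ReplicaView leader, List<ReplicaView> replicas) {
        List<String> actions = new ArrayList<>();
        if (leader != null && leader.live) {
            // Case 1: leader is live but 'down' -> mark it 'active'
            if ("down".equals(leader.state)) {
                actions.add("mark-active:" + leader.name);
            }
            // Case 2: leader itself is in LIR -> delete the LIR node
            if (leader.inLir) {
                actions.add("delete-lir:" + leader.name);
            }
            return actions;
        }
        // Case 3: leader is not live -> repair the replicas that are live
        for (ReplicaView r : replicas) {
            if (!r.live) {
                continue; // nothing can be done for replicas on dead nodes
            }
            if ("down".equals(r.state) || "recovering".equals(r.state)) {
                actions.add("mark-active:" + r.name);
            }
            if (r.inLir) {
                actions.add("delete-lir:" + r.name);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        ReplicaView deadLeader = new ReplicaView("core1", false, "down", false);
        List<ReplicaView> replicas = Arrays.asList(
                new ReplicaView("core2", true, "recovering", false),
                new ReplicaView("core3", true, "active", true));
        System.out.println(plan(deadLeader, replicas));
    }
}
```

Each branch corresponds to one scenario, so a test per scenario could assert on the returned action list before checking the actual cluster state.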
bq. I also tried to mark just one of the replicas as active instead of all the
replicas, hoping it will become leader and others would recover from it.
However, this resulted in one of the other down replicas becoming leader but
still staying down. Looking into why that could be happening; bug?
Did you find out why/how that happened? If this is reproducible, can you please
create an issue and post the test there?
> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch,
> SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard e.g. all
> replicas' last published state was recovery or due to bugs which cause a
> leader to be marked as 'down'. While the best solution is that they never get
> into this state, we need a manual way to fix this when it does get into this
> state. Right now we can do a dance that involves bouncing the node
> (since recovery paths between bouncing and REQUESTRECOVERY are different),
> but that is difficult when running a large cluster. Although such a manual
> API may lead to some data loss, in some cases it is the only possible option
> to restore availability.
> This issue proposes to build a new collection API which can be used to force
> replicas into recovering a leader while avoiding data loss on a best effort
> basis.