[
https://issues.apache.org/jira/browse/SOLR-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889416#comment-13889416
]
Christine Poerschke commented on SOLR-5593:
-------------------------------------------
Uploaded https://github.com/apache/lucene-solr/pull/27. Rather than relaxing
the error handling for the getLeaderRetry call, it tries to avoid the call
altogether when circumstances seem to permit it, i.e. when the request said it
came from the leader, we don't think we are the leader, and we could not be a
sub-shard leader.
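
For illustration only, the condition above can be written as a single
predicate. This is a minimal sketch and not the code in the pull request: the
class name, method name and the three boolean parameters are hypothetical
stand-ins for whatever the update processor actually inspects.

```java
// Illustrative sketch only; names are hypothetical, not Solr's API.
final class LeaderLookupSketch {

    // Returns true when the getLeaderRetry lookup could be skipped:
    // the request says it came from the leader, this node does not
    // believe it is the leader, and it could not be a sub-shard leader.
    static boolean canSkipLeaderLookup(boolean requestClaimsFromLeader,
                                       boolean weThinkWeAreLeader,
                                       boolean couldBeSubShardLeader) {
        return requestClaimsFromLeader
                && !weThinkWeAreLeader
                && !couldBeSubShardLeader;
    }

    public static void main(String[] args) {
        // A forwarded update arriving at a plain follower.
        System.out.println(canSkipLeaderLookup(true, false, false));
        // The same update arriving at a node that might lead a sub-shard.
        System.out.println(canSkipLeaderLookup(true, false, true));
    }
}
```

If any one of the three conditions fails, the sketch falls back to the
existing lookup path, so the shortcut only applies in the narrow case
described.
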
> shard leader loss due to ZK session expiry
> ------------------------------------------
>
> Key: SOLR-5593
> URL: https://issues.apache.org/jira/browse/SOLR-5593
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Christine Poerschke
> Assignee: Mark Miller
> Fix For: 5.0, 4.7
>
> Attachments: CoreAdminHandler.patch
>
>
> The problem we saw was that the shard leader ceased to be shard leader (in
> our case because its ZooKeeper session expired). The followers then rejected
> update requests (DistributedUpdateProcessor setupRequest's call to
> ZkStateReader getLeaderRetry), and the leader asked them to recover
> (DistributedUpdateProcessor doFinish). The followers published themselves as
> recovering (CoreAdminHandler handleRequestRecoveryAction), and the shard
> leader loss triggered an election in which none of the followers became
> leader because of their recovering state (ShardLeaderElectionContext
> shouldIBeLeader). The former shard leader did not become shard leader either,
> because its new sequence number placed it after the existing replicas
> (LeaderElector checkIfIamLeader seq <= intSeqs.get(0)).
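
The stalled election described in the issue can be reproduced with a toy
model. This is a heavily simplified sketch and not Solr's election code: the
Candidate type, the electLeader method and the rule "head of the queue and not
recovering" are assumptions made for illustration. Only the comparison
seq <= intSeqs.get(0) is taken from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the stalled election; all names are hypothetical.
final class ElectionStallSketch {

    static final class Candidate {
        final String name;
        final int seq;            // election sequence number
        final boolean recovering; // published state
        Candidate(String name, int seq, boolean recovering) {
            this.name = name;
            this.seq = seq;
            this.recovering = recovering;
        }
    }

    // Mirrors the quoted check: only the lowest sequence number passes.
    static boolean isHeadOfQueue(int seq, List<Integer> allSeqs) {
        List<Integer> intSeqs = new ArrayList<>(allSeqs);
        Collections.sort(intSeqs);
        return seq <= intSeqs.get(0);
    }

    // Assumed rule: a candidate leads only if it is at the head of the
    // queue and is not recovering. Returns null when nobody qualifies.
    static String electLeader(List<Candidate> candidates) {
        List<Integer> seqs = new ArrayList<>();
        for (Candidate c : candidates) {
            seqs.add(c.seq);
        }
        for (Candidate c : candidates) {
            if (isHeadOfQueue(c.seq, seqs) && !c.recovering) {
                return c.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Followers hold the older sequence numbers but are recovering;
        // the former leader rejoined with a newer sequence number.
        List<Candidate> shard = Arrays.asList(
                new Candidate("followerA", 1, true),
                new Candidate("followerB", 2, true),
                new Candidate("formerLeader", 3, false));
        System.out.println(electLeader(shard)); // no candidate qualifies
    }
}
```

In this model the only healthy node sits behind recovering nodes in the
queue, so no candidate satisfies both conditions.
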
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)