[jira] [Commented] (SOLR-5593) shard leader loss due to ZK session expiry

Mark Miller (JIRA) Tue, 31 Dec 2013 12:45:17 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859683#comment-13859683
 ]


Mark Miller commented on SOLR-5593:
-----------------------------------

bq. Yes, we are working on changes to DistributedUpdateProcessor to relax the 
requirement for the getLeaderRetry to succeed within setupRequest (if phase is 
DistribPhase.FROMLEADER and the shard state shows it could not be 
subShardLeader then getLeaderRetry success should be optional).

Yeah, on some thought, I think this is the right approach. Removing the publish 
is probably not a good idea. It actually protects us from losing data: we don't 
want a replica that was asked to recover to become the leader, because it may 
be missing updates that were accepted and that it is expected to have. If the 
previous leader died before one of the replicas became leader, that previous 
leader might have been ahead. In this case we don't choose a new leader, 
because you should really reboot the whole shard with all the replicas you can 
to avoid any possible data loss.
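
For illustration only, the relaxation quoted at the top of this comment could be read as a small predicate: a failed leader lookup still rejects the update unless the request is in the FROMLEADER phase and the shard state rules out a sub-shard leader as the sender. This is a rough sketch of that reading, not the actual patch. The class, the method name leaderLookupMustSucceed, the boolean parameter, and the local DistribPhase enum are all hypothetical stand-ins.

```java
/** Hypothetical sketch; none of these names are Solr's actual API. */
final class LeaderLookupSketch {

    /** Local stand-in for the update phase mentioned in the quoted comment. */
    enum DistribPhase { NONE, TOLEADER, FROMLEADER }

    /**
     * Returns true when a failed leader lookup should still reject the update.
     * Assumption: the lookup becomes optional only for FROMLEADER requests
     * whose sender, according to the shard state, cannot be a sub-shard leader.
     */
    static boolean leaderLookupMustSucceed(DistribPhase phase,
                                           boolean couldBeFromSubShardLeader) {
        boolean optional = phase == DistribPhase.FROMLEADER && !couldBeFromSubShardLeader;
        return !optional;
    }

    public static void main(String[] args) {
        // Forwarded by a leader, sender cannot be a sub-shard leader: lookup is optional.
        System.out.println(leaderLookupMustSucceed(DistribPhase.FROMLEADER, false));
        // Forwarded by a leader, sender might be a sub-shard leader: lookup stays mandatory.
        System.out.println(leaderLookupMustSucceed(DistribPhase.FROMLEADER, true));
        // Any other phase: lookup stays mandatory.
        System.out.println(leaderLookupMustSucceed(DistribPhase.TOLEADER, false));
    }
}
```

Keeping the condition in one predicate makes it easy to see that every other phase keeps today's strict behaviour.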

> shard leader loss due to ZK session expiry
> ------------------------------------------
>
>                 Key: SOLR-5593
>                 URL: https://issues.apache.org/jira/browse/SOLR-5593
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Christine Poerschke
>            Assignee: Mark Miller
>             Fix For: 5.0, 4.7, 4.6.1
>
>         Attachments: CoreAdminHandler.patch
>
>
> The problem we saw was that the shard leader ceased to be shard leader (in 
> our case due to its zookeeper session expiring). The followers thus rejected 
> update requests (DistributedUpdateProcessor setupRequest's call to 
> ZkStateReader getLeaderRetry) and the leader asked them to recover 
> (DistributedUpdateProcessor doFinish). The followers published themselves as 
> recovering (CoreAdminHandler handleRequestRecoveryAction) and the shard 
> leader loss triggered an election in which none of the followers became the 
> leader due to their recovering state (ShardLeaderElectionContext 
> shouldIBeLeader). The former shard leader also did not become shard leader 
> because its new seq number placed it after the existing replicas 
> (LeaderElector checkIfIamLeader seq <= intSeqs.get(0)).
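
The sequence in the description can be reproduced with a toy model. It applies only the two checks the report names: a candidate must be first in line by sequence number, and it must not be in the recovering state. This is a simplified sketch and not Solr's election code. Candidate, wouldLead and leaders are hypothetical names, and whether a declined candidate rejoins the queue is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the stuck election; not Solr's LeaderElector. */
final class ElectionSketch {

    /** Hypothetical stand-in for one replica's entry in the election queue. */
    static final class Candidate {
        final String name;
        final int seq;            // election sequence number, lower means earlier in line
        final boolean recovering; // true if the replica published itself as recovering

        Candidate(String name, int seq, boolean recovering) {
            this.name = name;
            this.seq = seq;
            this.recovering = recovering;
        }
    }

    /** A candidate leads only if it is first in line and is not recovering. */
    static boolean wouldLead(Candidate c, List<Candidate> all) {
        int lowestSeq = Integer.MAX_VALUE;
        for (Candidate other : all) {
            lowestSeq = Math.min(lowestSeq, other.seq);
        }
        boolean firstInLine = c.seq <= lowestSeq;
        return firstInLine && !c.recovering;
    }

    /** Names of all candidates that pass both checks. */
    static List<String> leaders(List<Candidate> all) {
        List<String> result = new ArrayList<>();
        for (Candidate c : all) {
            if (wouldLead(c, all)) {
                result.add(c.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Followers keep their early sequence numbers but are recovering;
        // the former leader is healthy but rejoined with a later number.
        List<Candidate> shard = Arrays.asList(
                new Candidate("follower1", 1, true),
                new Candidate("follower2", 2, true),
                new Candidate("formerLeader", 3, false));
        System.out.println(leaders(shard)); // no candidate passes both checks
    }
}
```

In this model the shard stays leaderless until either a follower leaves the recovering state or the former leader ends up first in line.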



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

