[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saswati updated ZOOKEEPER-3909:
-------------------------------
    Summary: Zookeeper Unable to Join the Cluster after it is Restarted; Error: 
"This ZooKeeper instance is not currently serving requests"  (was: Zookeeper 
Unable to Join the Cluster after it is Restarted )

> Zookeeper Unable to Join the Cluster after it is Restarted; Error: "This 
> ZooKeeper instance is not currently serving requests"
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.7
>         Environment: All Environments 
>            Reporter: Saswati
>            Priority: Critical
>
> When we restart a ZooKeeper server, it does not rejoin the cluster or 
> start serving clients. The ZooKeeper service starts successfully, but 
> it stays idle and returns the message: "This ZooKeeper instance is not 
> currently serving requests"
> The ZooKeeper cluster has 5 members. Whenever we need to restart the 
> servers, we restart them one at a time. We restart a server in one of two ways:
>  # stop the service and start it back up again.
>  # stop the service, replace the host, and start it back up again.
> We see the same issue in both cases.
> -----------
> When we investigated the ZooKeeper logs, we saw the following warning:
> "[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN  
> org.apache.zookeeper.server.quorum.Learner - Exception when following the 
> leader
>  java.io.IOException: Leaders epoch, xx is less than 
> accepted epoch, xy"
> -------------------------
> But when we check, the current epoch of the leader is always the same as its 
> accepted epoch, which also matches the epoch of the ZooKeeper server we are 
> trying to bring back into the quorum.
> ------------------------
> Also, when we get the Zxid of every quorum member, they all have the same first 
> byte; only the last two numbers differ, so we assume that they are 
> in sync.
> Somehow the ZooKeeper server that we are restarting sees the epoch as having 
> advanced and shuts down as a follower.
> --------------
> The workaround we currently use for this issue is:
> stop the ZooKeeper service --> rename the current ZooKeeper data directory 
> (version-2) --> start it back up again.
> The server then immediately joins the cluster as a follower, since it no longer 
> has any record of the epoch, and starts serving clients. 
> ----------
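
The Zxid comparison described above can be made more precise. A zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a per-epoch counter, so comparing the leading part of the hex string is roughly a comparison of epochs. The following is a minimal sketch in Python, not taken from the issue; the function names and the sample values are hypothetical.

```python
def split_zxid(zxid: int) -> tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    The high 32 bits are the epoch, the low 32 bits are the counter.
    """
    epoch = (zxid >> 32) & 0xFFFFFFFF
    counter = zxid & 0xFFFFFFFF
    return epoch, counter


def same_epoch(zxids: list[int]) -> bool:
    """Return True if every zxid in the list carries the same epoch."""
    return len({split_zxid(z)[0] for z in zxids}) <= 1


if __name__ == "__main__":
    # Hypothetical zxids as they might be reported by three quorum members.
    samples = [int(s, 16) for s in ("0x500000003", "0x500000007", "0x50000000a")]
    for z in samples:
        epoch, counter = split_zxid(z)
        print(f"zxid={z:#x} epoch={epoch} counter={counter}")
    print("same epoch:", same_epoch(samples))
```

Matching epochs in the zxids only show that the members last committed in the same epoch; the accepted epoch stored on the restarted server is a separate value and may still be higher.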

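The rename step of the workaround above could be scripted along these lines. This is only a sketch under stated assumptions: the function name, the backup suffix, and the example path are hypothetical, stopping and starting the service is left to whatever tooling the host uses, and the rename must only run while the server is stopped. It renames version-2 rather than deleting it, so the old snapshots and epoch files remain available for inspection.

```python
import time
from pathlib import Path


def move_version_dir_aside(data_dir: str) -> Path:
    """Rename <data_dir>/version-2 to a timestamped backup name.

    Only the version-2 subdirectory is moved; other files in data_dir
    (such as myid) are left in place. Returns the backup path.
    """
    src = Path(data_dir) / "version-2"
    if not src.is_dir():
        raise FileNotFoundError(f"no version-2 directory under {data_dir}")
    # Hypothetical naming scheme for the backup directory.
    dst = src.with_name(f"version-2.bak-{int(time.time())}")
    src.rename(dst)
    return dst


if __name__ == "__main__":
    # Hypothetical dataDir; use the value from the server's own zoo.cfg.
    backup = move_version_dir_aside("/var/lib/zookeeper")
    print("moved old data to", backup)
```

If a separate dataLogDir is configured, the transaction logs live in their own version-2 directory, which this sketch does not touch.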


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
