[jira] [Updated] (ZOOKEEPER-3909) Zookeeper Unable to Join the Cluster after it is Restarted

Saswati (Jira) Fri, 07 Aug 2020 10:12:29 -0700


     [ https://issues.apache.org/jira/browse/ZOOKEEPER-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Saswati updated ZOOKEEPER-3909:
-------------------------------
    Description: 
When we restart a ZooKeeper server, it does not rejoin the cluster and start 
serving clients. The ZooKeeper service starts successfully, but it stays idle 
and reports the message: "This ZooKeeper instance is not currently 
serving requests"

The ZooKeeper cluster has 5 members. Whenever we need to restart the servers, 
we restart them one at a time. We restart a server in one of two ways:
 # stop the service and start it again.
 # stop the service, replace the host, and start it again.

In both cases we see the same issue.

-----------

When we investigated the ZooKeeper logs, we found the following warning:

"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN  
org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
 java.io.IOException: Leaders epoch, xx is less than accepted 
epoch, xy"
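
The warning above comes from the epoch comparison a follower makes when it registers with the leader. The following is a minimal sketch of that comparison as a simplified model, not ZooKeeper's actual code; the function name and the returned labels are hypothetical.

```python
def check_leader_epoch(leader_epoch: int, accepted_epoch: int) -> str:
    """Model a follower comparing the leader's epoch with its own accepted epoch."""
    if leader_epoch > accepted_epoch:
        # The leader proposes a newer epoch: the follower can accept it.
        return "accept-new-epoch"
    if leader_epoch == accepted_epoch:
        # The follower has already accepted this epoch.
        return "already-accepted"
    # The leader is behind the follower's accepted epoch: refuse to follow.
    raise IOError(
        f"Leaders epoch, {leader_epoch} is less than accepted epoch, {accepted_epoch}"
    )
```

In this model the follower gives up only when its locally stored accepted epoch is strictly greater than the epoch the leader offers.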

-------------------------

However, when we check, the current epoch of the leader is always the same as 
its accepted epoch, and both match the values on the ZooKeeper server we are 
trying to bring back into the quorum.
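
One way to make this comparison repeatable is to read the two epoch files on each server. The sketch below assumes that the files are named currentEpoch and acceptedEpoch, that they live in the version-2 directory, and that each holds a single decimal number; the helper name read_epochs is hypothetical.

```python
from pathlib import Path


def read_epochs(version_dir):
    """Return (current_epoch, accepted_epoch) read from the given version-2 directory."""
    directory = Path(version_dir)
    # Assumption: each file contains one decimal number as text.
    current = int((directory / "currentEpoch").read_text().strip())
    accepted = int((directory / "acceptedEpoch").read_text().strip())
    return current, accepted
```

Running this against the restarted server and against the leader would show directly whether the restarted server's accepted epoch is ahead of the leader's.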

------------------------

Also, when we get the Zxid of every quorum member, they all have the same first 
byte; only the last two digits differ, so we assume that the members are 
in sync.
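
A zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a counter, so splitting it is a more direct check than comparing leading digits by eye. A minimal sketch follows; the function names are hypothetical, and the input is assumed to be a hexadecimal string such as 0x500000012 (a made-up value).

```python
def split_zxid(zxid: int):
    """Split a 64-bit zxid into (epoch, counter)."""
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


def parse_zxid(text: str):
    """Parse a hexadecimal zxid string, with or without a 0x prefix."""
    return split_zxid(int(text, 16))
```

Members whose zxids share the same epoch half but differ in the counter half are in the same epoch and differ only in how many transactions they have seen.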

Somehow the ZooKeeper server that we are restarting sees the epoch advance and 
shuts down as a follower.

--------------

Our current workaround for this issue is:

stop the ZooKeeper service --> rename the current ZooKeeper data directory 
(version-2) --> start it back up again.

The server then immediately joins the cluster as a follower, because it no 
longer has any record of the epoch, and starts serving clients. 
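
The rename step of this workaround could be scripted roughly as follows. This is only a sketch: it assumes the server has already been stopped, it does not stop or start any service itself, and the function name and the backup naming scheme are hypothetical.

```python
import time
from pathlib import Path


def set_aside_version_dir(data_dir, suffix=None):
    """Rename <data_dir>/version-2 to a backup name and return the new path."""
    src = Path(data_dir) / "version-2"
    if suffix is None:
        # Timestamp the backup so repeated runs do not collide.
        suffix = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"version-2.bak-{suffix}")
    if dst.exists():
        raise FileExistsError(dst)
    src.rename(dst)
    return dst
```

Renaming rather than deleting keeps the old snapshots and epoch files available for later inspection.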

----------

  was:
When we restart a ZooKeeper server, it does not rejoin the cluster and start 
serving clients. The ZooKeeper service starts successfully, but it stays idle 
and reports the message: "This ZooKeeper instance is not currently 
serving requests"

The ZooKeeper cluster has 5 members. Whenever we need to restart the servers, 
we restart them one at a time. We restart a server in one of two ways:
 # stop the service and start it again.
 # stop the service, replace the host, and start it again.

In both cases we see the same issue.

-----------

When we investigated the ZooKeeper logs, we found the following warning:

"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN  
org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
 java.io.IOException: Leaders epoch, xx is less than accepted 
epoch, xy"

-------------------------

However, when we check, the current epoch of the leader is always the same as 
its accepted epoch.

------------------------

Also, when we get the Zxid of every quorum member, they all have the same first 
byte; only the last two digits differ, so we assume that the members are 
in sync.

Somehow the ZooKeeper server that we are restarting sees the epoch advance and 
shuts down as a follower.

--------------

Our current workaround for this issue is:

stop the ZooKeeper service --> rename the current ZooKeeper data directory 
(version-2) --> start it back up again.

The server then immediately joins the cluster as a follower, because it no 
longer has any record of the epoch, and starts serving clients. 

----------


> Zookeeper Unable to Join the Cluster after it is Restarted 
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-3909
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.7
>         Environment: All Environments 
>            Reporter: Saswati
>            Priority: Critical



