[
https://issues.apache.org/jira/browse/ZOOKEEPER-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saswati updated ZOOKEEPER-3909:
-------------------------------
Description:
When we restart a ZooKeeper server, it does not successfully rejoin the cluster
and start serving clients. The ZooKeeper service starts successfully, but it
stays idle and reports the message: "This ZooKeeper instance is not currently
serving requests"
The ZooKeeper cluster size is 5. Whenever we need to restart the servers, we
restart them one at a time. There are two ways we restart a server:
# just stop the service and start it back up again.
# stop the service, replace the host, and start it back up again.
In both cases we see the same issue.
-----------
When we investigated the ZooKeeper logs, we saw the following warning:
"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN
org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
java.io.IOException: Leaders epoch, xx is less than accepted epoch, xy"
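For readers unfamiliar with this warning, here is a minimal sketch of the kind of comparison that produces it. It is a simplified stand-in, not the actual Learner source: the class and method names are hypothetical, and only the message text is taken from the log above.

```java
import java.io.IOException;

/** Simplified stand-in for the follower-side epoch check (hypothetical class, not ZooKeeper code). */
final class EpochCheckSketch {

    /**
     * Returns the epoch the follower would hold as accepted after hearing from the leader,
     * or throws if the leader's epoch is behind the epoch the follower already accepted.
     */
    static long checkLeaderEpoch(long leaderEpoch, long acceptedEpoch) throws IOException {
        if (leaderEpoch < acceptedEpoch) {
            // Same wording as the warning quoted in the report.
            throw new IOException("Leaders epoch, " + leaderEpoch
                    + " is less than accepted epoch, " + acceptedEpoch);
        }
        // Equal or newer: the follower can go along with the leader's epoch.
        return leaderEpoch;
    }

    public static void main(String[] args) {
        try {
            // Hypothetical values: the follower has accepted epoch 7, the leader proposes 6.
            checkLeaderEpoch(6L, 7L);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, the warning means the restarted server has a higher accepted epoch on disk than the epoch the leader is offering.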
-------------------------
But when we check, the current epoch of the leader is always the same as its
accepted epoch, which also matches that of the ZooKeeper server we are trying
to bring back into the quorum.
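One way to compare these values on each host is to read the epoch files directly. The sketch below assumes the epochs are stored as decimal text in files named currentEpoch and acceptedEpoch inside the version-2 directory, which is my understanding of the 3.5.x layout. The class name and the command-line argument are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that prints the epochs stored in a version-2 directory. */
final class EpochFiles {

    /** Reads one epoch file, assumed to hold a single decimal number. */
    static long readEpoch(Path versionDir, String fileName) throws IOException {
        byte[] raw = Files.readAllBytes(versionDir.resolve(fileName));
        return Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the version-2 directory of the server being checked.
        Path versionDir = Paths.get(args[0]);
        long current = readEpoch(versionDir, "currentEpoch");
        long accepted = readEpoch(versionDir, "acceptedEpoch");
        System.out.println("currentEpoch=" + current + " acceptedEpoch=" + accepted);
    }
}
```

Running this against the leader and against the server that fails to join would show whether the on-disk accepted epoch of the restarted server is really the same as the leader's.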
------------------------
Also, when we get the zxid of every quorum member, they all have the same first
byte; only the last two numbers change, so we assume that they are in sync.
Somehow the ZooKeeper server that we are restarting sees the epoch as having
advanced and shuts down as a follower.
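For context on the zxid comparison: a zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a counter within that epoch. The sketch below splits a zxid into those two parts; the class name and the sample values are hypothetical.

```java
/** Hypothetical helper that splits a zxid into its epoch and counter halves. */
final class ZxidParts {

    /** High 32 bits: the epoch. */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Low 32 bits: the counter within the epoch. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** True when both zxids were issued in the same epoch. */
    static boolean sameEpoch(long a, long b) {
        return epochOf(a) == epochOf(b);
    }

    public static void main(String[] args) {
        // Made-up zxids for illustration only.
        long onLeader = 0x500000003L;
        long onFollower = 0x5000000ffL;
        System.out.println("leader epoch=" + epochOf(onLeader) + " counter=" + counterOf(onLeader));
        System.out.println("same epoch: " + sameEpoch(onLeader, onFollower));
    }
}
```

Comparing the epoch half across members is probably what the observation above amounts to; the counter half is expected to differ slightly while writes are in flight.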
--------------
The workaround we currently have for this issue is:
stop the ZooKeeper service --> rename the current ZooKeeper data directory
(version-2) --> start it back up again.
The server then immediately joins the cluster as a follower, since it no longer
has any record of the epoch, and starts serving clients.
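The rename step of this workaround can be sketched as follows, assuming the service has already been stopped. The class name, method name and backup suffix are hypothetical; the point is only that version-2 is moved aside rather than deleted, so the old data can still be inspected afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for setting aside a stopped server's version-2 directory. */
final class DataDirBackup {

    /** Renames e.g. .../version-2 to .../version-2.<suffix> and returns the new path. */
    static Path setAside(Path versionDir, String suffix) throws IOException {
        Path backup = versionDir.resolveSibling(versionDir.getFileName() + "." + suffix);
        // A plain rename within the same directory; fails if the backup name already exists.
        return Files.move(versionDir, backup);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the version-2 directory; run only while the service is stopped.
        Path moved = setAside(Paths.get(args[0]), "bak-" + System.currentTimeMillis());
        System.out.println("moved to " + moved);
    }
}
```

After the service is started again, the server has no local epoch files and resyncs from the leader, which matches the behaviour described above.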
----------
> Zookeeper Unable to Join the Cluster after it is Restarted
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-3909
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.5.7
> Environment: All Environments
> Reporter: Saswati
> Priority: Critical
>