[
https://issues.apache.org/jira/browse/ZOOKEEPER-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mahadev konar updated ZOOKEEPER-1005:
-------------------------------------
Affects Version/s: (was: 3.4.0)
Fix Version/s: 3.4.0
> Zookeeper servers fail to elect a leader succesfully.
> -----------------------------------------------------
>
> Key: ZOOKEEPER-1005
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.2.2
> Environment: zookeeper-3.2.2; debian
> Reporter: Alexandre Hardy
> Fix For: 3.4.0
>
>
> We were running 3 zookeeper servers, and simulated a failure on one of the
> servers.
> The one zookeeper node follows the other, but has trouble connecting. It
> looks like the following exception is the cause:
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --
> [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.Follower] Following
> zookeeper3/192.168.131.11:2888
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --
> [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING
> java.net.ConnectException: -- Connection refused
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> {noformat}
> The last exception while connecting was:
> {noformat}
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --
> [org.apache.zookeeper.server.quorum.Follower] Unexpected exception
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: --
> Connection refused
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --
> [org.apache.zookeeper.server.quorum.Follower] Exception when following the
> leader
> {noformat}
> The leader started leading a bit later
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0,
> 94489312534, 25, 2, LOOKING, LOOKING, 0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
> {noformat}
> But at that time the follower had already terminated and started a new
> election, so the leader failed:
> {noformat}
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot
> /var/lib/zookeeper/version-2/snapshot.1600007d16
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting:
> 1600007d16
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
> notification.
> 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
> notification.
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
> notification.
> 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
> notification.
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
> [org.apache.zookeeper.server.quorum.Leader] Shutdown called
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at
> org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at
> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
> {noformat}
> From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
> {quote}
> The new entry, initLimit is timeouts ZooKeeper uses to limit the length of
> time the ZooKeeper servers in quorum have to connect to a leader
> {quote}
> Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a
> zookeeper server to contact the leader.
> However, in the source code
> src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
> {noformat}
> 152 for (int tries = 0; tries < 5; tries++) {
> 153 try {
> 154 //sock = new Socket();
> 155 //sock.setSoTimeout(self.tickTime *
> self.initLimit);
> 156 sock.connect(addr, self.tickTime *
> self.syncLimit);
> 157 sock.setTcpNoDelay(nodelay);
> 158 break;
> 159 } catch (IOException e) {
> 160 if (tries == 4) {
> 161 LOG.error("Unexpected exception",e);
> 162 throw e;
> 163 } else {
> 164 LOG.warn("Unexpected exception,
> tries="+tries,e);
> 165 sock = new Socket();
> 166 sock.setSoTimeout(self.tickTime *
> self.initLimit);
> 167 }
> 168 }
> 169 Thread.sleep(1000);
> 170 }
> {noformat}
> It appears as if we only have 4 seconds to contact the leader. The timeouts
> are applied to the socket, but do not take into account that the zookeeper
> leader may not have started its zookeeper service yet.
> Is this the expected behaviour? Or is the expected behaviour that followers
> should always be able to connect to the leader?
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira