[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Olivelli updated ZOOKEEPER-1005:
---------------------------------------
    Fix Version/s:     (was: 3.5.6)

> Zookeeper servers fail to elect a leader succesfully.
> -----------------------------------------------------
>
>                 Key: ZOOKEEPER-1005
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.2
>         Environment: zookeeper-3.2.2; debian
>            Reporter: Alexandre Hardy
>            Priority: Major
>             Fix For: 3.6.0, 3.5.7
>
>
> We were running 3 zookeeper servers, and simulated a failure on one of the 
> servers. 
> The one zookeeper node follows the other, but has trouble connecting. It 
> looks like the following exception is the cause:
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
> [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Following 
> zookeeper3/192.168.131.11:2888
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING 
> java.net.ConnectException: --  Connection refused
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> {noformat}
> The last exception while connecting was:
> {noformat}
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Unexpected exception
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: -- 
>  Connection refused
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Exception when following the 
> leader
> {noformat}
> The leader started leading a bit later 
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, 
> 94489312534, 25, 2, LOOKING, LOOKING, 0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
> {noformat}
> But at that time the follower had already terminated and started a new 
> election, so the leader failed:
> {noformat}
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot 
> /var/lib/zookeeper/version-2/snapshot.1600007d16
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: 
> 1600007d16
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Leader] Shutdown called
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
> org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
> {noformat}
> From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
> {quote}
> The new entry, initLimit is timeouts ZooKeeper uses to limit the length of 
> time the ZooKeeper servers in quorum have to connect to a leader
> {quote}
> Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a 
> zookeeper server to contact the leader.
> However, in the source code 
> src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
> {noformat}
> 152                 for (int tries = 0; tries < 5; tries++) {
> 153                     try {
> 154                         //sock = new Socket();
> 155                         //sock.setSoTimeout(self.tickTime * 
> self.initLimit);
> 156                         sock.connect(addr, self.tickTime * 
> self.syncLimit);
> 157                         sock.setTcpNoDelay(nodelay);
> 158                         break;
> 159                     } catch (IOException e) {
> 160                         if (tries == 4) {
> 161                             LOG.error("Unexpected exception",e);
> 162                             throw e;
> 163                         } else {
> 164                             LOG.warn("Unexpected exception, 
> tries="+tries,e);
> 165                             sock = new Socket();
> 166                             sock.setSoTimeout(self.tickTime * 
> self.initLimit);
> 167                         }
> 168                     }
> 169                     Thread.sleep(1000);
> 170                 }
> {noformat}
> It appears as if we only have 4 seconds to contact the leader. The timeouts 
> are applied to the socket, but do not take into account that the zookeeper 
> leader may not have started its zookeeper service yet. 
> Is this the expected behaviour? Or is the expected behaviour that followers 
> should always be able to connect to the leader?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to