[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001101#comment-13001101
 ] 

Vishal K commented on ZOOKEEPER-1005:
-------------------------------------

Hi Alexandre,

Can you attach the zookeeper logs and zoo.cfg files? Also, please describe your 
test in detail.

I am not 100% sure whether it is a good idea to wait for 40 seconds as you 
suggested. Heres a scenario to think about:
Assume that we have a slow follower instead of a slow leader. Before the 
follower enters the for loop in question, there is a change in leader. Now, 
when the follower attempts to connect to the peer that it thinks is a leader 
(the old leader), it will fail with the same error as you are seeing. In this 
case, it is not worth asking the follower to wait for 40 seconds in that loop. 
Currently, ZK does not  preempt a peer once it has exited leader election (and 
is either trying to be a leader or a follower).

Note that there have been many fixes in leader election since ZK 3.2.2.

> Zookeeper servers fail to elect a leader succesfully.
> -----------------------------------------------------
>
>                 Key: ZOOKEEPER-1005
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.2
>         Environment: zookeeper-3.2.2; debian
>            Reporter: Alexandre Hardy
>             Fix For: 3.4.0
>
>
> We were running 3 zookeeper servers, and simulated a failure on one of the 
> servers. 
> The one zookeeper node follows the other, but has trouble connecting. It 
> looks like the following exception is the cause:
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
> [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Following 
> zookeeper3/192.168.131.11:2888
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING 
> java.net.ConnectException: --  Connection refused
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> {noformat}
> The last exception while connecting was:
> {noformat}
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Unexpected exception
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: -- 
>  Connection refused
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.PlainSocketImpl.socketConnect(Native Method)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Follower] Exception when following the 
> leader
> {noformat}
> The leader started leading a bit later 
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, 
> 94489312534, 25, 2, LOOKING, LOOKING, 0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
> {noformat}
> But at that time the follower had already terminated and started a new 
> election, so the leader failed:
> {noformat}
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.ZooKeeperServer] Created server
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot 
> /var/lib/zookeeper/version-2/snapshot.1600007d16
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: 
> 1600007d16
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
> [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 
> 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
> notification.
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
> [org.apache.zookeeper.server.quorum.Leader] Shutdown called
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
> org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
> {noformat}
> From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
> {quote}
> The new entry, initLimit is timeouts ZooKeeper uses to limit the length of 
> time the ZooKeeper servers in quorum have to connect to a leader
> {quote}
> Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a 
> zookeeper server to contact the leader.
> However, in the source code 
> src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
> {noformat}
> 152                 for (int tries = 0; tries < 5; tries++) {
> 153                     try {
> 154                         //sock = new Socket();
> 155                         //sock.setSoTimeout(self.tickTime * 
> self.initLimit);
> 156                         sock.connect(addr, self.tickTime * 
> self.syncLimit);
> 157                         sock.setTcpNoDelay(nodelay);
> 158                         break;
> 159                     } catch (IOException e) {
> 160                         if (tries == 4) {
> 161                             LOG.error("Unexpected exception",e);
> 162                             throw e;
> 163                         } else {
> 164                             LOG.warn("Unexpected exception, 
> tries="+tries,e);
> 165                             sock = new Socket();
> 166                             sock.setSoTimeout(self.tickTime * 
> self.initLimit);
> 167                         }
> 168                     }
> 169                     Thread.sleep(1000);
> 170                 }
> {noformat}
> It appears as if we only have 4 seconds to contact the leader. The timeouts 
> are applied to the socket, but do not take into account that the zookeeper 
> leader may not have started its zookeeper service yet. 
> Is this the expected behaviour? Or is the expected behaviour that followers 
> should always be able to connect to the leader?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to