Raul Gutierrez Segales created ZOOKEEPER-2202:
-------------------------------------------------

             Summary: Cluster crashes when reconfig adds an unreachable observer
                 Key: ZOOKEEPER-2202
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2202
             Project: ZooKeeper
          Issue Type: Bug
            Reporter: Raul Gutierrez Segales


While adding support for reconfig() in Kazoo 
(https://github.com/python-zk/kazoo/pull/333), I found that the cluster can be 
crashed by adding an observer whose election port is unreachable (i.e., 
packets to that destination are dropped rather than rejected). The connection 
attempt raises a SocketTimeoutException, which brings down the 
PrepRequestProcessor:

{code}
2015-06-02 14:37:16,473 [myid:3] - WARN  [ProcessThread(sid:3 cport:-1)::QuorumCnxManager@384] - Cannot open channel to 100 at election address /8.8.8.8:38703
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:369)
        at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1288)
        at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1315)
        at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:1056)
        at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
        at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:877)
        at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:143)
{code}

A simple repro can be obtained by using the code in the pull request referenced 
above with 8.8.8.8:3888 (for example) instead of a free (but closed) port on 
the loopback interface. 
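
For illustration only, the reconfig call behind such a repro might look roughly 
like the sketch below. It assumes the reconfig() method added to KazooClient by 
the pull request above; observer_spec() and add_observer() are hypothetical 
helpers, and the quorum port 2888 and client port 2181 are placeholder values 
that do not come from this report. Do not run it against a cluster you care 
about, since the point of this issue is that it can bring the ensemble down.

```python
def observer_spec(sid, host, quorum_port, election_port, client_port):
    """Build a dynamic-config entry for an observer (hypothetical helper).

    Follows the ZooKeeper 3.5 format:
    server.<id>=<host>:<quorum port>:<election port>:<role>;<client address>
    """
    return "server.%d=%s:%d:%d:observer;0.0.0.0:%d" % (
        sid, host, quorum_port, election_port, client_port)


def add_observer(client, spec):
    """Ask the ensemble to add the server described by spec.

    client is assumed to be a started kazoo.client.KazooClient.
    """
    return client.reconfig(joining=spec, leaving=None, new_members=None)


# Intended usage (assumes a running 3.5 ensemble reachable on localhost):
#
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="127.0.0.1:2181")
#   zk.start()
#   # sid 100 matches the log above; 2888 and 2181 are placeholders.
#   add_observer(zk, observer_spec(100, "8.8.8.8", 2888, 3888, 2181))
```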

I think that adding an Observer (or a Participant) that isn't currently 
reachable is a valid use case (e.g., you are still provisioning the machine 
and it isn't needed yet), so perhaps we could handle this with lower connect 
timeouts, though I'm not sure. 
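
To make the dropped-versus-rejected distinction concrete, here is a small 
Python sketch (not ZooKeeper code) that times a TCP connect attempt. The 
probe() and closed_loopback_port() helpers are hypothetical names. A closed 
loopback port is normally refused almost immediately, whereas a destination 
that silently drops packets blocks the caller for the full connect timeout, 
which is why a lower timeout would shorten the stall.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Try a TCP connect and report (outcome, seconds elapsed)."""
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # No reply at all: packets were dropped, so we waited the full timeout.
        outcome = "timed out"
    except ConnectionRefusedError:
        # The peer actively rejected the connection; this returns quickly.
        outcome = "refused"
    except OSError:
        # Anything else, e.g. no route to the network.
        outcome = "error"
    else:
        sock.close()
        outcome = "connected"
    return outcome, time.monotonic() - start


def closed_loopback_port():
    """Return a loopback port that was free a moment ago and is now closed."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Calling probe("127.0.0.1", closed_loopback_port()) should report a refusal 
well before the timeout, while probing an address that drops packets would 
only return once the timeout expires.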




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
