[
https://issues.apache.org/jira/browse/ZOOKEEPER-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731197#comment-15731197
]
Michael Han commented on ZOOKEEPER-2202:
----------------------------------------
I agree with what Alex said. I think connectOne could be given a strong
guarantee that it never throws, since the higher level retries the
connection anyway. connectOne would simply swallow all types of exceptions,
clean up the sockets, and log the failure properly.
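A minimal sketch of that "never throws" contract, for illustration only. The class and method names (SafeConnect, tryConnect, closeQuietly) are hypothetical and are not the actual QuorumCnxManager code; the sketch only assumes that the caller treats a null result as "retry later".

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of ZooKeeper.
public class SafeConnect {
    private static final Logger LOG = Logger.getLogger(SafeConnect.class.getName());

    /**
     * Tries to connect to addr within timeoutMs milliseconds.
     * Returns the connected socket, or null on any failure.
     * No Exception escapes: the caller is expected to retry later.
     */
    static Socket tryConnect(InetSocketAddress addr, int timeoutMs) {
        Socket sock = new Socket();
        try {
            sock.connect(addr, timeoutMs);
            return sock;
        } catch (Exception e) {
            // Covers IOException (including SocketTimeoutException) as well as
            // unexpected runtime exceptions; log, clean up, and report failure.
            LOG.log(Level.WARNING, "Cannot open channel to " + addr, e);
            closeQuietly(sock);
            return null;
        }
    }

    /** Closes the socket and ignores any error raised while closing. */
    static void closeQuietly(Socket sock) {
        try {
            sock.close();
        } catch (IOException ignored) {
            // Nothing useful can be done if close() itself fails.
        }
    }
}
```

A caller would check the result, e.g. treat `SafeConnect.tryConnect(addr, 5000) == null` as "peer not reachable right now" and leave the retry to the higher level.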
I am also curious why the processor pipeline was brought down in this case.
According to the log posted in the JIRA description, the exception thrown is
{noformat}java.net.SocketTimeoutException{noformat}, which is a subclass of
{noformat}java.io.IOException{noformat}, and connectOne already catches
java.io.IOException, so everything should be fine.
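To illustrate the subclass relationship: in the JDK, java.net.SocketTimeoutException extends java.io.InterruptedIOException, which extends java.io.IOException, so a catch clause for IOException also handles a connect timeout. The class and method names below are hypothetical and only serve the demonstration.

```java
import java.io.IOException;

// Hypothetical demonstration class, not part of ZooKeeper.
public class TimeoutIsIOException {
    /** Returns true if the given exception is handled by a catch (IOException) clause. */
    static boolean caughtAsIOException(Exception e) {
        try {
            throw e;
        } catch (IOException io) {
            // SocketTimeoutException lands here because it is an IOException.
            return true;
        } catch (Exception other) {
            return false;
        }
    }
}
```

Passing `new java.net.SocketTimeoutException("connect timed out")` returns true, while an unrelated exception such as `new IllegalStateException()` returns false.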
Cross-checking against the "Cannot open channel to" message in the log, I
believe the exception gets caught
[here|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L455].
So, is java.net.SocketTimeoutException really the culprit that brought down
the request processor in this case?
> Cluster crashes when reconfig adds an unreachable observer
> ----------------------------------------------------------
>
> Key: ZOOKEEPER-2202
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2202
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.5.0, 3.6.0
> Reporter: Raul Gutierrez Segales
> Assignee: Raul Gutierrez Segales
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2202.patch
>
>
> While adding support for reconfig() in Kazoo
> (https://github.com/python-zk/kazoo/pull/333) I found that the cluster can be
> crashed if you add an observer whose election port isn't reachable (i.e.:
> packets for that destination are dropped, not rejected). This will raise a
> SocketTimeoutException which will bring down the PrepRequestProcessor:
> {code}
> 2015-06-02 14:37:16,473 [myid:3] - WARN [ProcessThread(sid:3 cport:-1)::QuorumCnxManager@384] - Cannot open channel to 100 at election address /8.8.8.8:38703
> java.net.SocketTimeoutException: connect timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:369)
> at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1288)
> at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1315)
> at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:1056)
> at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
> at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:877)
> at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:143)
> {code}
> A simple repro can be obtained by using the code in the referenced pull
> request above and using 8.8.8.8:3888 (for example) instead of a free (but
> closed) port in the loopback.
> I think that adding an Observer (or a Participant) that isn't currently
> reachable is a valid use case (i.e.: you are provisioning the machine and
> it's not currently needed) so I think we could handle this with lower connect
> timeouts, not sure.