JiangJiafu created ZOOKEEPER-2701:
-------------------------------------

             Summary: Timeout for RecvWorker is too long
                 Key: ZOOKEEPER-2701
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.4.8
         Environment: CentOS 6.5
                      ZooKeeper 3.4.8
            Reporter: JiangJiafu
            Priority: Minor

Environment:
I deploy ZooKeeper in a cluster of three nodes. Each node has three network interfaces (eth0, eth1, eth2). Hostnames are used instead of IP addresses in zoo.cfg, and quorumListenOnAllIPs=true.

Problem:
I start three ZooKeeper servers (node A, node B, and node C) one by one. When the leader election finishes, node B is the leader. Then I shut down one network interface of node A with the command "ifdown eth0". The ZooKeeper server on node A loses its connections to node B and node C.

In my test, it takes about 20 minutes before the ZooKeeper server on node A notices the event and tries to call QuorumServer.recreateSocketAddress to resolve the hostname again.

I read the source code and found the following:

{code:title=QuorumCnxManager.java|borderStyle=solid}
    class RecvWorker extends ZooKeeperThread {
        Long sid;
        Socket sock;
        volatile boolean running = true;
        final DataInputStream din;
        final SendWorker sw;

        RecvWorker(Socket sock, DataInputStream din, Long sid, SendWorker sw) {
            super("RecvWorker:" + sid);
            this.sid = sid;
            this.sock = sock;
            this.sw = sw;
            this.din = din;
            try {
                // OK to wait until socket disconnects while reading.
                sock.setSoTimeout(0);
            } catch (IOException e) {
                LOG.error("Error while accessing socket for " + sid, e);
                closeSocket(sock);
                running = false;
            }
        }
        ...
    }
{code}

I notice that the socket read timeout (SO_TIMEOUT) is set to 0 in the RecvWorker constructor, which means a read blocks indefinitely until the socket is closed or fails. I think this is reasonable when the IP address of a ZooKeeper server never changes, but considering that the IP address of each ZooKeeper server may change, maybe we had better set a timeout here.

I am not sure this is really a problem.
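
To illustrate the difference a non-zero read timeout makes, here is a minimal standalone sketch. It is not ZooKeeper code: the class name ReadTimeoutSketch, the method readTimesOut, and the 200 ms value are all made up for the example. It connects to a local peer that never sends anything and shows that a blocking read returns with a SocketTimeoutException once SO_TIMEOUT expires, instead of waiting forever as it would with setSoTimeout(0).

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ZooKeeper.
public class ReadTimeoutSketch {

    /**
     * Connects to a silent local peer and attempts a blocking read.
     * Returns true if the read was interrupted by SO_TIMEOUT.
     * timeoutMs must be greater than 0; a value of 0 would block forever
     * here, because the peer never writes and never closes during the read.
     */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            // Non-zero timeout: read() gives up after timeoutMs milliseconds.
            client.setSoTimeout(timeoutMs);
            DataInputStream din = new DataInputStream(client.getInputStream());
            try {
                din.readInt(); // the peer sends nothing, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                // The socket is still usable here; the caller can decide
                // whether to retry the read or tear down the connection.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

In a worker like RecvWorker, a timeout of this kind would give the thread a regular chance to check whether the peer is still reachable. What timeout value is appropriate, and what the worker should do when it fires, is an open question that this sketch does not answer.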