Hi all,

We had an interesting problem with HBase and ZooKeeper, and I would like to know your thoughts on it.
I have an HBase client that reads data from a queue and stores it in HBase. While the client was running, one of my colleagues stopped the ZK fleet (3 hosts), removed the ZK data from zoo.dataDir and restarted it (he wanted a fresh ZK fleet for a test). After that he restarted the HBase fleet.

The HBase client noticed that the ZK fleet had been restarted, but after ZK came back online it was not able to reconnect or to close/expire the session. The client was stuck in an endless loop trying to reconnect. I left the client running for several minutes and nothing happened:

Tue Jun 07 12:52:03 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-03.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Opening socket connection to server pa-zk-na-01.aka.domain.com/10.119.206.58:2181
Tue Jun 07 12:52:03 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-01.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Socket connection established to pa-zk-na-01.aka.domain.com/10.119.206.58:2181, initiating session
Tue Jun 07 12:52:03 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-01.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x30512689459590, likely server has closed socket, closing socket connection and attempting reconnect
Tue Jun 07 12:52:03 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-01.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Opening socket connection to server pa-zk-na-02.aka.domain.com/10.194.180.66:2181
Tue Jun 07 12:52:03 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-02.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Socket connection established to pa-zk-na-02.aka.domain.com/10.194.180.66:2181, initiating session
Tue Jun 07 12:52:03 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-02.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x30512689459590, likely server has closed socket, closing socket connection and attempting reconnect
Tue Jun 07 12:52:04 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-02.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Opening socket connection to server pa-zk-na-03.aka.domain.com/10.254.106.137:2181
Tue Jun 07 12:52:04 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-03.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Socket connection established to pa-zk-na-03.aka.domain.com/10.254.106.137:2181, initiating session
Tue Jun 07 12:52:04 2011 GMT Client [email protected]:0[INFO] (main-SendThread( pa-zk-na-03.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x30512689459590, likely server has closed socket, closing socket connection and attempting reconnect

I checked the HBase code (ZooKeeperWatcher.java), and the connectionEvent(WatchedEvent event) method seems to ignore the Disconnected event. I do not expect my session to be terminated as soon as a Disconnected event is received, but I do expect it to be terminated if the client cannot reconnect within some period of time (for example the ZK session timeout or the negotiated timeout).

http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A3

The ZK wiki says that the client has to reconnect in order to receive the Expired event, but this is not always possible. The ZK client library should itself raise the SessionExpired event (or a similar event such as ClientSessionExpired) when the client has been disconnected for more than X seconds. I assume there are other cases where the client and the quorum are both up and running but cannot communicate (a network split, for example). I think the ZK client library and the quorum should each act independently and expire the session on their own side.

Regards,
Bogdan
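
P.S. To make the proposal concrete, here is a rough sketch of the client-side timer I have in mind. It is only an illustration: DisconnectWatchdog and its methods are hypothetical names, not part of ZooKeeper or HBase, and the class deliberately has no ZooKeeper dependency. The assumption is that the application's watcher would call onDisconnected() when it sees KeeperState.Disconnected and onConnected() when it sees KeeperState.SyncConnected, and would poll isLocallyExpired() to decide when to give up on the session.

```java
/**
 * Hypothetical helper (not a ZooKeeper or HBase class): tracks how long the
 * client has been disconnected and reports a local "expired" verdict once
 * the outage has lasted at least the session timeout.
 */
final class DisconnectWatchdog {
    private final long sessionTimeoutMillis;
    // -1 means "currently connected"; otherwise the time the outage began.
    private long disconnectedSinceMillis = -1;

    DisconnectWatchdog(long sessionTimeoutMillis) {
        if (sessionTimeoutMillis <= 0) {
            throw new IllegalArgumentException("sessionTimeoutMillis must be positive");
        }
        this.sessionTimeoutMillis = sessionTimeoutMillis;
    }

    /** Assumed to be called by the watcher on a Disconnected event. */
    synchronized void onDisconnected(long nowMillis) {
        // Keep the first timestamp so that repeated failed reconnect
        // attempts do not reset the clock.
        if (disconnectedSinceMillis < 0) {
            disconnectedSinceMillis = nowMillis;
        }
    }

    /** Assumed to be called by the watcher on a SyncConnected event. */
    synchronized void onConnected() {
        disconnectedSinceMillis = -1;
    }

    /** True once the client has been disconnected for the full timeout. */
    synchronized boolean isLocallyExpired(long nowMillis) {
        return disconnectedSinceMillis >= 0
                && nowMillis - disconnectedSinceMillis >= sessionTimeoutMillis;
    }
}
```

With something like this, a client in the reconnect loop shown in the log above could close its ZooKeeper handle and create a new session once isLocallyExpired() returns true, instead of retrying forever. Taking the timestamps as parameters is just a choice to keep the sketch easy to test; a real version would probably read a monotonic clock.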
