Greetings,

We saw an issue recently that I've never seen before, and I'm hoping to get some clarity on what may cause it and whether it's a known issue. We had a 5-node ensemble and were unable to connect to one of the ZooKeeper instances. When trying to connect with zkCli, the connection would time out. When I connected via telnet and issued the srvr four-letter word, I was surprised to see that this one server reported a massive number of 'Outstanding' requests. I'd never really seen that be anything other than 0 before. The ZK dev guide says:
"outstanding is the number of queued requests, this increases when the server is under load and is receiving more sustained requests than it can process, ie the request queue".

I looked at all the ZK servers in my ensemble:

    for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 | grep Outstanding; done
    Outstanding: 0
    Outstanding: 0
    Outstanding: 0
    Outstanding: 0
    Outstanding: 18876

I eventually killed ZK on the affected server, and everything corrected itself: Outstanding went to zero and I was able to connect again.

Is this something anyone's familiar with? I have logs if they would be helpful.

Thanks!
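For anyone wanting to watch for this, the srvr loop above can be turned into a small check that flags a server whose Outstanding count looks abnormal. This is only a sketch under my own assumptions: the script name, the helper function names, and the threshold of 100 queued requests are hypothetical choices, not anything ZooKeeper defines, and it assumes a netcat that accepts -w as a timeout.

```shell
#!/usr/bin/env bash
# Sketch: flag ZooKeeper servers whose `srvr` output reports a large "Outstanding" count.
# Assumptions (not from ZooKeeper itself): client port 2181, a netcat supporting -w,
# and an arbitrary alert threshold of 100 queued requests.

THRESHOLD="${THRESHOLD:-100}"   # hypothetical alert threshold, tune for your workload
PORT="${PORT:-2181}"            # ZooKeeper client port used in the loop above

# Read `srvr` output on stdin and print the Outstanding value (empty if absent).
outstanding_from_srvr() {
  awk -F': *' '$1 == "Outstanding" { print $2; exit }'
}

# Classify a count: UNREACHABLE if empty, HIGH if above THRESHOLD, otherwise OK.
classify() {
  local count="$1"
  if [[ -z "$count" ]]; then
    echo "UNREACHABLE"
  elif (( count > THRESHOLD )); then
    echo "HIGH"
  else
    echo "OK"
  fi
}

# Query each host with `srvr` and print: host, Outstanding count ("-" if none), status.
check_ensemble() {
  local host count
  for host in "$@"; do
    count=$(echo srvr | nc -w 2 "$host" "$PORT" 2>/dev/null | outstanding_from_srvr)
    printf '%s\t%s\t%s\n' "$host" "${count:--}" "$(classify "$count")"
  done
}

# Only touch the network when run directly with host arguments.
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  check_ensemble "$@"
fi
```

For example, running bash zk_outstanding.sh 172.21.20.{101..105} would print one tab-separated line per server, so the one reporting 18876 would show up as HIGH. A server that doesn't answer srvr within the timeout is reported as UNREACHABLE rather than OK.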