To answer the remaining client-side questions - >> Do you have multiple threads sharing a session? No
>> Q: what are you clients doing? It's weird that a create would come from the client after the session has been closed. Here are the client side logs. It seems like the client session kept timing out. 2011/11/08 11:46:13.423 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Client session timed out, have not heard from server in 4000ms for sessionid 0x43220b966ee3930, closing socket connection and attempting reconnect 2011/11/08 11:46:13.523 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected)2011/11/08 11:46:13.562 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:129 132011/11/08 11:46:13.563 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Socket connection established to esv4-app30.stg/172.18.98.90:12913, initiating session2011/11/08 11:46:14.173 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Unable to reconnect to ZooKeeper service, session 0x43220b966ee3930 has expired, closing socket connection2011/11/08 11:46:14.195 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Expired) 2011/11/08 11:46:14.195 INFO [ZooKeeper] [main-EventThread] [kafka] Initiating client connection, connectString=app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/kafka-tracking sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@679a9d19 2011/11/08 11:46:14.263 INFO [ClientCnxn] [main-SendThread()] [kafka] Opening socket connection to server app28.stg/172.18.98.101:129132011/11/0811:46:14.263 INFO [KafkaZooKeeper] [ZkClient-EventThread-495-esv4-app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/k afka-tracking] [kafka] re-registering broker info in ZK for broker 5282011/11/08 11:46:14.263 INFO [ClientCnxn] [main-EventThread] [kafka] EventThread shut down 2011/11/08 11:46:14.263 INFO [KafkaZooKeeper] [ZkClient-EventThread-495-app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/kafka-tracking] [kafka] Registering broker /brokers/ids/528 2011/11/08 11:46:14.264 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/11/08 11:46:15.764 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Client session timed out, have not heard from server in 1500ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:16.448 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/11/08 11:46:16.448 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/11/08 11:46:17.948 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Client session timed out, have not heard from server in 1500ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:18.168 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/11/08 11:46:18.169 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Socket connection established to esv4-app30.stg/172.18.98.90:12913, initiating session 2011/11/08 11:46:19.669 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Client session timed out, have not heard from server in 1500ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:20.042 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Opening socket connection to server esv4-app27.stg/172.18.98.88:12913 2011/11/08 11:46:20.042 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Socket connection established to esv4-app27.stg/172.18.98.88:12913, initiating session 2011/11/08 11:46:21.543 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:21.841 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/11/08 11:46:21.842 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/11/08 11:46:23.343 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:23.984 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/11/08 11:46:23.985 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/11/08 11:46:25.486 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:25.637 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/11/08 11:46:25.637 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Socket connection established to esv4-app30.stg/172.18.98.90:12913, initiating session 2011/11/08 11:46:26.065 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Session establishment complete on server esv4-app30.stg/172.18.98.90:12913, sessionid = 0x43220b966ee394a, negotiated timeout = 6000 2011/11/08 11:46:26.066 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) I will attach our conversation to the bug,. Thanks, Neha On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[email protected]>wrote: > Thanks for the quick responses, guys! Please find my replies inline - > > > >> 1) Why is the session closed, the client closed it or the cluster > expired it? > Cluster expired it. > > > >> 2) which server was the session attached to - the first (44sec max > lat) or one of the others? Which server was the leader? > We didn't run srvr, so no idea which server was the leader. I still have > the zookeeper log4j and txn logs, as well as the client logs. Where do I > look to find this out ? > > > >> 3) the znode exists on all 4 servers, is that right? > Yes > > > >> 5) why are your max latencies, as well as avg latency, so high? > >> a) are these dedicated boxes, not virtualized, correct? > these are dedicated boxes, but zk is currently co-located with kafka, but > on different disks > > > >> b) is the jvm going into gc pause? (try turning on verbose logging, or > use "jstat" with the gc options to see the history if you still have > those jvms running) > I don't believe we had gc logs on these machines. So its unclear. > > > >> d) do you have dedicated spindles for the ZK WAL? If not another > process might be causing the fsyncs to pause. (you can use iostat or > strace to monitor this) > No. The log4j and zk txn logs share the same disks. > > > >> Is that the log from the server that's got the 44sec max latency? > Yes. > > >> This is 3.3.3 ? > Yes. > > > >> was there any instability in the quorum itself during this time > period? > How do I find that out ? > > Thanks, > Neha > > > On Thu, Nov 10, 2011 at 1:34 PM, Camille Fournier <[email protected]>wrote: > >> This is zk 3.3.3? >> >> From my phone >> On Nov 10, 2011 4:02 PM, "Patrick Hunt" <[email protected]> wrote: >> >>> On Thu, Nov 10, 2011 at 12:20 PM, Patrick Hunt <[email protected]> wrote: >>> > Q: what are you clients doing? It's weird that a create would come >>> > from the client after the session has been closed. Do you have >>> > multiple threads sharing a session? >>> >>> The client (checked java) seems to protect against this. Was the >>> session expired? IC. Based on the cxid of the close session being 0 >>> I'm guessing it's an expiration. >>> >>> Patrick >>> >> >
