ps. was there any instability in the quorum itself during this time period? Attaching logs to 1208 would be helpful.
Patrick On Thu, Nov 10, 2011 at 11:50 AM, Patrick Hunt <[email protected]> wrote: > Great report, some addl questions: > > 1) Why is the session closed, the client closed it or the cluster expired it? > > 2) which server was the session attached to - the first (44sec max > lat) or one of the others? Which server was the leader? > > 3) the znode exists on all 4 servers, is that right? > > 4) it's clear that a znode is being created for a session that's just > been closed. This is certainly a bug. Perhaps a timing issue due to > the huge latency hit that you reported. > > 5) why are your max latencies, as well as avg latency, so high? > a) are these dedicated boxes, not virtualized, correct? > b) is the jvm going into gc pause? (try turning on verbose logging, or > use "jstat" with the gc options to see the history if you still have > those jvms running) > c) swapping? > d) do you have dedicated spindles for the ZK WAL? If not another > process might be causing the fsyncs to pause. (you can use iostat or > strace to monitor this) > > My guess is b given it's the time btw the commit processor and > finalreq processor that's so long. Is that the log from the server > that's got the 44sec max latency? > > Patrick > > On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <[email protected]> > wrote: >> Hi, >> >> We are seeing this issue again with our Kafka zookeeper setup. Let me >> explain with the following log entries - >> >> Here is the node that didn't get deleted - >> >> get /kafka-tracking/brokers/ids/528 >> 172.18.98.101-1320781574264:172.18.98.101:10251 >> cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011 >> mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011 >> pZxid = 0x605f90a78cversion = 0 >> dataVersion = 0 >> aclVersion = 0 >> ephemeralOwner = 0x23220b93b4a33bc >> dataLength = 47 >> numChildren = 0 >> >> I ran Kishore's awesome version of the LogFormatter tool on the >> relevant portion of the zookeeper transaction logs - >> >> time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0 >> zxid:0x605f90a59 type:createSession timeOut:6000 >> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0 >> zxid:0x605f90a72 type:closeSession >> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1 >> zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone} >> >> We want to understand why there was a 12 second delay in creation of >> znode after createSession >> >> Checking the logs, we found that the there is a 30+ sec delay between >> the commitProcessor and FinalRequestProcessor. >> >> 2011-11-08 11:46:14,661 - DEBUG >> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >> request:: sessionid:0x23220b93b4a33bc type:createSession >> cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a >> 2011-11-08 11:46:26,045 - DEBUG >> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >> request:: sessionid:0x23220b93b4a33bc type:closeSession >> cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a >> 2011-11-08 11:46:26,047 - DEBUG >> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >> request:: sessionid:0x23220b93b4a33bc type:create cxid:0 >> x1 zxid:0x605f90a78 txntype:1 reqpath:n/a >> 2011-11-08 11:46:59,539 - DEBUG >> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: >> sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 z >> xid:0x605f90a59 txntype:-10 reqpath:n/a >> 2011-11-08 11:46:59,587 - DEBUG >> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: >> sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zx >> id:0x605f90a72 txntype:-11 reqpath:n/a >> 2011-11-08 11:46:59,589 - DEBUG >> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: >> sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x6 >> 05f90a78 txntype:1 reqpath:n/a >> >> Here is the stat output from our zookeeper servers - >> >> Latency min/avg/max: 0/53/44712 >> Latency min/avg/max: 0/11/11727 >> Latency min/avg/max: 0/12/11994 >> Latency min/avg/max: 0/9/11707 >> >> The relevant bug is here - >> https://issues.apache.org/jira/browse/ZOOKEEPER-1208 >> >> This is a pretty serious issue since the fix is to have someone >> manually delete the ephemeral node. And until that is done, it blocks >> the startup of either the consumer client or the Kafka servers. Any >> help here in debugging this issue is appreciated ! >> >> Thanks, >> Neha >> >
