I believe I see the issue. PrepRequestProcessor checks the session is valid on a create.
However, it doesn't know about an in-progress closeSession request. The
session is not invalidated (in the session tracker) until
FinalRequestProcessor, at which point a create in PrepRequestProcessor
would fail. Prior to that, however, it will get into the pipeline,
resulting in the issue you are seeing.

Q: what are your clients doing? It's weird that a create would come from
the client after the session has been closed. Do you have multiple
threads sharing a session?

Patrick

On Thu, Nov 10, 2011 at 11:54 AM, Patrick Hunt <[email protected]> wrote:
> ps. was there any instability in the quorum itself during this time
> period? Attaching logs to 1208 would be helpful.
>
> Patrick
>
> On Thu, Nov 10, 2011 at 11:50 AM, Patrick Hunt <[email protected]> wrote:
>> Great report, some additional questions:
>>
>> 1) Why was the session closed - did the client close it, or did the
>> cluster expire it?
>>
>> 2) Which server was the session attached to - the first (44 sec max
>> latency) or one of the others? Which server was the leader?
>>
>> 3) The znode exists on all 4 servers, is that right?
>>
>> 4) It's clear that a znode is being created for a session that's just
>> been closed. This is certainly a bug. Perhaps a timing issue due to
>> the huge latency hit that you reported.
>>
>> 5) Why are your max latencies, as well as avg latency, so high?
>> a) Are these dedicated boxes, not virtualized, correct?
>> b) Is the jvm going into gc pause? (Try turning on verbose gc logging,
>> or use "jstat" with the gc options to see the history if you still
>> have those jvms running.)
>> c) Swapping?
>> d) Do you have dedicated spindles for the ZK WAL? If not, another
>> process might be causing the fsyncs to pause. (You can use iostat or
>> strace to monitor this.)
>>
>> My guess is (b), given it's the time between the CommitProcessor and
>> FinalRequestProcessor that's so long. Is that the log from the server
>> that's got the 44 sec max latency?
>>
>> Patrick
>>
>> On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <[email protected]>
>> wrote:
>>> Hi,
>>>
>>> We are seeing this issue again with our Kafka zookeeper setup. Let me
>>> explain with the following log entries -
>>>
>>> Here is the node that didn't get deleted -
>>>
>>> get /kafka-tracking/brokers/ids/528
>>> 172.18.98.101-1320781574264:172.18.98.101:10251
>>> cZxid = 0x605f90a78
>>> ctime = Tue Nov 08 11:46:26 PST 2011
>>> mZxid = 0x605f90a78
>>> mtime = Tue Nov 08 11:46:26 PST 2011
>>> pZxid = 0x605f90a78
>>> cversion = 0
>>> dataVersion = 0
>>> aclVersion = 0
>>> ephemeralOwner = 0x23220b93b4a33bc
>>> dataLength = 47
>>> numChildren = 0
>>>
>>> I ran Kishore's awesome version of the LogFormatter tool on the
>>> relevant portion of the zookeeper transaction logs -
>>>
>>> time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0 zxid:0x605f90a59 type:createSession timeOut:6000
>>> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0 zxid:0x605f90a72 type:closeSession
>>> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1 zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone}
>>>
>>> We want to understand why there was a 12-second delay between the
>>> createSession and the creation of the znode.
>>>
>>> Checking the logs, we found that there is a 30+ second delay between
>>> the CommitProcessor and the FinalRequestProcessor.
>>>
>>> 2011-11-08 11:46:14,661 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing request:: sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a
>>> 2011-11-08 11:46:26,045 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing request:: sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a
>>> 2011-11-08 11:46:26,047 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing request:: sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x605f90a78 txntype:1 reqpath:n/a
>>> 2011-11-08 11:46:59,539 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a
>>> 2011-11-08 11:46:59,587 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a
>>> 2011-11-08 11:46:59,589 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x605f90a78 txntype:1 reqpath:n/a
>>>
>>> Here is the stat output from our zookeeper servers -
>>>
>>> Latency min/avg/max: 0/53/44712
>>> Latency min/avg/max: 0/11/11727
>>> Latency min/avg/max: 0/12/11994
>>> Latency min/avg/max: 0/9/11707
>>>
>>> The relevant bug is here -
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1208
>>>
>>> This is a pretty serious issue since the fix is to have someone
>>> manually delete the ephemeral node. And until that is done, it blocks
>>> the startup of either the consumer client or the Kafka servers. Any
>>> help here in debugging this issue is appreciated!
>>>
>>> Thanks,
>>> Neha
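
A minimal sketch of the pipeline race described at the top of this thread,
assuming a simplified two-stage model. The names used here
(PipelineRaceSketch, liveSessions, prepCreate, finalCloseSession) are
hypothetical stand-ins, not ZooKeeper's actual internals; the point is only
that the session check happens in the prep stage while the session is
removed in the final stage, so a create queued behind a closeSession can
pass validation and still be applied.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class PipelineRaceSketch {

    // Stand-in for the session tracker: a session is only removed here
    // when the closeSession txn reaches the final stage of the pipeline.
    static final Set<Long> liveSessions = ConcurrentHashMap.newKeySet();

    // Stand-in for PrepRequestProcessor: validates the session, but has
    // no knowledge of a closeSession that is still queued behind it.
    static boolean prepCreate(long sessionId) {
        return liveSessions.contains(sessionId);
    }

    // Stand-in for FinalRequestProcessor: this is where the session is
    // actually invalidated.
    static void finalCloseSession(long sessionId) {
        liveSessions.remove(sessionId);
    }

    public static void main(String[] args) {
        long session = 0x23220b93b4a33bcL;
        liveSessions.add(session);

        // 1. A closeSession enters the pipeline, but the final stage has
        //    not processed it yet (e.g. the 30+ second stall between the
        //    CommitProcessor and FinalRequestProcessor seen in the logs).

        // 2. A create for the same session is prepped in the meantime.
        //    The check passes because the tracker still lists the session.
        System.out.println("create accepted by prep stage: "
                + prepCreate(session));   // true

        // 3. Only now does the final stage apply the closeSession ...
        finalCloseSession(session);

        // 4. ... and the already-committed create is applied afterwards,
        //    leaving an ephemeral znode owned by a session that no longer
        //    exists.
        System.out.println("session still live when create is applied: "
                + liveSessions.contains(session));   // false
    }
}

With a 30+ second gap between those two stages, as in the logs above, the
window for this race is wide, which matches the create being applied
immediately after the closeSession for the same session.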
