On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[email protected]> wrote: > Thanks for the quick responses, guys! Please find my replies inline - > >>> 1) Why is the session closed, the client closed it or the cluster > expired it? > Cluster expired it. >
Yes, I realized after that the cxid is 0 in your logs - that indicates it was expired and not closed explicitly by the client. >>> 3) the znode exists on all 4 servers, is that right? > Yes > This holds up my theory that the PrepRequestProcessor is accepting a create from the client after the session has been expired. >>> 5) why are your max latencies, as well as avg latency, so high? >>> a) are these dedicated boxes, not virtualized, correct? > these are dedicated boxes, but zk is currently co-located with kafka, but > on different disks > >>> b) is the jvm going into gc pause? (try turning on verbose logging, or > use "jstat" with the gc options to see the history if you still have > those jvms running) > I don't believe we had gc logs on these machines. So its unclear. > >>> d) do you have dedicated spindles for the ZK WAL? If not another > process might be causing the fsyncs to pause. (you can use iostat or > strace to monitor this) > No. The log4j and zk txn logs share the same disks. > >>> Is that the log from the server that's got the 44sec max latency? > Yes. > >>> This is 3.3.3 ? > Yes. > >>> was there any instability in the quorum itself during this time > period? > How do I find that out ? The logs would indicate if an election happens. Look for "LOOKING" or "LEADING" or "FOLLOWING". Your comments are consistent with my theory. Seems like a bug in PRP session validation to me. Patrick
