Ok, patch posted that fixes this (1208). Committers, please take a look. Neha, you might want to give a patched version a try. Awesome job helping to document and track down this issue. Thanks!
Patrick

On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[email protected]> wrote:
> See my update to 1208 for a test that demonstrates this.
>
> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[email protected]> wrote:
>> Thanks Patrick for looking into this issue!
>>
>>>> The logs would indicate if an election happens. Look for "LOOKING" or "LEADING" or "FOLLOWING".
>>
>> The logs don't have any such entries. So I'm guessing there was no election happening.
>>
>> Do you have thoughts, though, on how easy it would be to reproduce this bug, to verify the bug fix?
>>
>> Thanks,
>> Neha
>>
>>
>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[email protected]> wrote:
>>
>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[email protected]> wrote:
>>> > Thanks for the quick responses, guys! Please find my replies inline -
>>> >
>>> >>> 1) Why is the session closed, the client closed it or the cluster expired it?
>>> > Cluster expired it.
>>> >
>>> Yes, I realized after that the cxid is 0 in your logs - that indicates it was expired and not closed explicitly by the client.
>>>
>>> >>> 3) the znode exists on all 4 servers, is that right?
>>> > Yes
>>> >
>>> This holds up my theory that the PrepRequestProcessor is accepting a create from the client after the session has been expired.
>>>
>>> >>> 5) why are your max latencies, as well as avg latency, so high?
>>> >>> a) are these dedicated boxes, not virtualized, correct?
>>> > these are dedicated boxes, but zk is currently co-located with kafka, although on different disks
>>> >
>>> >>> b) is the jvm going into gc pause? (try turning on verbose logging, or use "jstat" with the gc options to see the history if you still have those jvms running)
>>> > I don't believe we had gc logs on these machines. So it's unclear.
>>> >
>>> >>> d) do you have dedicated spindles for the ZK WAL? If not, another process might be causing the fsyncs to pause. (you can use iostat or strace to monitor this)
>>> > No. The log4j and zk txn logs share the same disks.
>>> >
>>> >>> Is that the log from the server that's got the 44sec max latency?
>>> > Yes.
>>> >
>>> >>> This is 3.3.3?
>>> > Yes.
>>> >
>>> >>> was there any instability in the quorum itself during this time period?
>>> > How do I find that out?
>>>
>>> The logs would indicate if an election happens. Look for "LOOKING" or "LEADING" or "FOLLOWING".
>>>
>>> Your comments are consistent with my theory. Seems like a bug in PRP session validation to me.
>>>
>>> Patrick
>>>
>>
>
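For reference, the diagnostics suggested in the thread boil down to roughly the following commands; the log path and the server PID below are placeholders, not values from the thread, so substitute whatever the actual deployment uses:

    # Did a leader election happen during the incident window?
    # (elections show up as LOOKING / LEADING / FOLLOWING state changes in the server log)
    grep -E "LOOKING|LEADING|FOLLOWING" /var/log/zookeeper/zookeeper.log

    # Is the JVM pausing for GC? Sample GC utilization of the running server once per second.
    jstat -gcutil <zk-server-pid> 1000

    # Are fsyncs of the txn log stalling because the disk is shared with other writers?
    iostat -x 5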
