Pat, That is excellent turnaround ! I will take a look at the running the test as well as your patch. Will be a good opportunity for me to start understanding the zookeeper codebase.
Thanks again, Neha On Thursday, November 10, 2011, Patrick Hunt <[email protected]> wrote: > Ok, patch posted that fixes this (1208). Committers please take a look. > > Neha you might want to give a patched version a try. Awesome job > helping to document and track down this issue. Thanks! > > Patrick > > On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[email protected]> wrote: >> See my update to 1208 for a test that demonstrates this. >> >> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[email protected]> wrote: >>> Thanks Patrick for looking into this issue ! >>> >>>>> The logs would indicate if an election happens. Look for "LOOKING" or >>> "LEADING" or "FOLLOWING". >>> >>> The logs don't have any such entries. So I'm guessing there was no election >>> happening. >>> >>> Do you have thoughts, though, on how easy it would be to reproduce this >>> bug, to verify the bug fix ? >>> >>> Thanks, >>> Neha >>> >>> >>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[email protected]> wrote: >>> >>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[email protected] > >>>> wrote: >>>> > Thanks for the quick responses, guys! Please find my replies inline - >>>> > >>>> >>> 1) Why is the session closed, the client closed it or the cluster >>>> > expired it? >>>> > Cluster expired it. >>>> > >>>> >>>> Yes, I realized after that the cxid is 0 in your logs - that indicates >>>> it was expired and not closed explicitly by the client. >>>> >>>> >>>> >>> 3) the znode exists on all 4 servers, is that right? >>>> > Yes >>>> > >>>> >>>> This holds up my theory that the PrepRequestProcessor is accepting a >>>> create from the client after the session has been expired. >>>> >>>> >>> 5) why are your max latencies, as well as avg latency, so high? >>>> >>> a) are these dedicated boxes, not virtualized, correct? >>>> > these are dedicated boxes, but zk is currently co-located with kafka, but >>>> > on different disks >>>> > >>>> >>> b) is the jvm going into gc pause? (try turning on verbose logging, or >>>> > use "jstat" with the gc options to see the history if you still have >>>> > those jvms running) >>>> > I don't believe we had gc logs on these machines. So its unclear. >>>> > >>>> >>> d) do you have dedicated spindles for the ZK WAL? If not another >>>> > process might be causing the fsyncs to pause. (you can use iostat or >>>> > strace to monitor this) >>>> > No. The log4j and zk txn logs share the same disks. >>>> > >>>> >>> Is that the log from the server that's got the 44sec max latency? >>>> > Yes. >>>> > >>>> >>> This is 3.3.3 ? >>>> > Yes. >>>> > >>>> >>> was there any instability in the quorum itself during this time >>>> > period? >>>> > How do I find that out ? >>>> >>>> The logs would indicate if an election happens. Look for "LOOKING" or >>>> "LEADING" or "FOLLOWING". >>>> >>>> >>>> Your comments are consistent with my theory. Seems like a bug in PRP >>>> session validation to me. >>>> >>>> Patrick >>>> >>> >> >
