Re: ephemeral node not deleted after client session closed

Neha Narkhede Fri, 11 Nov 2011 07:21:40 -0800

Pat,

That is excellent turnaround ! I will take a look at the running the test
as well as your patch. Will be a good opportunity for me to start
understanding the zookeeper codebase.


Thanks again,
Neha

On Thursday, November 10, 2011, Patrick Hunt <[email protected]> wrote:
> Ok, patch posted that fixes this (1208). Committers please take a look.
>
> Neha you might want to give a patched version a try. Awesome job
> helping to document and track down this issue. Thanks!
>
> Patrick
>
> On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[email protected]> wrote:
>> See my update to 1208 for a test that demonstrates this.
>>
>> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[email protected]>
wrote:
>>> Thanks Patrick for looking into this issue !
>>>
>>>>> The logs would indicate if an election happens. Look for "LOOKING" or
>>> "LEADING" or "FOLLOWING".
>>>
>>> The logs don't have any such entries. So I'm guessing there was no
election
>>> happening.
>>>
>>> Do you have thoughts, though, on how easy it would be to reproduce this
>>> bug, to verify the bug fix ?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[email protected]> wrote:
>>>
>>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[email protected]
>
>>>> wrote:
>>>> > Thanks for the quick responses, guys! Please find my replies inline -
>>>> >
>>>> >>> 1) Why is the session closed, the client closed it or the cluster
>>>> > expired it?
>>>> > Cluster expired it.
>>>> >
>>>>
>>>> Yes, I realized after that the cxid is 0 in your logs - that indicates
>>>> it was expired and not closed explicitly by the client.
>>>>
>>>>
>>>> >>> 3) the znode exists on all 4 servers, is that right?
>>>> > Yes
>>>> >
>>>>
>>>> This holds up my theory that the PrepRequestProcessor is accepting a
>>>> create from the client after the session has been expired.
>>>>
>>>> >>> 5) why are your max latencies, as well as avg latency, so high?
>>>> >>> a) are these dedicated boxes, not virtualized, correct?
>>>> > these are dedicated boxes, but zk is currently co-located with
kafka, but
>>>> > on different disks
>>>> >
>>>> >>> b) is the jvm going into gc pause? (try turning on verbose
logging, or
>>>> > use "jstat" with the gc options to see the history if you still have
>>>> > those jvms running)
>>>> > I don't believe we had gc logs on these machines. So its unclear.
>>>> >
>>>> >>> d) do you have dedicated spindles for the ZK WAL? If not another
>>>> > process might be causing the fsyncs to pause. (you can use iostat or
>>>> > strace to monitor this)
>>>> > No. The log4j and zk txn logs share the same disks.
>>>> >
>>>> >>> Is that the log from the server that's got the 44sec max latency?
>>>> > Yes.
>>>> >
>>>> >>> This is 3.3.3 ?
>>>> > Yes.
>>>> >
>>>> >>> was there any instability in the quorum itself during this time
>>>> > period?
>>>> > How do I find that out ?
>>>>
>>>> The logs would indicate if an election happens. Look for "LOOKING" or
>>>> "LEADING" or "FOLLOWING".
>>>>
>>>>
>>>> Your comments are consistent with my theory. Seems like a bug in PRP
>>>> session validation to me.
>>>>
>>>> Patrick
>>>>
>>>
>>
>

Re: ephemeral node not deleted after client session closed

Reply via email to