To close the loop on this...

I opened JIRA https://issues.apache.org/jira/browse/ZOOKEEPER-2151.  Then
Raul pointed out that it was probably
https://issues.apache.org/jira/browse/ZOOKEEPER-1863.  We were running
without that fix, and all the symptoms match, so I've resolved 2151 as a
duplicate.

Thanks for the help!

On Tue, Mar 24, 2015 at 9:07 PM, Jared Cantwell (JIRA) <[email protected]>
wrote:

>
>     [
> https://issues.apache.org/jira/browse/ZOOKEEPER-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379204#comment-14379204
> ]
>
> Jared Cantwell commented on ZOOKEEPER-2151:
> -------------------------------------------
>
> Wow, that looks like this issue.  Thanks so much for pointing it out.  We
> aren't running with that patch.  I will dig into the heap dump tomorrow and
> verify the symptoms and resolve this tomorrow if things match up.
>
> > FollowerZookeeperServer has thousands of outstanding requests stuck in
> CommitProcessor
> >
> --------------------------------------------------------------------------------------
> >
> >                 Key: ZOOKEEPER-2151
> >                 URL:
> https://issues.apache.org/jira/browse/ZOOKEEPER-2151
> >             Project: ZooKeeper
> >          Issue Type: Bug
> >          Components: server
> >    Affects Versions: 3.5.0
> >         Environment: Ubuntu 12.04
> >            Reporter: Jared Cantwell
> >
> > We are seeing one follower server in our quorum stuck with thousands of
> outstanding requests:
> > ---------------------------------------------
> > node04:~$ telnet 10.10.10.6 2181
> > Trying 10.10.10.6...
> > Connected to 10.10.10.6.
> > Escape character is '^]'.
> > *stat*
> > Zookeeper version: 3.5.0-1547702, built on 05/15/2014 03:06 GMT
> > Clients:
> >  /10.10.10.6:60646[0](queued=0,recved=1,sent=0)
> >  /10.10.10.6:60648[0](queued=0,recved=1,sent=0)
> >  /10.10.10.6:41786[0](queued=1,recved=3,sent=1)
> > Latency min/avg/max: 0/0/1887
> > Received: 3064156900
> > Sent: 3064134581
> > Connections: 3
> > *Outstanding: 24395*
> > Zxid: 0x11050f7e4b
> > Mode: follower
> > Node count: 6969
> > Connection closed by foreign host.
> > ---------------------------------------------
> > When this happens, our C client is able to establish an initial
> connection to the server, but any request then times out.  It
> re-establishes a connection, then times out, rinse, repeat.  We are
> noticing this because we set up this particular client to connect directly
> to only one server in the quorum, so any problem with that server will be
> noticed.  Our other clients are just connecting to the next server in the
> list, which is why only this client notices a problem.
> > We were able to capture a heap dump in one instance.  This is what we
> observed:
> > - FollowerZookeeperServer.requestsInProcess has count ~24K
> > - CommitProcessor.queuedRequests list has the 24K items in it, so the
> FinalRequestProcessor's processRequest function isn't ever getting called
> to complete the requests.
> > - CommitProcessor.isWaitingForCommit()==true
> > - CommitProcessor.committedRequests.isEmpty()==true
> > - CommitProcessor.nextPending is a create request
> > - CommitProcessor.currentlyCommitting is null
> > - CommitProcessor.numRequestsProcessing is 0
> > - FollowerZookeeperServer, which should be calling commit() on the
> CommitProcessor, has no elements in its pendingTxns list, which indicates
> that it thinks it has already passed a COMMIT message to the
> CommitProcessor for every request that is stuck in the queuedRequests list
> and nextPending member of CommitProcessor.
> > The CommitProcessor's run() is doing this:
> > {quote}
> > Thread 23510: (state = BLOCKED)
> >    java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> be imprecise)
> >    org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
> line=182 (Compiled frame)
> > {quote}
> > When we attached via gdb to get the dump, sockets closed, which caused a
> new round of leader election.  When this happened, the issue corrected
> itself since the whole FollowerZookeeperServer got restarted.
> > I've confirmed that no system clock changes happened before things got
> stuck, 2 days before we noticed it.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

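For anyone who wants to watch for the symptom in the quoted report (a
follower whose "Outstanding" count keeps growing), here is a minimal sketch
that sends the same "stat" command used in the telnet session and pulls out
the Outstanding value.  The function names, the 5-second timeout, and the
idea of polling are assumptions for illustration, not anything from the
ZooKeeper code base, and newer releases may require the command to be
whitelisted on the server.

```python
import re
import socket
from typing import Optional


def four_letter_word(host: str, port: int, cmd: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a four-letter command and return the server's full reply.

    The server closes the connection after answering (as in the telnet
    session above), so we simply read until EOF.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_outstanding(stat_output: str) -> Optional[int]:
    """Return the number on the 'Outstanding:' line, or None if absent."""
    match = re.search(r"^Outstanding:\s*(\d+)", stat_output, re.MULTILINE)
    return int(match.group(1)) if match else None
```

Calling parse_outstanding(four_letter_word("10.10.10.6", 2181)) against the
follower in the report would have returned the 24395 figure shown in the
stat output; a value that stays high across several polls is the thing to
alert on.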