[ https://issues.apache.org/jira/browse/ZOOKEEPER-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235197#comment-13235197 ]
Jeremy Stribling commented on ZOOKEEPER-1375:
---------------------------------------------
I'll preface this by saying that I know nothing about this code, and probably
have it all wrong.

Here's what I saw in production at a customer using ZK 3.3.3. Because we are
terrible people, we embed ZooKeeper into a JVM running a bunch of other code.
One server in a three-node cluster used up all of its memory (due to non-ZK
code). The JVM did not crash; it stayed up, still listening on its ZooKeeper
ports. The server kept spawning new threads to handle incoming connections,
but it never closed those connections. Incoming client connections (from C
clients running elsewhere) timed out, so the clients tried to reconnect to
other servers, which is fine. The problem, from what I could tell, was that
incoming server connections just seemed to freeze. The two remaining ZooKeeper
servers in the cluster were unable to make any progress because they were
trying to connect to the third server, which never responded to them. So even
though two servers were perfectly fine, they became stuck too and could never
form a new majority, and all of ZooKeeper was down until we manually restarted
the server that had run out of memory.

Clearly, there are things we need to do in our architecture to avoid this
problem, but I was hoping to apply a quick fix to ZK in the meantime to help
out. The most important thing would be to properly close the connection if
you're not able to service it. It's also possible that the problem I describe
above isn't actually in ClientCnxn.java, but in some other server connection
file (NIOServerCnxn?).
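
To make the "close the connection if you're not able to service it" idea
concrete, here is a minimal Java sketch. It is not ZooKeeper code: the class
CloseOnFailureSketch, the Handler interface, and the serviceOrClose method are
hypothetical names invented for illustration, and the out-of-memory condition
is simulated by throwing an OutOfMemoryError from the handler. The point is
only that a failed handler should leave the peer with a closed socket rather
than one that stays open and silent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class CloseOnFailureSketch {
    /** Hypothetical per-connection handler; not a ZooKeeper interface. */
    interface Handler {
        void handle(Socket s) throws Exception;
    }

    /**
     * Runs the handler. If it fails with any Throwable (including an Error
     * such as OutOfMemoryError), closes the socket and returns false.
     */
    static boolean serviceOrClose(Socket s, Handler h) {
        try {
            h.handle(s);
            return true;
        } catch (Throwable t) {
            try {
                s.close(); // the peer sees end-of-stream instead of hanging
            } catch (IOException ignored) {
                // nothing more can be done for this connection
            }
            return false;
        }
    }

    /** Loopback demonstration with a handler that simulates an OOM failure. */
    static boolean closesOnFailure() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(2000); // do not block forever in the demo
            Socket accepted = server.accept();
            boolean ok = serviceOrClose(accepted, s -> {
                throw new OutOfMemoryError("simulated");
            });
            // read() returning -1 means the client observed the close
            return !ok && accepted.isClosed() && client.getInputStream().read() == -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(closesOnFailure());
    }
}
```

Whether a real server should swallow the error after closing, log it, or
rethrow it is a separate decision that this sketch does not try to settle.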
> SendThread is exiting after OOMError
> ------------------------------------
>
> Key: ZOOKEEPER-1375
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1375
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Rakesh R
>
> After reviewing the ClientCnxn code, there is still a chance that the
> SendThread exits without notifying users. Suppose the client throws an
> OOMError and enters the catch (Throwable) block. While sending the
> Disconnected event, that block creates a "new WatchedEvent()" object. This
> allocation can itself throw an OOMError, causing the SendThread to exit
> without delivering any Disconnected event notification.
> {noformat}
> try {
>     //...
> } catch (Throwable e) {
>     //..
>     cleanup();
>     if (state.isAlive()) {
>         // Allocating here can itself fail with an OOMError
>         eventThread.queueEvent(
>             new WatchedEvent(Event.EventType.None,
>                 Event.KeeperState.Disconnected, null));
>     }
>     //....
> }
> {noformat}
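
One possible way to avoid the allocation described above is to create the
Disconnected event once, up front, and reuse it in the catch block. The Java
sketch below illustrates that idea under stated assumptions; it is not the
ZooKeeper implementation or a proposed patch. Event, SendLoop, and
DISCONNECTED are hypothetical stand-ins for WatchedEvent, the SendThread loop,
and a preallocated event, and the OOMError is simulated. It uses an
ArrayBlockingQueue because its backing array is allocated at construction
time; whether the real eventThread.queueEvent() allocates internally is not
addressed here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PreallocatedDisconnectSketch {
    /** Hypothetical stand-in for ZooKeeper's WatchedEvent; not the real class. */
    static final class Event {
        final String type;
        final String state;

        Event(String type, String state) {
            this.type = type;
            this.state = state;
        }
    }

    // Allocated once at class-load time, while memory is still available.
    static final Event DISCONNECTED = new Event("None", "Disconnected");

    /** Hypothetical send loop; `work` plays the role of the try body. */
    static final class SendLoop implements Runnable {
        private final BlockingQueue<Event> events;
        private final Runnable work;

        SendLoop(BlockingQueue<Event> events, Runnable work) {
            this.events = events;
            this.work = work;
        }

        @Override
        public void run() {
            try {
                work.run();
            } catch (Throwable t) {
                // No `new` in this block: reuse the preallocated event.
                events.offer(DISCONNECTED);
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Event> queue = new ArrayBlockingQueue<>(16);
        Runnable failing = () -> {
            throw new OutOfMemoryError("simulated");
        };
        new SendLoop(queue, failing).run();
        System.out.println(queue.peek() == DISCONNECTED);
    }
}
```

This reduces the allocations on the failure path rather than guaranteeing
delivery: a JVM that is truly out of memory can still fail elsewhere, for
example when the event is later dispatched to watchers.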