[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235197#comment-13235197
 ] 

Jeremy Stribling commented on ZOOKEEPER-1375:
---------------------------------------------

I'll preface this by saying that I know nothing about this code, and probably 
have it all wrong.

Here's what I saw in production at a customer using ZK 3.3.3.  Because we are
terrible people, we embed ZooKeeper into a JVM running a bunch of other code.
One server in a three-node cluster used up all of its memory (due to non-ZK
code).  The JVM did not crash; it just stayed up, still listening on its
ZooKeeper ports.  The server kept sitting there spawning new threads to handle
incoming connections, but it never closed those connections.  Incoming client
connections (from C clients running elsewhere) timed out, and so they would try
to re-connect to other servers -- that's fine.  The problem, from what I could
tell, was that incoming connections from the other ZK servers just seemed to
freeze.  The two remaining ZooKeeper servers in the cluster were unable to make
any progress because they were trying to connect to the third server, which
never responded to them.  So even though there were two servers that were
perfectly fine, they became stuck too and could never form a new majority, and
so all of ZooKeeper was down until we manually restarted the server that had
run out of memory.

Clearly, there are things we need to do in our architecture to avoid this 
problem, but I was hoping to apply a quick fix to ZK in the meantime to help 
out -- the most important thing would be to properly close the connection if 
you're not able to service it.  It's also possible the problem I describe above
isn't actually in ClientCnxn.java, but in some other server-side connection
file (NIOServerCnxn?).
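
To make the "close the connection if you can't service it" idea concrete, here
is a minimal sketch in plain Java.  It is not ZooKeeper code: GuardedHandoff,
handOff and closeQuietly are hypothetical names, and the Executor just stands
in for whatever actually starts the per-connection work.  It also assumes the
JVM still has enough headroom to run the catch block, which is not guaranteed
once memory is exhausted.

```java
import java.io.IOException;
import java.net.Socket;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

/** Hypothetical helper for illustration only; not part of ZooKeeper. */
final class GuardedHandoff {
    private GuardedHandoff() {}

    /**
     * Tries to hand an accepted socket to a worker. If the hand-off fails for
     * any reason (including an Error such as OutOfMemoryError), the socket is
     * closed so the peer sees the connection go away instead of hanging.
     *
     * @return true if the worker took the socket, false if it was closed here
     */
    static boolean handOff(Socket socket, Executor workers, Consumer<Socket> handler) {
        try {
            workers.execute(() -> handler.accept(socket));
            return true;
        } catch (Throwable t) {
            // Could not start servicing the connection: release it right away.
            closeQuietly(socket);
            return false;
        }
    }

    /** Best-effort close; a failure to close is deliberately ignored. */
    static void closeQuietly(Socket socket) {
        try {
            socket.close();
        } catch (IOException | RuntimeException ignored) {
            // Nothing useful can be done here.
        }
    }
}
```

The point is only that a failed hand-off ends with an explicit close(), so the
remote side can give up on this server quickly rather than wait for a timeout.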
                
> SendThread is exiting after OOMError
> ------------------------------------
>
>                 Key: ZOOKEEPER-1375
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1375
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.0
>            Reporter: Rakesh R
>
> After reviewing the ClientCnxn code, there is still a chance of the
> SendThread exiting without notifying users. Say the client throws an
> OOMError and enters the Throwable catch block. There, while sending the
> Disconnected event, it creates a "new WatchedEvent()" object. This can throw
> another OOMError and cause the SendThread to exit without any Disconnected
> event notification.
> {noformat}
> try {
>     //...
> } catch (Throwable e) {
>     //..
>     cleanup();
>     if (state.isAlive()) {
>         // Allocating the event here can itself throw OutOfMemoryError.
>         eventThread.queueEvent(new WatchedEvent(
>                 Event.EventType.None,
>                 Event.KeeperState.Disconnected, null));
>     }
>     //....
> }
> {noformat}
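
One way to avoid allocating inside the catch block is to create the
Disconnected event once, up front, and only enqueue the existing reference
when something goes wrong.  The sketch below illustrates that idea with
stand-in types; PreallocatedDisconnectSketch, Event, runOnce and poll are
hypothetical names, and this is not how ClientCnxn actually queues events.  The
array-backed queue is used on the assumption that offer() normally just stores
a reference in its preallocated array, which makes a second OutOfMemoryError
less likely but does not rule it out.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration; the types here are stand-ins, not ZooKeeper classes. */
final class PreallocatedDisconnectSketch {

    /** Stand-in for an event object such as WatchedEvent. */
    static final class Event {
        final String state;

        Event(String state) {
            this.state = state;
        }
    }

    // Allocated once at class initialization, so the error path needs no "new".
    private static final Event DISCONNECTED = new Event("Disconnected");

    // Fixed-capacity queue whose backing array is allocated at construction time.
    private final BlockingQueue<Event> events = new ArrayBlockingQueue<>(16);

    /** Runs one unit of work; on any Throwable, queues the preallocated event. */
    void runOnce(Runnable work) {
        try {
            work.run();
        } catch (Throwable t) {
            // No allocation of a new event here; offer() returns false if the queue is full.
            events.offer(DISCONNECTED);
        }
    }

    /** Returns the next queued event, or null if there is none. */
    Event poll() {
        return events.poll();
    }
}
```

With this shape, a simulated failure such as
runOnce(() -> { throw new OutOfMemoryError(); }) leaves exactly one
Disconnected event in the queue, and a successful run leaves none.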


