[
https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dave Latham updated ZOOKEEPER-1731:
-----------------------------------
Attachment: ZOOKEEPER-1731.patch
Here's a simple patch that changes ServerCnxnFactory.connectionBeans from a plain
HashMap to a ConcurrentHashMap. [~fournc] pointed out that the same fix was
already made in 3.5 as part of ZOOKEEPER-1505.
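For illustration, a minimal sketch of the shape of the change (the class and type
names below are stand-ins, not the actual ZooKeeper sources; the real field maps
each ServerCnxn to its registered ConnectionBean):
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-ins for the real ZooKeeper types, just to keep the sketch compilable.
class Cnxn {}
class CnxnBean {}

class CnxnRegistry {
    // Previously a plain java.util.HashMap, which is unsafe when
    // registerConnection() and unregisterConnection() run on different
    // threads. ConcurrentHashMap makes both operations thread-safe without
    // adding any locking around the callers.
    private final ConcurrentMap<Cnxn, CnxnBean> connectionBeans =
            new ConcurrentHashMap<Cnxn, CnxnBean>();

    void registerConnection(Cnxn cnxn) {
        connectionBeans.put(cnxn, new CnxnBean());
    }

    void unregisterConnection(Cnxn cnxn) {
        connectionBeans.remove(cnxn);
    }
}
{noformat}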
> Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
> ------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1731
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1731
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Dave Latham
> Priority: Critical
> Fix For: 3.5.0, 3.4.6
>
> Attachments: ZOOKEEPER-1731.patch
>
>
> We had a cluster of three peers (running 3.4.3) fail after we briefly took
> one peer down for maintenance. A second peer became unresponsive and the
> leader lost quorum. Thread dumps on the second peer showed two threads
> consistently stuck in these states:
> {noformat}
> "QuorumPeer[myid=0]/0.0.0.0:2181" prio=10 tid=0x00002aaab8d20800 nid=0x598a
> runnable [0x000000004335d000]
> java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.put(HashMap.java:405)
> at
> org.apache.zookeeper.server.ServerCnxnFactory.registerConnection(ServerCnxnFactory.java:131)
> at
> org.apache.zookeeper.server.ZooKeeperServer.finishSessionInit(ZooKeeperServer.java:572)
> at
> org.apache.zookeeper.server.quorum.Learner.revalidate(Learner.java:444)
> at
> org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:133)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:86)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> "NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181" daemon prio=10
> tid=0x00002aaab84b0800 nid=0x5986 runnable [0x0000000040878000]
> java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.removeEntryForKey(HashMap.java:614)
> at java.util.HashMap.remove(HashMap.java:581)
> at
> org.apache.zookeeper.server.ServerCnxnFactory.unregisterConnection(ServerCnxnFactory.java:120)
> at
> org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:971)
> - locked <0x000000078d8a51f0> (a java.util.HashSet)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.closeSessionWithoutWakeup(NIOServerCnxnFactory.java:307)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.closeSession(NIOServerCnxnFactory.java:294)
> - locked <0x000000078d82c750> (a
> org.apache.zookeeper.server.NIOServerCnxnFactory)
> at
> org.apache.zookeeper.server.ZooKeeperServer.processConnectRequest(ZooKeeperServer.java:834)
> at
> org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:410)
> at
> org.apache.zookeeper.server.NIOServerCnxn.readPayload(NIOServerCnxn.java:200)
> at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:236)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}
> It shows both threads concurrently modifying ServerCnxnFactory.connectionBeans,
> which is a plain java.util.HashMap. This cluster was serving thousands of
> clients, which seems to make the condition more likely: it appears to occur
> when one client connects and another disconnects at about the same time.
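> For a standalone illustration of the failure mode (a demo, not ZooKeeper code;
> the class name HashMapRaceDemo is made up, and the hang is timing-dependent
> rather than guaranteed), two threads mutating an unsynchronized HashMap can
> corrupt its internal bucket list so that put()/remove() never returns even
> though the thread stays RUNNABLE:
> {noformat}
> import java.util.HashMap;
> import java.util.Map;
>
> // Two threads mutating a plain HashMap with no synchronization, mirroring
> // the register/unregister calls above. The outcome is timing-dependent: it
> // may finish normally, throw, or (on the JDK 6/7 HashMap) corrupt a bucket's
> // linked list so that put()/remove() loops forever while the thread stays
> // RUNNABLE, as in the dumps above.
> public class HashMapRaceDemo {
>     public static void main(String[] args) throws InterruptedException {
>         final Map<Integer, Integer> map = new HashMap<Integer, Integer>();
>
>         Thread register = new Thread(new Runnable() {
>             public void run() {
>                 for (int i = 0; i < 1000000; i++) {
>                     map.put(i, i);      // like registerConnection()
>                 }
>             }
>         });
>         Thread unregister = new Thread(new Runnable() {
>             public void run() {
>                 for (int i = 0; i < 1000000; i++) {
>                     map.remove(i);      // like unregisterConnection()
>                 }
>             }
>         });
>
>         register.start();
>         unregister.start();
>         register.join();
>         unregister.join();
>         System.out.println("finished, size=" + map.size());
>     }
> }
> {noformat}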