Donny Nadolny created ZOOKEEPER-2201:
----------------------------------------

             Summary: Network issues can cause cluster to hang due to 
near-deadlock
                 Key: ZOOKEEPER-2201
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2201
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.4.6
            Reporter: Donny Nadolny
            Priority: Critical


{{DataTree.serializeNode}} synchronizes on the {{DataNode}} it is about to 
serialize then writes it out via {{OutputArchive.writeRecord}}, potentially to 
a network connection. Under default linux TCP settings, a network connection 
where the other side completely disappears will hang (blocking on the 
{{java.net.SocketOutputStream.socketWrite0}} call) for over 15 minutes. During 
this time, any attempt to create/delete/modify the {{DataNode}} will cause the 
leader to hang at the beginning of the request processor chain:

{noformat}
"ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c 
waiting for monitor entry [0x00007fe6c2a8c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163)
        - waiting to lock <0x00000000d4cd9e28> (a 
org.apache.zookeeper.server.DataNode)
        - locked <0x00000000d2ef81d0> (a java.util.ArrayList)
        at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345)
        at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534)
        at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
{noformat}

Additionally, any attempt to send a snapshot to a follower or to disk will hang.

Because the ping packets are sent by another thread which is unaffected, 
followers never time out and become leader, even though the cluster will make 
no progress until either the leader is killed or the TCP connection times out. 
This isn't exactly a deadlock since it will resolve itself eventually, but as 
mentioned above this will take > 15 minutes with the default TCP retry settings 
in linux.

A simple solution to this is: in {{DataTree.serializeNode}} we can take a copy 
of the contents of the {{DataNode}} (as is done with its children) in the 
synchronized block, then call {{writeRecord}} with the copy of the {{DataNode}} 
outside of the original {{DataNode}} synchronized block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to