Donny Nadolny created ZOOKEEPER-2201:
----------------------------------------
Summary: Network issues can cause cluster to hang due to near-deadlock
Key: ZOOKEEPER-2201
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2201
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.4.6
Reporter: Donny Nadolny
Priority: Critical
{{DataTree.serializeNode}} synchronizes on the {{DataNode}} it is about to
serialize, then writes it out via {{OutputArchive.writeRecord}}, potentially to
a network connection. Under default Linux TCP settings, a write to a connection
whose peer has completely disappeared will hang (blocked in the
{{java.net.SocketOutputStream.socketWrite0}} call) for over 15 minutes. During
this time, any attempt to create, delete, or modify that {{DataNode}} will cause
the leader to hang at the beginning of the request processor chain:
{noformat}
"ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c waiting for monitor entry [0x00007fe6c2a8c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163)
    - waiting to lock <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
    - locked <0x00000000d2ef81d0> (a java.util.ArrayList)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
{noformat}
Additionally, any attempt to send a snapshot to a follower or to disk will hang.
Because the ping packets are sent by another thread, which is unaffected,
followers never time out and never elect a new leader, even though the cluster
will make no progress until either the leader is killed or the TCP connection
times out. This isn't exactly a deadlock, since it will resolve itself
eventually, but as mentioned above this takes more than 15 minutes with the
default TCP retry settings in Linux.
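The blocking pattern described above can be reproduced in isolation. The following is a minimal sketch, not ZooKeeper code: the class {{MonitorHeldDuringWriteDemo}}, the plain {{Object}} standing in for a {{DataNode}}, and the latch standing in for a stalled {{socketWrite0}} call are all assumptions made for illustration. It shows that a second thread trying to lock the same object ends up in the {{BLOCKED}} state, as in the stack trace above, for as long as the first thread stays inside its simulated write.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustration only: none of these names exist in ZooKeeper.
final class MonitorHeldDuringWriteDemo {

    // Returns the state observed for a thread that tries to lock `node`
    // while another thread holds the monitor inside a simulated stalled write.
    static Thread.State observeContender() throws InterruptedException {
        final Object node = new Object();  // stand-in for a DataNode
        final CountDownLatch writeStarted = new CountDownLatch(1);
        final CountDownLatch peerRecovers = new CountDownLatch(1);

        Thread serializer = new Thread(() -> {
            synchronized (node) {
                writeStarted.countDown();
                try {
                    // Stand-in for a socket write to a peer that has vanished.
                    peerRecovers.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "serializer");

        Thread processor = new Thread(() -> {
            synchronized (node) {
                // A create/delete/modify of the node would happen here.
            }
        }, "processor");

        serializer.start();
        writeStarted.await();   // the monitor is now held by the serializer
        processor.start();

        // Poll for up to five seconds until the contender is parked on the monitor.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (processor.getState() != Thread.State.BLOCKED
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = processor.getState();

        peerRecovers.countDown();  // simulate the TCP timeout finally firing
        serializer.join();
        processor.join();
        return observed;
    }
}
```

In the real server the wait is bounded only by the kernel's TCP retransmission timeout rather than by a latch, which is why the stall lasts so long.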
A simple solution: in {{DataTree.serializeNode}}, take a copy of the contents
of the {{DataNode}} (as is already done with its children) inside the
synchronized block, then call {{writeRecord}} with that copy after leaving the
synchronized block, so the original {{DataNode}} is not locked during the write.
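A rough sketch of the copy-then-write idea follows. It is not a patch against ZooKeeper: {{SketchNode}} is a hypothetical stand-in for {{DataNode}} with only two fields, {{CopyThenWriteSketch.serializeNode}} is an illustrative method rather than the real {{DataTree.serializeNode}}, and a plain {{OutputStream}} takes the place of {{OutputArchive.writeRecord}}. The point is only that the monitor is held for the in-memory copy and released before any I/O begins.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for ZooKeeper's DataNode; the real class has more fields.
final class SketchNode {
    byte[] data;    // guarded by the node's own monitor
    long version;   // guarded by the node's own monitor

    SketchNode(byte[] data, long version) {
        this.data = data.clone();
        this.version = version;
    }
}

// Illustration only: not the real DataTree.
final class CopyThenWriteSketch {

    static void serializeNode(SketchNode node, OutputStream out) throws IOException {
        SketchNode copy;
        synchronized (node) {
            // Cheap in-memory copy; no I/O happens while the monitor is held.
            copy = new SketchNode(node.data, node.version);
        }
        // Potentially slow write, performed on the private copy after the
        // monitor on the original node has been released.
        write(copy, out);
    }

    private static void write(SketchNode copy, OutputStream out) throws IOException {
        out.write(Long.toString(copy.version).getBytes(StandardCharsets.UTF_8));
        out.write(':');
        out.write(copy.data);
    }
}
```

With this shape, a stalled write delays only the thread doing the serialization; threads that need the node's monitor wait at most for the duration of the copy. The cost is one extra allocation per serialized node.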
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)