Arpit Agarwal created HDFS-9305:
-----------------------------------
Summary: Delayed heartbeat processing causes storm of subsequent heartbeats
Key: HDFS-9305
URL: https://issues.apache.org/jira/browse/HDFS-9305
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.7.1
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
A DataNode typically sends a heartbeat to the NameNode every 3 seconds. We
expect heartbeat handling to complete relatively quickly. However, if
something unexpected blocks heartbeat processing, such as a long GC pause or
heavy lock contention within the NameNode, then heartbeat processing is
delayed. After recovering from this delay, the DataNode starts
sending a storm of heartbeat messages in a tight loop. In a large cluster with
many DataNodes, this storm of heartbeat messages could cause harmful load on
the NameNode and make overall cluster recovery more difficult.
The bug appears to be caused by incorrect timekeeping inside
{{BPServiceActor}}. The next heartbeat time is always calculated as a delta
from the previous heartbeat time, without any compensation for possible long
latency on an individual heartbeat RPC. The only mitigations would be
restarting all DataNodes to force a reset of the heartbeat schedule, or simply
waiting out the storm until the scheduling catches up and corrects itself.
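
The scheduling problem described above can be illustrated with a small
simulation. This is only a sketch under stated assumptions, not the actual
{{BPServiceActor}} code: the class name, method name, and the
{{rescheduleFromNow}} flag are hypothetical, the interval is the typical 3
seconds, and every heartbeat after the stalled one is assumed to return
instantly. It counts how many heartbeats go out back-to-back after a single
stalled RPC, first when the next heartbeat is scheduled as a delta from the
previous scheduled time, then when it is scheduled from the current time.

```java
public class HeartbeatScheduleSketch {
    // Typical heartbeat interval mentioned in the report (3 seconds).
    static final long INTERVAL_MS = 3_000;

    /**
     * Counts heartbeats that are sent with no wait after one stalled RPC.
     * Hypothetical model: a simulated clock, not the real DataNode loop.
     */
    static int immediateHeartbeatsAfterStall(long stallMs, boolean rescheduleFromNow) {
        long now = 0;        // simulated clock in milliseconds
        long scheduled = 0;  // time at which the first heartbeat was due
        now += stallMs;      // the first heartbeat RPC blocks for stallMs
        int immediate = 0;
        while (true) {
            // Either add the interval to the previous scheduled time (the
            // behavior described in the report) or start from the current time.
            scheduled = rescheduleFromNow ? now + INTERVAL_MS
                                          : scheduled + INTERVAL_MS;
            if (scheduled > now) {
                return immediate; // next heartbeat is in the future: no more storm
            }
            immediate++; // already overdue, so it is sent right away (assumed instant)
        }
    }

    public static void main(String[] args) {
        // A 30-second stall with scheduling relative to the previous time.
        System.out.println(immediateHeartbeatsAfterStall(30_000, false)); // prints 10
        // The same stall with scheduling relative to the current time.
        System.out.println(immediateHeartbeatsAfterStall(30_000, true));  // prints 0
    }
}
```

In this model the size of the storm grows linearly with the length of the
stall, and it is multiplied by the number of DataNodes in the cluster.
Scheduling from the current time is shown only as one possible compensation;
the report does not prescribe a specific fix.
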
This problem would not manifest after a NameNode restart. In that case, the
NameNode would respond to the first heartbeat by telling the DataNode to
re-register, and {{BPServiceActor#reRegister}} would reset the heartbeat
schedule to the current time. I believe the problem would only manifest if the
NameNode process stayed alive but processed heartbeats unexpectedly slowly.