He Xiaoqiao created HDFS-14186:
----------------------------------
Summary: blockreport storm slow down namenode restart seriously in
large cluster
Key: HDFS-14186
URL: https://issues.apache.org/jira/browse/HDFS-14186
Project: Hadoop HDFS
Issue Type: Improvement
Components: namenode
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
In the current implementation, a datanode sends its blockreport immediately
after it registers with the namenode successfully on restart, and the resulting
blockreport storm puts the namenode under heavy load. One consequence is that
some received RPCs are skipped because their time in the queue exceeds the
timeout. If a datanode's heartbeat RPCs are skipped continually for long enough
(heartbeatExpireInterval, 630s by default), the datanode is marked DEAD. It
then has to re-register and send its blockreport again, which aggravates the
blockreport storm and creates a vicious circle. This seriously slows down
namenode startup (to an hour or more), especially in a large (several thousand
datanodes) and busy cluster. Although a lot of work has gone into optimizing
namenode startup, the issue still exists.
I propose to postpone the dead datanode check until the namenode has finished
startup.
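
To make the numbers and the proposal concrete, here is a minimal sketch, not
the actual patch. The 630s default follows from the HDFS defaults
dfs.namenode.heartbeat.recheck-interval=300000 (ms) and dfs.heartbeat.interval=3
(s). The class name DeadNodeCheckSketch, the method shouldMarkDead, and the
startupFinished flag are hypothetical names used only for illustration; what
exactly counts as "finished startup" is left open here.

```java
// Illustrative sketch only; names below are assumptions, not HDFS code.
final class DeadNodeCheckSketch {

    // Expiry window: 2 * recheck interval (ms) + 10 heartbeat intervals (s -> ms).
    static long heartbeatExpireIntervalMs(long recheckIntervalMs,
                                          long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    }

    // Hypothetical dead check that is postponed while the namenode is starting.
    // While startupFinished is false, no datanode is declared dead, so skipped
    // heartbeat RPCs cannot trigger re-registration and another blockreport.
    static boolean shouldMarkDead(long nowMs, long lastHeartbeatMs,
                                  long expireIntervalMs, boolean startupFinished) {
        if (!startupFinished) {
            return false;
        }
        return nowMs - lastHeartbeatMs > expireIntervalMs;
    }

    public static void main(String[] args) {
        long expireMs = heartbeatExpireIntervalMs(300_000L, 3L);
        System.out.println("heartbeatExpireInterval (ms) = " + expireMs);

        long lastHeartbeatMs = 0L;
        long nowMs = 700_000L; // 700s without a processed heartbeat
        System.out.println("during startup: "
            + shouldMarkDead(nowMs, lastHeartbeatMs, expireMs, false));
        System.out.println("after startup:  "
            + shouldMarkDead(nowMs, lastHeartbeatMs, expireMs, true));
    }
}
```

With the check gated this way, a datanode whose heartbeats were skipped during
the storm is evaluated only after startup completes, instead of being marked
DEAD in the middle of it.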
Any comments and suggestions are welcome.