[ 
https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736838#comment-16736838
 ] 

He Xiaoqiao commented on HDFS-14186:
------------------------------------

[~kihwal] Thanks for your comments.
{quote}I think nodes are already not marked "dead" in the startup safe mode
{quote}
As mentioned above, nodes are marked "dead" after the NameNode leaves safe mode.
{quote}Is your datanodes configured to break up the reports per storage and 
send one by one?
{quote}
I do not split block reports per storage, but I do not think that is the key point 
for this issue: when I trace the NameNode log, the processing time per block report 
is almost always less than 30ms, thanks to the optimization of the first block 
report. Besides, the number of blocks per DataNode for this namespace is less 
than 100K.
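(For reference, per-storage splitting is, to my understanding, controlled by the 
property sketched below; the value shown is the usual default, please check 
hdfs-default.xml for your release.)
{code:xml}
<!-- hdfs-site.xml (sketch): when the total block count on a DataNode is at or
     below this threshold, all storages are reported in a single message;
     above it, the DataNode sends one block report message per storage. -->
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value>
</property>
{code}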
{quote}how much was the GC overhead? Was it replaying edits during the starup?
{quote}
I did not find any GC or edit-log anomalies in the worst case I met: two CMS GC 
cycles lasting ~100s in total, and no full GC. The young GC count was normal and 
every stop-the-world pause was under 200ms. During startup, edit logs were replayed 
about every 2 minutes; the longest lock hold was about ~50s (>2000K txns), the rest 
under 10s.

In the worst case mentioned at the beginning, it took about 1 hour to load the 
fsimage, replay the edit logs, and process block reports; startup was then 
considered done and the NameNode automatically left safe mode, but the service RPC 
queue stayed FULL for ~7 hours afterwards (we use separate ports: 8040 for service 
RPC and 8020 for client RPC). During that time, ~1K different DataNodes were 
marked dead and had to re-register and send block reports again and again.
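For context, a minimal sketch (my own illustration, not the actual Hadoop source) 
of how the dead-marking window comes out to the 630s default mentioned in the 
description:
{code:java}
// Sketch: how the ~630s dead-marking window is derived from the default
// heartbeat settings (assuming the usual 2 * recheck + 10 * heartbeat rule).
public class HeartbeatExpirySketch {
  public static void main(String[] args) {
    long heartbeatIntervalSec = 3;                    // dfs.heartbeat.interval, default 3s
    long heartbeatRecheckIntervalMs = 5 * 60 * 1000;  // dfs.namenode.heartbeat.recheck-interval, default 5 min

    long heartbeatExpireIntervalMs =
        2 * heartbeatRecheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;

    // 2 * 300000 + 10 * 3000 = 630000 ms = 630s
    System.out.println(heartbeatExpireIntervalMs / 1000 + "s");
  }
}
{code}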
So I think leaving safe mode only after the majority of replicas have been 
reported, rather than after the majority of blocks, may resolve this case. FYI.
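For reference, here is a minimal hdfs-site.xml sketch (values are the usual 
defaults, shown only as an illustration) of the knobs that currently decide when 
the NameNode leaves startup safe mode; the idea above would effectively add a 
replica-based criterion on top of these:
{code:xml}
<!-- Fraction of blocks that must have at least the minimal number of reported
     replicas (dfs.namenode.replication.min) before safe mode can be left. -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>
<!-- Extra time to stay in safe mode after the threshold is first reached. -->
<property>
  <name>dfs.namenode.safemode.extension</name>
  <value>30000</value>
</property>
<!-- Minimum number of live DataNodes required before leaving safe mode. -->
<property>
  <name>dfs.namenode.safemode.min.datanodes</name>
  <value>0</value>
</property>
{code}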

Additional environment info: Hadoop 2.7.1, HA using QJM.

> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>
> In the current implementation, a datanode sends its block report immediately 
> after it registers with the namenode on restart, and the resulting block report 
> storm puts the namenode under heavy load while processing them. One consequence 
> is that some received RPCs have to be dropped because their queue time exceeds 
> the timeout. If a datanode's heartbeat RPCs keep being dropped for long enough 
> (the default heartbeatExpireInterval is 630s), the node is marked DEAD; the 
> datanode then has to re-register and send its block report again, which 
> aggravates the block report storm and traps the cluster in a vicious circle, 
> seriously slowing down namenode startup (to more than one hour, or even longer), 
> especially in a large (several thousand datanodes) and busy cluster. Although 
> there has been much work to optimize namenode startup, the issue still exists.
> I propose to postpone the dead datanode check until the namenode has finished 
> startup.
> Any comments and suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
