[ https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742738#comment-16742738 ]

He Xiaoqiao commented on HDFS-14186:
------------------------------------

Thanks for the further discussion. I would like to answer some of the doubts 
raised above.
 To [~kihwal],
{quote}one thing to note is that the rpc processing time can be misleading in 
this case.
{quote}
From a sample NameNode log entry like the following:
{quote}2019-01-14 22:32:35,383 INFO BlockStateChange: BLOCK* processReport: 
from storage DS-dd5c0397-3fcd-43fb-a71b-1eef6a2307f1 node 
DatanodeRegistration(datanodeip:50010, datanodeUuid=$datanodeuuid, 
infoPort=50075, infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=$clusterud;nsid=$nsid;c=0), blocks: 11847, 
hasStaleStorage: true, processing time: 15 msecs
{quote}
The processing time here is taken from the NameNode log rather than from the 
RPC processing-time metrics, and I believe it is accurate.
{quote}In 2.7 days, we ended up configuring datanodes breaking up block reports 
unconditionally and that helped NN startup performance.
{quote}
Each datanode holds 30K~40K blocks, processing one block report takes less than 
60ms on average, and the cluster has over 15K slaves in total. I do not split 
block reports per storage since the number of blocks per datanode is not large 
enough; as mentioned above, with an average of 30K~40K blocks per datanode 
there is no need to split.
 The configuration item 'dfs.blockreport.split.threshold' appears to work well 
on 2.7.1 based on tracing the code; please correct me if I am missing something.
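As a side note, here is a rough sketch of the datanode-side split decision as I 
read it in the 2.7 code (the class, names and default below are illustrative, 
not copied from the Hadoop source):
{code:java}
// Illustrative sketch only, not the actual BPServiceActor code.
public class BlockReportSplitSketch {
  // Assumed default of dfs.blockreport.split.threshold in 2.7.x: 1,000,000 blocks.
  static final long SPLIT_THRESHOLD = 1_000_000L;

  /** Number of block-report RPCs a datanode sends in one reporting cycle. */
  static int reportRpcCount(long totalBlocks, int storages) {
    // Below the threshold, all storage reports ride in a single RPC;
    // otherwise one RPC is sent per storage (threshold=0 forces splitting).
    return totalBlocks < SPLIT_THRESHOLD ? 1 : storages;
  }

  public static void main(String[] args) {
    System.out.println(reportRpcCount(40_000, 12));    // 1  -> our case, 30K~40K blocks
    System.out.println(reportRpcCount(2_000_000, 12)); // 12 -> one RPC per storage
  }
}
{code}
With 30K~40K blocks per datanode the threshold is never reached, which matches 
what I observe when tracing the code.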
{quote}we can have NN check whether all storage reports are received from all 
registered nodes.
{quote}
It is a good suggestion. However, it is hard to collect all storages of the 
whole cluster at NameNode startup. If we rely only on registered nodes, this 
issue may not be resolved completely, since some not-yet-registered datanodes 
will continue to register and report, and the load on the NameNode will not be 
released.

To [~elgoiri],
{quote}This is caused by namenode getting overwhelmed. Besides, the lifeline 
rpc will use the same service rpc port whose queue is constantly overrun in 
this case. For the lifeline server, one can set a different port so it should 
have a different RPC queue altogether, right?
{quote}
Thanks to [~kihwal] for the detailed explanation. On the other hand, the 
lifeline has little effect during NameNode startup: because of the NameNode's 
global lock, and because processing a block report holds the write lock, all 
register/report RPCs have to queue and be processed one by one. In short, the 
NameNode has no spare time or resources to process the block report storm even 
if the RPCs can be enqueued.
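To illustrate that point with a toy model (names and structure below are made 
up for illustration, not taken from the NameNode source): even if lifeline RPCs 
go to a separate port and queue, every handler that needs the namesystem write 
lock still serializes behind block report processing.
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model of a single global namesystem lock; not Hadoop code.
public class NamesystemLockSketch {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

  void processBlockReport(Runnable report) {
    fsLock.writeLock().lock();   // each block report is processed under the write lock
    try {
      report.run();
    } finally {
      fsLock.writeLock().unlock();
    }
  }

  void registerDatanode(Runnable registration) {
    fsLock.writeLock().lock();   // registrations queue behind in-flight reports
    try {
      registration.run();
    } finally {
      fsLock.writeLock().unlock();
    }
  }
}
{code}
A separate lifeline port keeps the lifeline queue from overflowing, but it does 
not shorten the time spent waiting for the write lock during the report storm.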

> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>         Attachments: HDFS-14186.001.patch
>
>
> In the current implementation, when a datanode restarts it sends a block 
> report immediately after it successfully registers with the namenode, and the 
> resulting blockreport storm puts the namenode under high load while it 
> processes them. One consequence is that some received RPCs have to be skipped 
> because their queue time has already exceeded the timeout. If a datanode's 
> heartbeat RPCs are continually skipped for long enough (by default 
> heartbeatExpireInterval=630s), the datanode is marked DEAD; it then has to 
> re-register and send its blockreport again, which aggravates the blockreport 
> storm, traps the cluster in a vicious circle, and slows down namenode startup 
> seriously (to more than an hour, or even longer), especially in a large 
> (several thousands of datanodes) and busy cluster. Although much work has 
> been done to optimize namenode startup, the issue still exists.
> I propose to postpone the dead datanode check until the namenode has finished 
> startup.
> Any comments and suggestions are welcome.
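For context on the 630s figure in the description, it follows from the stock 
heartbeat settings; a minimal sketch of the computation (defaults hardcoded for 
illustration, formula as I understand the NameNode's expiry calculation):
{code:java}
// expire = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
public class HeartbeatExpirySketch {
  public static void main(String[] args) {
    long recheckIntervalMs   = 5 * 60 * 1000; // dfs.namenode.heartbeat.recheck-interval = 300000 ms
    long heartbeatIntervalMs = 3 * 1000;      // dfs.heartbeat.interval = 3 s
    long expireMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    System.out.println(expireMs / 1000 + " s"); // prints: 630 s
  }
}
{code}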


