[jira] [Commented] (HDFS-14576) Avoid block report retry and slow down namenode startup

Chen Zhang (JIRA) Tue, 16 Jul 2019 05:34:13 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886081#comment-16886081
 ]


Chen Zhang commented on HDFS-14576:
-----------------------------------

Hi [~hexiaoqiao], we've met a similar problem in our production environment: 
when thousands of DataNodes send reports at almost the same time, the NameNode 
usually goes into full GC. Our solution is to throttle the maximum number of 
concurrent full block reports (FBRs), e.g. to 10. The NameNode rejects any 
extra FBR by throwing an exception, and if a DataNode receives that exception 
on its first FBR, it waits for a random period of time (within a given range) 
before retrying.
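
To make the idea concrete, here is a minimal sketch of the throttle described above. It is not our actual patch and not the HDFS-7923 implementation; the class name FbrThrottleSketch, the exception TooManyReportsException, and the method names are all hypothetical and exist only for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of an FBR throttle; not actual HDFS code. */
class FbrThrottleSketch {

    /** Hypothetical exception thrown by the NameNode side when it is at capacity. */
    static class TooManyReportsException extends RuntimeException {
        TooManyReportsException(String msg) {
            super(msg);
        }
    }

    // One permit per full block report that may be processed concurrently.
    private final Semaphore permits;

    FbrThrottleSketch(int maxConcurrentReports) {
        this.permits = new Semaphore(maxConcurrentReports);
    }

    /** NameNode side: admit the report, or reject it at once without blocking. */
    void beginReport() {
        if (!permits.tryAcquire()) {
            throw new TooManyReportsException("concurrent FBR limit reached");
        }
    }

    /** NameNode side: free the slot once the report has been processed. */
    void endReport() {
        permits.release();
    }

    /** DataNode side: choose a random delay in [minMs, maxMs) before retrying. */
    static long randomBackoffMs(long minMs, long maxMs) {
        return ThreadLocalRandom.current().nextLong(minMs, maxMs);
    }
}
```

A DataNode that catches the rejection on its first FBR would sleep for randomBackoffMs(min, max) and then send the report again; the random spread keeps rejected DataNodes from all retrying at the same moment.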

This solution works very well for us, so I wanted to contribute the code to 
the community, but while porting the commit I found that the latest version 
already supports this. It is implemented by the BlockReport Lease; see 
HDFS-7923.

We also tried HDFS-6763 and HDFS-7097, which you mentioned, but I think the 
block report throttling strategy is much more helpful on NameNode restart.
{quote}but in later CDH versions several patches have been backported that made 
the initial block report problem largely disappear. Unfortunately I don't have 
the list of Jiras and their relative impact
{quote}
FYI, [~sodonnell]

> Avoid block report retry and slow down namenode startup
> -------------------------------------------------------
>
>                 Key: HDFS-14576
>                 URL: https://issues.apache.org/jira/browse/HDFS-14576
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>
> During namenode startup, the load is very high because the namenode has to 
> process every datanode's block report one by one. If hundreds of datanode 
> block reports are pending, the issue becomes more serious, even though 
> #processFirstBlockReport is processed far more efficiently than ordinary 
> block reports. Some datanodes then retry their block reports, which 
> lengthens the restart time. I think we should filter out block report 
> requests (sent via datanode block report retries) that have already been 
> processed and return directly, to shorten the restart time. I want to note 
> that the benefit of this proposal may be obvious only for large clusters.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

