[ https://issues.apache.org/jira/browse/HDFS-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886081#comment-16886081 ]
Chen Zhang commented on HDFS-14576:
-----------------------------------
Hi [~hexiaoqiao], we've met a similar problem in our production environment:
thousands of DataNodes reporting at almost the same time usually cause a
NameNode full GC. Our solution is to throttle the maximum number of concurrent
full block reports (FBRs), e.g. to 10. The NameNode rejects any extra FBR by
throwing an exception, and if a DataNode receives the exception on its first
FBR, it waits for a random period (within a given range) before retrying.
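A rough sketch of the throttling idea described above. This is neither our actual patch nor the HDFS-7923 implementation; the class, exception, and method names are hypothetical, and a plain semaphore stands in for whatever bookkeeping the NameNode really does:

```java
import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: caps how many full block reports (FBRs) are processed at once. */
class FbrThrottle {
    /** Hypothetical exception that tells a DataNode to back off and retry later. */
    static class FbrRejectedException extends IOException {
        FbrRejectedException(String msg) {
            super(msg);
        }
    }

    private final Semaphore permits;

    FbrThrottle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** NameNode side: returns true if the report was admitted; never blocks. */
    boolean tryAdmit() {
        return permits.tryAcquire();
    }

    /** NameNode side: admit the report or reject it by throwing. */
    void admit() throws FbrRejectedException {
        if (!tryAdmit()) {
            throw new FbrRejectedException("too many concurrent full block reports");
        }
    }

    /** NameNode side: call once an admitted report has been fully processed. */
    void release() {
        permits.release();
    }

    /** DataNode side: random delay in [minMs, maxMs] to wait before retrying. */
    static long retryDelayMs(long minMs, long maxMs) {
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
    }
}
```

The random delay spreads the retries out, so rejected DataNodes do not all come back at the same moment.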
This solution works very well for us, so I wanted to contribute the code to the
community, but while porting the commit I found that the latest version
already supports this. It is implemented by the block report lease; see HDFS-7923.
We also tried HDFS-6763 and HDFS-7097 that you mentioned, but I think the block
report throttling strategy is much more helpful on NameNode restart.
{quote}but in later CDH versions several patches have been backported that made
the initial block report problem largely disappear. Unfortunately I don't have
the list of Jiras and their relative impact
{quote}
FYI, [~sodonnell]
> Avoid block report retry and slow down namenode startup
> -------------------------------------------------------
>
> Key: HDFS-14576
> URL: https://issues.apache.org/jira/browse/HDFS-14576
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: namenode
> Reporter: He Xiaoqiao
> Assignee: He Xiaoqiao
> Priority: Major
>
> During namenode startup, the load is very high because it has to process
> every datanode's block report one by one. If hundreds of datanode block
> reports are pending, the issue becomes more serious, even though
> #processFirstBlockReport is processed a lot more efficiently than ordinary
> block reports. Some datanodes will then retry their block reports, which
> lengthens the restart time. I think we should filter out block report
> requests (from datanode block report retries) that have already been
> processed and return directly, which would shorten the restart time. Note
> that the benefit of this proposal may be noticeable only on large clusters.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)