[jira] [Comment Edited] (HDFS-10365) FullBlockReports retransmission delays NN startup time in large cluster.

Arpit Agarwal (JIRA) Wed, 04 May 2016 12:38:19 -0700

    [ https://issues.apache.org/jira/browse/HDFS-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271291#comment-15271291 ]


Arpit Agarwal edited comment on HDFS-10365 at 5/4/16 7:37 PM:
--------------------------------------------------------------

Hi [~chackra], Chris mentioned a good list of Jiras to help with this problem.

Meanwhile, you can try increasing {{dfs.blockreport.initialDelay}}. A rule of 
thumb I often use is 2*numDataNodes seconds, so for a 500-node cluster I'd set 
it to 1000. It is a very conservative value and it increases the total NN 
startup time, but I find it effective at improving the startup stability of the 
NameNode on older releases like 2.6.x that don't have some of the performance 
fixes listed by Chris.
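
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the helper names initial_delay_seconds and hdfs_site_property are hypothetical, the factor of 2 is the heuristic from this comment rather than an official recommendation, and the value is assumed to be given in seconds in hdfs-site.xml, as on 2.6.x.

```python
from xml.sax.saxutils import escape


def initial_delay_seconds(num_datanodes: int, factor: int = 2) -> int:
    """Rule of thumb from the comment: factor * numDataNodes, in seconds."""
    if num_datanodes < 0:
        raise ValueError("num_datanodes must be non-negative")
    return factor * num_datanodes


def hdfs_site_property(name: str, value: int) -> str:
    """Render one <property> element in hdfs-site.xml style."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # 500 nodes is the example from the comment; 1200 is the reporter's cluster.
    for nodes in (500, 1200):
        delay = initial_delay_seconds(nodes)
        print(f"# {nodes} DataNodes")
        print(hdfs_site_property("dfs.blockreport.initialDelay", delay))
```

The printed property element would go into hdfs-site.xml on the DataNodes. A larger value spreads the first full block reports over a longer window, at the cost of a longer total NN startup.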



was (Author: arpitagarwal):
Hi [~chackra], Chris mentioned a good list of Jiras to help with this problem.

Meanwhile, you can try increasing {{dfs.blockreport.initialDelay}}. A rule of 
thumb I often use is 2*numDataNodes seconds, so for a 500-node cluster I'd set 
it to 1000. It is a very conservative value and it increases the NN startup 
time, but I find it effective at improving the startup performance of the NameNode 
on older releases like 2.6.x that don't have some of the performance fixes 
listed by Chris.


> FullBlockReports retransmission delays NN startup time in large cluster.
> ------------------------------------------------------------------------
>
>                 Key: HDFS-10365
>                 URL: https://issues.apache.org/jira/browse/HDFS-10365
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.6.0
>         Environment: version - hadoop-2.6.0 (hdp-2.2)
> DN - 1200 nodes
>            Reporter: Chackaravarthy
>            Priority: Critical
>
> Whenever the NN is restarted, it takes a very long time to return to a stable 
> state, i.e. the Last contact time stays above 1 to 2 minutes continuously for 
> around 3 to 4 hours. This is mainly because most of the DNs time out (60s) in 
> the blockReport (FBR) RPC call and then keep resending the FBR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
