[
https://issues.apache.org/jira/browse/HDFS-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271208#comment-15271208
]
Chackaravarthy commented on HDFS-10365:
---------------------------------------
Cluster details :
Version - Hadoop-2.6.0
Number of datanodes - 1200
NN hardware - 40-core machine, 74G heap allocated to the NN process
Total blocks - 80M+
Total Files/Directories - 60M+
Total FSObjects - 150M+
ipc.ping.interval=60s (default)
dfs.blockreport.initialDelay=120
dfs.namenode.service.handler.count=600
The NN takes more than 1 minute to process an FBR because the write lock is
released after processing the report for each storage. Since the block report
initial delay is set to 120s, the NN is flooded with FBRs from all DNs. After
each storage report is processed, the handler has to contend for the lock
again, and by the time it has processed all storages (10 data dirs are
configured), the DN's RPC call has timed out.
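To put rough numbers on the flood described above, here is a minimal sketch in
Python. It assumes each DN sends its first FBR at a uniformly random time
within dfs.blockreport.initialDelay and that the 60s timeout applies to the
whole blockReport call. The function name and the uniform-arrival model are
illustrative assumptions, not taken from Hadoop code.

```python
def fbr_load(datanodes, initial_delay_s, storages_per_dn, rpc_timeout_s):
    """Back-of-the-envelope FBR load on the NN right after a restart.

    Assumes first reports are spread evenly over the initial delay window
    (an illustrative model, not Hadoop's actual scheduling code).
    """
    # Average number of full block reports arriving per second.
    reports_per_s = datanodes / initial_delay_s
    # With the lock taken once per storage, each FBR acquires it this many times.
    lock_acquisitions_per_s = reports_per_s * storages_per_dn
    # Average time available per storage (lock wait included) before the RPC times out.
    budget_per_storage_s = rpc_timeout_s / storages_per_dn
    return reports_per_s, lock_acquisitions_per_s, budget_per_storage_s


if __name__ == "__main__":
    # Cluster figures from this comment: 1200 DNs, 120s delay, 10 data dirs, 60s timeout.
    print(fbr_load(1200, 120, 10, 60))
```

With the cluster figures above this works out to about 10 FBRs per second, 100
per-storage lock acquisitions per second, and an average budget of 6 seconds
per storage, lock wait included, before the DN times out.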
What if we acquire the lock at the start and release it only after processing
the reports for all storages? Since FBRs are infrequent (only at NN or DN
startup, once every 6 hours, or when a disk fails in a DN), would this change
impact the normal heartbeat/IBR flow? Or is acquiring the lock for each
storage report intentional? I could not find any comment about this in
HDFS-4987. Please correct me if I am wrong.
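The two locking strategies being compared can be sketched as a toy model in
Python. This is not the NameNode implementation: the class, the method names
and the plain mutex standing in for the namesystem write lock are all
hypothetical, chosen only to show how many times a single FBR has to win the
lock under each strategy.

```python
import threading


class ToyNameNode:
    """Toy model of FBR processing; not Hadoop code."""

    def __init__(self):
        self.write_lock = threading.Lock()  # stands in for the namesystem write lock
        self.acquisitions = 0               # how often the lock was taken
        self.processed = []                 # (datanode, storage) pairs handled

    def _process_storage(self, dn, storage):
        # Placeholder for the real per-storage block report processing.
        self.processed.append((dn, storage))

    def report_lock_per_storage(self, dn, storages):
        # Current behaviour as described above: lock released between storages,
        # so the handler must contend for it again before each storage.
        for storage in storages:
            with self.write_lock:
                self.acquisitions += 1
                self._process_storage(dn, storage)

    def report_lock_once(self, dn, storages):
        # Proposed behaviour: take the lock once and hold it for the whole report.
        with self.write_lock:
            self.acquisitions += 1
            for storage in storages:
                self._process_storage(dn, storage)


if __name__ == "__main__":
    dirs = [f"data{i}" for i in range(10)]  # 10 data dirs, as in this cluster
    a, b = ToyNameNode(), ToyNameNode()
    a.report_lock_per_storage("dn1", dirs)
    b.report_lock_once("dn1", dirs)
    print(a.acquisitions, b.acquisitions)
```

Holding the lock once means one wait in the queue per FBR instead of one per
storage, at the cost of a longer single hold during which heartbeats and IBRs
cannot take the lock. Whether that longer hold is acceptable is exactly the
open question here.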
The config options I can see to work around this are increasing the block
report initial delay or increasing ipc.ping.interval. Also, maybe 600 is not
the right value for the service handler count. Is there any guideline for
setting the service handler count depending on cluster size?
Thanks.
> FullBlockReports retransmission delays NN startup time in large cluster.
> ------------------------------------------------------------------------
>
> Key: HDFS-10365
> URL: https://issues.apache.org/jira/browse/HDFS-10365
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.6.0
> Environment: version - hadoop-2.6.0
> DN - 1200 nodes
> Reporter: Chackaravarthy
> Priority: Critical
>
> Whenever the NN is restarted, it takes a very long time to return to a stable
> state, i.e. the last contact time stays above 1 to 2 minutes continuously for
> around 3 to 4 hours. This is mainly because most of the DNs time out
> (60s) in the blockReport (FBR) RPC call and then keep resending the FBR.