yunjiong zhao created HDFS-10477:
------------------------------------
Summary: Stop decommission a rack of DataNodes caused NameNode
fail over to standby
Key: HDFS-10477
URL: https://issues.apache.org/jira/browse/HDFS-10477
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.7.2
Reporter: yunjiong zhao
Assignee: yunjiong zhao
In our cluster, when we stopped decommissioning a rack that has 46 DataNodes,
the NameNode held the Namesystem lock for about 7 minutes, as the log below
shows:
{code}
2016-05-26 20:11:41,697 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.27:1004
2016-05-26 20:11:51,171 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 285258
over-replicated blocks on 10.142.27.27:1004 during recommissioning
2016-05-26 20:11:51,171 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.118:1004
2016-05-26 20:11:59,972 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 279923
over-replicated blocks on 10.142.27.118:1004 during recommissioning
2016-05-26 20:11:59,972 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.113:1004
2016-05-26 20:12:09,007 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 294307
over-replicated blocks on 10.142.27.113:1004 during recommissioning
2016-05-26 20:12:09,008 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.117:1004
2016-05-26 20:12:18,055 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 314381
over-replicated blocks on 10.142.27.117:1004 during recommissioning
2016-05-26 20:12:18,056 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.130:1004
2016-05-26 20:12:25,938 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 272779
over-replicated blocks on 10.142.27.130:1004 during recommissioning
2016-05-26 20:12:25,939 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.121:1004
2016-05-26 20:12:34,134 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 287248
over-replicated blocks on 10.142.27.121:1004 during recommissioning
2016-05-26 20:12:34,134 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.33:1004
2016-05-26 20:12:43,020 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 299868
over-replicated blocks on 10.142.27.33:1004 during recommissioning
2016-05-26 20:12:43,020 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.137:1004
2016-05-26 20:12:52,220 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 303914
over-replicated blocks on 10.142.27.137:1004 during recommissioning
2016-05-26 20:12:52,220 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.51:1004
2016-05-26 20:13:00,362 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 281175
over-replicated blocks on 10.142.27.51:1004 during recommissioning
2016-05-26 20:13:00,362 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.12:1004
2016-05-26 20:13:08,756 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 274880
over-replicated blocks on 10.142.27.12:1004 during recommissioning
2016-05-26 20:13:08,757 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.15:1004
2016-05-26 20:13:17,185 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 286334
over-replicated blocks on 10.142.27.15:1004 during recommissioning
2016-05-26 20:13:17,185 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.14:1004
2016-05-26 20:13:25,369 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 280219
over-replicated blocks on 10.142.27.14:1004 during recommissioning
2016-05-26 20:13:25,370 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.28:1004
2016-05-26 20:13:33,768 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 280623
over-replicated blocks on 10.142.27.28:1004 during recommissioning
2016-05-26 20:13:33,769 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.119:1004
2016-05-26 20:13:42,816 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 294675
over-replicated blocks on 10.142.27.119:1004 during recommissioning
2016-05-26 20:13:42,816 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.110:1004
2016-05-26 20:13:52,458 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 304269
over-replicated blocks on 10.142.27.110:1004 during recommissioning
2016-05-26 20:13:52,458 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.123:1004
2016-05-26 20:14:01,096 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 289332
over-replicated blocks on 10.142.27.123:1004 during recommissioning
2016-05-26 20:14:01,096 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.111:1004
2016-05-26 20:14:09,383 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 276981
over-replicated blocks on 10.142.27.111:1004 during recommissioning
2016-05-26 20:14:09,383 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.116:1004
2016-05-26 20:14:18,368 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 301089
over-replicated blocks on 10.142.27.116:1004 during recommissioning
2016-05-26 20:14:18,369 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.144:1004
2016-05-26 20:14:26,664 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 282171
over-replicated blocks on 10.142.27.144:1004 during recommissioning
2016-05-26 20:14:26,664 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.120:1004
2016-05-26 20:14:35,380 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 295046
over-replicated blocks on 10.142.27.120:1004 during recommissioning
2016-05-26 20:14:35,380 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.16:1004
2016-05-26 20:14:41,319 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 197929
over-replicated blocks on 10.142.27.16:1004 during recommissioning
2016-05-26 20:14:41,319 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.11:1004
2016-05-26 20:14:51,145 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 308037
over-replicated blocks on 10.142.27.11:1004 during recommissioning
2016-05-26 20:14:51,145 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.129:1004
2016-05-26 20:14:59,574 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 281704
over-replicated blocks on 10.142.27.129:1004 during recommissioning
2016-05-26 20:14:59,574 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.146:1004
2016-05-26 20:15:09,600 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 324806
over-replicated blocks on 10.142.27.146:1004 during recommissioning
2016-05-26 20:15:09,600 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.128:1004
2016-05-26 20:15:18,428 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 286412
over-replicated blocks on 10.142.27.128:1004 during recommissioning
2016-05-26 20:15:18,428 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.38:1004
2016-05-26 20:15:26,750 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 275447
over-replicated blocks on 10.142.27.38:1004 during recommissioning
2016-05-26 20:15:26,751 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.135:1004
2016-05-26 20:15:35,807 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 300286
over-replicated blocks on 10.142.27.135:1004 during recommissioning
2016-05-26 20:15:35,807 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.109:1004
2016-05-26 20:15:44,768 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 288725
over-replicated blocks on 10.142.27.109:1004 during recommissioning
2016-05-26 20:15:44,768 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.54:1004
2016-05-26 20:15:52,674 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 254111
over-replicated blocks on 10.142.27.54:1004 during recommissioning
2016-05-26 20:15:52,674 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.40:1004
2016-05-26 20:16:01,130 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 282691
over-replicated blocks on 10.142.27.40:1004 during recommissioning
2016-05-26 20:16:01,130 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.13:1004
2016-05-26 20:16:11,217 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 316102
over-replicated blocks on 10.142.27.13:1004 during recommissioning
2016-05-26 20:16:11,217 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.34:1004
2016-05-26 20:16:20,910 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 317771
over-replicated blocks on 10.142.27.34:1004 during recommissioning
2016-05-26 20:16:20,910 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.124:1004
2016-05-26 20:16:30,183 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 300669
over-replicated blocks on 10.142.27.124:1004 during recommissioning
2016-05-26 20:16:30,184 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.131:1004
2016-05-26 20:16:36,468 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 199658
over-replicated blocks on 10.142.27.131:1004 during recommissioning
2016-05-26 20:16:36,469 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.18:1004
2016-05-26 20:16:46,541 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 298408
over-replicated blocks on 10.142.27.18:1004 during recommissioning
2016-05-26 20:16:46,541 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.19:1004
2016-05-26 20:16:56,264 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 298501
over-replicated blocks on 10.142.27.19:1004 during recommissioning
2016-05-26 20:16:56,264 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.112:1004
2016-05-26 20:17:05,809 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 289439
over-replicated blocks on 10.142.27.112:1004 during recommissioning
2016-05-26 20:17:05,809 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.122:1004
2016-05-26 20:17:15,900 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 304616
over-replicated blocks on 10.142.27.122:1004 during recommissioning
2016-05-26 20:17:15,900 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.29:1004
2016-05-26 20:17:24,984 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 297533
over-replicated blocks on 10.142.27.29:1004 during recommissioning
2016-05-26 20:17:24,984 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.143:1004
2016-05-26 20:17:33,924 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 293859
over-replicated blocks on 10.142.27.143:1004 during recommissioning
2016-05-26 20:17:33,924 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.107:1004
2016-05-26 20:17:43,334 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 311050
over-replicated blocks on 10.142.27.107:1004 during recommissioning
2016-05-26 20:17:43,334 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.20:1004
2016-05-26 20:17:52,701 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 306078
over-replicated blocks on 10.142.27.20:1004 during recommissioning
2016-05-26 20:17:52,701 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.22:1004
2016-05-26 20:18:00,305 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 258606
over-replicated blocks on 10.142.27.22:1004 during recommissioning
2016-05-26 20:18:00,305 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.32:1004
2016-05-26 20:18:00,305 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.17:1004
2016-05-26 20:18:08,642 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 273960
over-replicated blocks on 10.142.27.17:1004 during recommissioning
2016-05-26 20:18:08,642 INFO
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
Decommissioning 10.142.27.50:1004
2016-05-26 20:18:17,064 INFO
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 283001
over-replicated blocks on 10.142.27.50:1004 during recommissioning
{code}
This caused the ZKFC health monitor to time out (hostnames replaced with *):
{code}
2016-05-26 20:17:42,634 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
*/10.103.108.200:8030: Call From */10.103.108.13 to *:8030 failed on socket
timeout exception: java.net.SocketTimeoutException: 360000 millis timeout while
waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.103.108.200:51433
remote=*/10.103.108.200:8030]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
{code}
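As a rough cross-check of the two logs above, the sketch below compares the time spent recommissioning the rack with the 360000 ms timeout reported by the HealthMonitor. The timestamps are copied from the log excerpts; the variable names are illustrative. Treating the timeout as a single health-check call that stayed blocked for the whole period is an assumption, not something the logs state directly.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the NameNode and ZKFC log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse a log timestamp such as '2016-05-26 20:11:41,697'."""
    return datetime.strptime(text, LOG_TS_FORMAT)


# First "Stop Decommissioning" entry and last "Invalidated ..." entry.
first_stop = parse_ts("2016-05-26 20:11:41,697")
last_invalidated = parse_ts("2016-05-26 20:18:17,064")

# Time of the HealthMonitor warning and the timeout it reports.
zkfc_warning = parse_ts("2016-05-26 20:17:42,634")
rpc_timeout = timedelta(milliseconds=360000)

# How long the recommission of the whole rack took according to the log.
recommission_span = last_invalidated - first_stop

# Assumption: the timed-out call was issued exactly one timeout earlier.
probable_call_start = zkfc_warning - rpc_timeout

print("recommission span (s):", recommission_span.total_seconds())
print("exceeds health-check timeout:", recommission_span > rpc_timeout)
print("health check probably issued at:", probable_call_start)
print("seconds after first stop entry:",
      (probable_call_start - first_stop).total_seconds())
```

The span comes out at roughly 395 seconds against a 360 second timeout. Under the stated assumption, the health check was issued about one second after the first Stop Decommissioning entry, which is consistent with the call waiting for the whole time the recommission was running.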