[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-10477:
---------------------------------
    Status: Patch Available  (was: Open)

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10477
>                 URL: https://issues.apache.org/jira/browse/HDFS-10477
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: HDFS-10477.patch
>
>
> In our cluster, when we stopped decommissioning a rack that has 46 DataNodes, 
> the Namesystem was locked for about 7 minutes, as the log below shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.14:1004
> 2016-05-26 20:13:25,369 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 280219 over-replicated blocks on 10.142.27.14:1004 during recommissioning
> 2016-05-26 20:13:25,370 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.28:1004
> 2016-05-26 20:13:33,768 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 280623 over-replicated blocks on 10.142.27.28:1004 during recommissioning
> 2016-05-26 20:13:33,769 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.119:1004
> 2016-05-26 20:13:42,816 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294675 over-replicated blocks on 10.142.27.119:1004 during recommissioning
> 2016-05-26 20:13:42,816 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.110:1004
> 2016-05-26 20:13:52,458 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 304269 over-replicated blocks on 10.142.27.110:1004 during recommissioning
> 2016-05-26 20:13:52,458 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.123:1004
> 2016-05-26 20:14:01,096 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 289332 over-replicated blocks on 10.142.27.123:1004 during recommissioning
> 2016-05-26 20:14:01,096 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.111:1004
> 2016-05-26 20:14:09,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 276981 over-replicated blocks on 10.142.27.111:1004 during recommissioning
> 2016-05-26 20:14:09,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.116:1004
> 2016-05-26 20:14:18,368 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 301089 over-replicated blocks on 10.142.27.116:1004 during recommissioning
> 2016-05-26 20:14:18,369 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.144:1004
> 2016-05-26 20:14:26,664 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 282171 over-replicated blocks on 10.142.27.144:1004 during recommissioning
> 2016-05-26 20:14:26,664 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.120:1004
> 2016-05-26 20:14:35,380 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 295046 over-replicated blocks on 10.142.27.120:1004 during recommissioning
> 2016-05-26 20:14:35,380 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.16:1004
> 2016-05-26 20:14:41,319 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 197929 over-replicated blocks on 10.142.27.16:1004 during recommissioning
> 2016-05-26 20:14:41,319 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.11:1004
> 2016-05-26 20:14:51,145 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 308037 over-replicated blocks on 10.142.27.11:1004 during recommissioning
> 2016-05-26 20:14:51,145 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.129:1004
> 2016-05-26 20:14:59,574 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281704 over-replicated blocks on 10.142.27.129:1004 during recommissioning
> 2016-05-26 20:14:59,574 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.146:1004
> 2016-05-26 20:15:09,600 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 324806 over-replicated blocks on 10.142.27.146:1004 during recommissioning
> 2016-05-26 20:15:09,600 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.128:1004
> 2016-05-26 20:15:18,428 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286412 over-replicated blocks on 10.142.27.128:1004 during recommissioning
> 2016-05-26 20:15:18,428 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.38:1004
> 2016-05-26 20:15:26,750 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 275447 over-replicated blocks on 10.142.27.38:1004 during recommissioning
> 2016-05-26 20:15:26,751 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.135:1004
> 2016-05-26 20:15:35,807 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 300286 over-replicated blocks on 10.142.27.135:1004 during recommissioning
> 2016-05-26 20:15:35,807 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.109:1004
> 2016-05-26 20:15:44,768 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 288725 over-replicated blocks on 10.142.27.109:1004 during recommissioning
> 2016-05-26 20:15:44,768 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.54:1004
> 2016-05-26 20:15:52,674 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 254111 over-replicated blocks on 10.142.27.54:1004 during recommissioning
> 2016-05-26 20:15:52,674 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.40:1004
> 2016-05-26 20:16:01,130 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 282691 over-replicated blocks on 10.142.27.40:1004 during recommissioning
> 2016-05-26 20:16:01,130 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.13:1004
> 2016-05-26 20:16:11,217 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 316102 over-replicated blocks on 10.142.27.13:1004 during recommissioning
> 2016-05-26 20:16:11,217 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.34:1004
> 2016-05-26 20:16:20,910 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 317771 over-replicated blocks on 10.142.27.34:1004 during recommissioning
> 2016-05-26 20:16:20,910 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.124:1004
> 2016-05-26 20:16:30,183 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 300669 over-replicated blocks on 10.142.27.124:1004 during recommissioning
> 2016-05-26 20:16:30,184 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.131:1004
> 2016-05-26 20:16:36,468 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 199658 over-replicated blocks on 10.142.27.131:1004 during recommissioning
> 2016-05-26 20:16:36,469 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.18:1004
> 2016-05-26 20:16:46,541 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 298408 over-replicated blocks on 10.142.27.18:1004 during recommissioning
> 2016-05-26 20:16:46,541 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.19:1004
> 2016-05-26 20:16:56,264 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 298501 over-replicated blocks on 10.142.27.19:1004 during recommissioning
> 2016-05-26 20:16:56,264 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.112:1004
> 2016-05-26 20:17:05,809 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 289439 over-replicated blocks on 10.142.27.112:1004 during recommissioning
> 2016-05-26 20:17:05,809 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.122:1004
> 2016-05-26 20:17:15,900 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 304616 over-replicated blocks on 10.142.27.122:1004 during recommissioning
> 2016-05-26 20:17:15,900 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.29:1004
> 2016-05-26 20:17:24,984 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 297533 over-replicated blocks on 10.142.27.29:1004 during recommissioning
> 2016-05-26 20:17:24,984 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.143:1004
> 2016-05-26 20:17:33,924 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 293859 over-replicated blocks on 10.142.27.143:1004 during recommissioning
> 2016-05-26 20:17:33,924 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.107:1004
> 2016-05-26 20:17:43,334 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 311050 over-replicated blocks on 10.142.27.107:1004 during recommissioning
> 2016-05-26 20:17:43,334 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.20:1004
> 2016-05-26 20:17:52,701 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 306078 over-replicated blocks on 10.142.27.20:1004 during recommissioning
> 2016-05-26 20:17:52,701 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.22:1004
> 2016-05-26 20:18:00,305 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 258606 over-replicated blocks on 10.142.27.22:1004 during recommissioning
> 2016-05-26 20:18:00,305 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.32:1004
> 2016-05-26 20:18:00,305 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.17:1004
> 2016-05-26 20:18:08,642 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 273960 over-replicated blocks on 10.142.27.17:1004 during recommissioning
> 2016-05-26 20:18:08,642 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.50:1004
> 2016-05-26 20:18:17,064 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 283001 over-replicated blocks on 10.142.27.50:1004 during recommissioning
> {code}
> This caused a ZKFC timeout (hostnames replaced with *):
> {code}
> 2016-05-26 20:17:42,634 WARN org.apache.hadoop.ha.HealthMonitor: 
> Transport-level exception trying to monitor health of NameNode at 
> */10.103.108.200:8030: Call From */10.103.108.13 to *:8030 failed on socket 
> timeout exception: java.net.SocketTimeoutException: 360000 millis timeout 
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.103.108.200:51433 
> remote=*/10.103.108.200:8030]; For more details see:  
> http://wiki.apache.org/hadoop/SocketTimeout
> {code}
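
To make the timing in the two log excerpts concrete, here is a minimal sketch that measures how long each "Stop Decommissioning" step took and compares the overall span with the 360000 ms timeout reported by the HealthMonitor. It is only an illustration: the function and variable names are hypothetical, the regular expressions assume each log entry sits on a single line (the entries above are wrapped by the mail client), and the sample contains just the first and last node from the excerpt.

```python
import re
from datetime import datetime

# First and last node from the NameNode log excerpt, re-joined onto single lines.
SAMPLE_LOG = """\
2016-05-26 20:11:41,697 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.27:1004
2016-05-26 20:11:51,171 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
2016-05-26 20:18:08,642 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.50:1004
2016-05-26 20:18:17,064 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 283001 over-replicated blocks on 10.142.27.50:1004 during recommissioning
"""

# Timeout value taken from the SocketTimeoutException message in the ZKFC log.
HEALTH_TIMEOUT_MS = 360000

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
STOP_RE = re.compile(TS + r" INFO \S+DatanodeManager: Stop Decommissioning (\S+)")
DONE_RE = re.compile(
    TS + r" INFO \S+BlockManager: Invalidated (\d+) over-replicated blocks"
    r" on (\S+) during recommissioning"
)


def parse_ts(text):
    """Parse a log4j-style timestamp such as '2016-05-26 20:11:41,697'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def summarize(log_text):
    """Return (per_node, total_seconds).

    per_node maps 'ip:port' to (seconds between the Stop and Invalidated
    entries, number of invalidated blocks). total_seconds spans the first
    Stop entry to the last Invalidated entry.
    """
    started = {}
    per_node = {}
    first = last = None
    for line in log_text.splitlines():
        m = STOP_RE.match(line)
        if m:
            ts = parse_ts(m.group(1))
            started[m.group(2)] = ts
            if first is None:
                first = ts
            continue
        m = DONE_RE.match(line)
        if m:
            ts = parse_ts(m.group(1))
            node = m.group(3)
            if node in started:
                elapsed = (ts - started[node]).total_seconds()
                per_node[node] = (elapsed, int(m.group(2)))
            last = ts
    total = (last - first).total_seconds() if first and last else 0.0
    return per_node, total


if __name__ == "__main__":
    per_node, total = summarize(SAMPLE_LOG)
    for node, (seconds, blocks) in per_node.items():
        print(f"{node}: {blocks} blocks invalidated in {seconds:.3f} s")
    print(f"span: {total:.3f} s, timeout: {HEALTH_TIMEOUT_MS / 1000:.0f} s")
```

On the sample, each node takes roughly 8 to 10 seconds, and the span from the first to the last entry is longer than the 360-second timeout. That is consistent with the report: if the Namesystem lock is held for the whole sequence, a health check issued near the start cannot complete before the timeout expires.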



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
