[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366966#comment-15366966
 ] 

yunjiong zhao commented on HDFS-10477:
--------------------------------------

Those failed unit tests are not related to this patch.
There is also no need to add a new unit test for this patch, since it only adds 
steps to release the nameSystem writeLock and then acquire the lock again.
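
For readers who have not opened the patch, the general pattern of releasing a 
write lock between units of work and then re-acquiring it can be sketched as 
below. This is only an illustration under stated assumptions, not the patch 
itself and not Hadoop code: the class LockYieldSketch, the method processAll, 
and the per-node callback are hypothetical names, and a plain 
java.util.concurrent ReentrantReadWriteLock in fair mode stands in for the 
nameSystem lock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

/** Hypothetical sketch of lock yielding; not the HDFS-10477 patch. */
final class LockYieldSketch {
    // Assumption: fair mode, so threads already waiting are granted the lock
    // before this thread can take it back.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    /**
     * Runs work on each node under the write lock, releasing and
     * re-acquiring the lock between nodes.
     * Returns how many times the lock was released mid-loop.
     */
    <T> int processAll(List<T> nodes, Consumer<T> work) {
        int yields = 0;
        lock.writeLock().lock();
        try {
            for (int i = 0; i < nodes.size(); i++) {
                work.accept(nodes.get(i)); // one unit of work under the lock
                if (i < nodes.size() - 1) {
                    lock.writeLock().unlock(); // let waiting callers run
                    yields++;
                    lock.writeLock().lock();   // take the lock back
                }
            }
        } finally {
            lock.writeLock().unlock();
        }
        return yields;
    }

    boolean heldByCaller() {
        return lock.isWriteLockedByCurrentThread();
    }

    public static void main(String[] args) {
        LockYieldSketch sketch = new LockYieldSketch();
        List<String> done = new ArrayList<>();
        int yields = sketch.processAll(Arrays.asList("dn1", "dn2", "dn3"), done::add);
        System.out.println("processed=" + done.size() + " yields=" + yields);
    }
}
```

One consequence of this pattern is that other threads may change shared state 
while the lock is released, so each unit of work has to be valid on its own.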

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10477
>                 URL: https://issues.apache.org/jira/browse/HDFS-10477
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, 
> HDFS-10477.patch
>
>
> In our cluster, when we stopped decommissioning a rack that has 46 DataNodes, 
> the Namesystem was locked for about 7 minutes, as the log below shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.14:1004
> 2016-05-26 20:13:25,369 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 280219 over-replicated blocks on 10.142.27.14:1004 during recommissioning
> 2016-05-26 20:13:25,370 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.28:1004
> 2016-05-26 20:13:33,768 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 280623 over-replicated blocks on 10.142.27.28:1004 during recommissioning
> 2016-05-26 20:13:33,769 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.119:1004
> 2016-05-26 20:13:42,816 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294675 over-replicated blocks on 10.142.27.119:1004 during recommissioning
> 2016-05-26 20:13:42,816 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.110:1004
> 2016-05-26 20:13:52,458 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 304269 over-replicated blocks on 10.142.27.110:1004 during recommissioning
> 2016-05-26 20:13:52,458 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.123:1004
> 2016-05-26 20:14:01,096 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 289332 over-replicated blocks on 10.142.27.123:1004 during recommissioning
> 2016-05-26 20:14:01,096 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.111:1004
> 2016-05-26 20:14:09,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 276981 over-replicated blocks on 10.142.27.111:1004 during recommissioning
> 2016-05-26 20:14:09,383 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.116:1004
> 2016-05-26 20:14:18,368 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 301089 over-replicated blocks on 10.142.27.116:1004 during recommissioning
> 2016-05-26 20:14:18,369 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.144:1004
> 2016-05-26 20:14:26,664 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 282171 over-replicated blocks on 10.142.27.144:1004 during recommissioning
> 2016-05-26 20:14:26,664 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.120:1004
> 2016-05-26 20:14:35,380 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 295046 over-replicated blocks on 10.142.27.120:1004 during recommissioning
> 2016-05-26 20:14:35,380 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.16:1004
> 2016-05-26 20:14:41,319 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 197929 over-replicated blocks on 10.142.27.16:1004 during recommissioning
> 2016-05-26 20:14:41,319 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.11:1004
> 2016-05-26 20:14:51,145 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 308037 over-replicated blocks on 10.142.27.11:1004 during recommissioning
> 2016-05-26 20:14:51,145 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.129:1004
> 2016-05-26 20:14:59,574 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281704 over-replicated blocks on 10.142.27.129:1004 during recommissioning
> 2016-05-26 20:14:59,574 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.146:1004
> 2016-05-26 20:15:09,600 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 324806 over-replicated blocks on 10.142.27.146:1004 during recommissioning
> 2016-05-26 20:15:09,600 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.128:1004
> 2016-05-26 20:15:18,428 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286412 over-replicated blocks on 10.142.27.128:1004 during recommissioning
> 2016-05-26 20:15:18,428 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.38:1004
> 2016-05-26 20:15:26,750 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 275447 over-replicated blocks on 10.142.27.38:1004 during recommissioning
> 2016-05-26 20:15:26,751 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.135:1004
> 2016-05-26 20:15:35,807 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 300286 over-replicated blocks on 10.142.27.135:1004 during recommissioning
> 2016-05-26 20:15:35,807 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.109:1004
> 2016-05-26 20:15:44,768 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 288725 over-replicated blocks on 10.142.27.109:1004 during recommissioning
> 2016-05-26 20:15:44,768 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.54:1004
> 2016-05-26 20:15:52,674 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 254111 over-replicated blocks on 10.142.27.54:1004 during recommissioning
> 2016-05-26 20:15:52,674 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.40:1004
> 2016-05-26 20:16:01,130 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 282691 over-replicated blocks on 10.142.27.40:1004 during recommissioning
> 2016-05-26 20:16:01,130 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.13:1004
> 2016-05-26 20:16:11,217 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 316102 over-replicated blocks on 10.142.27.13:1004 during recommissioning
> 2016-05-26 20:16:11,217 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.34:1004
> 2016-05-26 20:16:20,910 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 317771 over-replicated blocks on 10.142.27.34:1004 during recommissioning
> 2016-05-26 20:16:20,910 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.124:1004
> 2016-05-26 20:16:30,183 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 300669 over-replicated blocks on 10.142.27.124:1004 during recommissioning
> 2016-05-26 20:16:30,184 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.131:1004
> 2016-05-26 20:16:36,468 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 199658 over-replicated blocks on 10.142.27.131:1004 during recommissioning
> 2016-05-26 20:16:36,469 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.18:1004
> 2016-05-26 20:16:46,541 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 298408 over-replicated blocks on 10.142.27.18:1004 during recommissioning
> 2016-05-26 20:16:46,541 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.19:1004
> 2016-05-26 20:16:56,264 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 298501 over-replicated blocks on 10.142.27.19:1004 during recommissioning
> 2016-05-26 20:16:56,264 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.112:1004
> 2016-05-26 20:17:05,809 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 289439 over-replicated blocks on 10.142.27.112:1004 during recommissioning
> 2016-05-26 20:17:05,809 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.122:1004
> 2016-05-26 20:17:15,900 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 304616 over-replicated blocks on 10.142.27.122:1004 during recommissioning
> 2016-05-26 20:17:15,900 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.29:1004
> 2016-05-26 20:17:24,984 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 297533 over-replicated blocks on 10.142.27.29:1004 during recommissioning
> 2016-05-26 20:17:24,984 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.143:1004
> 2016-05-26 20:17:33,924 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 293859 over-replicated blocks on 10.142.27.143:1004 during recommissioning
> 2016-05-26 20:17:33,924 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.107:1004
> 2016-05-26 20:17:43,334 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 311050 over-replicated blocks on 10.142.27.107:1004 during recommissioning
> 2016-05-26 20:17:43,334 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.20:1004
> 2016-05-26 20:17:52,701 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 306078 over-replicated blocks on 10.142.27.20:1004 during recommissioning
> 2016-05-26 20:17:52,701 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.22:1004
> 2016-05-26 20:18:00,305 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 258606 over-replicated blocks on 10.142.27.22:1004 during recommissioning
> 2016-05-26 20:18:00,305 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.32:1004
> 2016-05-26 20:18:00,305 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.17:1004
> 2016-05-26 20:18:08,642 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 273960 over-replicated blocks on 10.142.27.17:1004 during recommissioning
> 2016-05-26 20:18:08,642 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.50:1004
> 2016-05-26 20:18:17,064 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 283001 over-replicated blocks on 10.142.27.50:1004 during recommissioning
> {code}
> This caused a ZKFC timeout (hostnames replaced with *):
> {code}
> 2016-05-26 20:17:42,634 WARN org.apache.hadoop.ha.HealthMonitor: 
> Transport-level exception trying to monitor health of NameNode at 
> */10.103.108.200:8030: Call From */10.103.108.13 to *:8030 failed on socket 
> timeout exception: java.net.SocketTimeoutException: 360000 millis timeout 
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.103.108.200:51433 
> remote=*/10.103.108.200:8030]; For more details see:  
> http://wiki.apache.org/hadoop/SocketTimeout
> {code}
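
As a rough check of the numbers quoted above, the time between the first and 
last log lines can be compared with the 360000 ms socket timeout in the ZKFC 
warning. The sketch below assumes the lock was held continuously between 
those two timestamps, as the description states; the class LockHoldEstimate 
and the method elapsedMillis are hypothetical names used only for this 
calculation.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: compares the logged time span with the timeout. */
final class LockHoldEstimate {
    // Timestamp layout used in the quoted NameNode log lines.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds between two log timestamps. */
    static long elapsedMillis(String first, String last) {
        LocalDateTime start = LocalDateTime.parse(first, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(last, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // First "Stop Decommissioning" line and last "Invalidated" line above.
        long held = elapsedMillis("2016-05-26 20:11:41,697",
                                  "2016-05-26 20:18:17,064");
        long timeoutMillis = 360_000L; // value from the SocketTimeoutException
        System.out.println("span=" + held + " ms, exceeds timeout: "
                + (held > timeoutMillis));
    }
}
```

The span comes to a little over six and a half minutes, which is longer than 
the six-minute timeout reported by the HealthMonitor.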



