[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2019-04-04 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809801#comment-16809801
 ] 

Wei-Chiu Chuang edited comment on HDFS-10477 at 4/4/19 12:40 PM:
-

Findbugs warning unrelated, caused by a previous commit. Will file a Jira to 
clean it up (HDFS-14414).


was (Author: jojochuang):
Findbugs warning unrelated, caused by a previous commit. Will file a Jira to 
clean it up.

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, 
> HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.006.patch, 
> HDFS-10477.007.patch, HDFS-10477.branch-2.patch, HDFS-10477.patch
>
>
> In our cluster, when we stop decommissioning a rack which have 46 DataNodes, 
> it locked Namesystem for about 7 minutes as below log shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning
> 2016-05-26 20:13:17,185 

[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2019-04-03 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809235#comment-16809235
 ] 

Wei-Chiu Chuang edited comment on HDFS-10477 at 4/3/19 8:36 PM:


Here's the branch-2 patch.  It partially includes HDFS-13027, and I think it 
makes sense to backport HDFS-13027 to branch-2 also. 
[^HDFS-10477.branch-2.patch] 


was (Author: jojochuang):
Here's the branch-2 patch. It partially includes HDFS-13027, and I think it 
makes sense to backport HDFS-13027 to branch-2 also.

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, 
> HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.006.patch, 
> HDFS-10477.007.patch, HDFS-10477.branch-2.patch, HDFS-10477.patch
>
>
> In our cluster, when we stop decommissioning a rack which have 46 DataNodes, 
> it locked Namesystem for about 7 minutes as below log shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334 

[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2016-07-12 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373817#comment-15373817
 ] 

Arpit Agarwal edited comment on HDFS-10477 at 7/12/16 10:24 PM:


I second [~kihwal]'s concern about releasing and reacquiring. We have been 
recommending that _dfs.namenode.fslock.fair_ be set to false as it gives better 
overall RPC throughput. 

I am also not sure about releasing the lock during iteration (in 
_processExtraRedundancyBlocksOnReCommission_). I don't recall if the 
BlockIterator is guaranteed to be consistent if the FsNameSystem lock is 
released.

It would have been better for the refreshNodes RPC to start the refresh on a 
worker thread and complete the RPC call immediately. I assume we cannot change 
the RPC behavior now for backwards compatibility.


was (Author: arpitagarwal):
I second [~kihwal]'s concern about releasing and reacquiring. We have been 
recommending that _ dfs.namenode.fslock.fair_ be set to false as it gives 
better overall RPC throughput. 

I am also not sure about releasing the lock during iteration (in 
_processExtraRedundancyBlocksOnReCommission_). I don't recall if the 
BlockIterator is guaranteed to be consistent if the FsNameSystem lock is 
released.

It would have been better for the refreshNodes RPC to start the refresh on a 
worker thread and complete the RPC call immediately. I assume we cannot change 
the RPC behavior now for backwards compatibility.

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, 
> HDFS-10477.patch
>
>
> In our cluster, when we stop decommissioning a rack which have 46 DataNodes, 
> it locked Namesystem for about 7 minutes as below log shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 

[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2016-06-02 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313282#comment-15313282
 ] 

Benoy Antony edited comment on HDFS-10477 at 6/2/16 11:21 PM:
--

If its possible to release lock per storage, then that's better. 
If not , I prefer the first version which releases the lock per each datanode 
without the additional processing. 
The logs show that  the each node is processed in around 10 seconds. 


was (Author: benoyantony):
If its possible to release lock per storage, then that's better. 
If not , I prefer the first version which does releases the lock per each 
datanode without the additional processing. 
The logs show that  the each node is processed in around 10 seconds. 

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: HDFS-10477.002.patch, HDFS-10477.patch
>
>
> In our cluster, when we stop decommissioning a rack which have 46 DataNodes, 
> it locked Namesystem for about 7 minutes as below log shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334