[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809801#comment-16809801 ] Wei-Chiu Chuang edited comment on HDFS-10477 at 4/4/19 12:40 PM: - Findbugs warning unrelated, caused by a previous commit. Will file a Jira to clean it up (HDFS-14414). was (Author: jojochuang): Findbugs warning unrelated, caused by a previous commit. Will file a Jira to clean it up. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Major > Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.006.patch, > HDFS-10477.007.patch, HDFS-10477.branch-2.patch, HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning > 2016-05-26 20:13:17,185
[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809235#comment-16809235 ] Wei-Chiu Chuang edited comment on HDFS-10477 at 4/3/19 8:36 PM: Here's the branch-2 patch. It partially includes HDFS-13027, and I think it makes sense to backport HDFS-13027 to branch-2 also. [^HDFS-10477.branch-2.patch] was (Author: jojochuang): Here's the branch-2 patch. It partially includes HDFS-13027, and I think it makes sense to backport HDFS-13027 to branch-2 also. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Major > Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.006.patch, > HDFS-10477.007.patch, HDFS-10477.branch-2.patch, HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334
[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373817#comment-15373817 ] Arpit Agarwal edited comment on HDFS-10477 at 7/12/16 10:24 PM: I second [~kihwal]'s concern about releasing and reacquiring. We have been recommending that _dfs.namenode.fslock.fair_ be set to false as it gives better overall RPC throughput. I am also not sure about releasing the lock during iteration (in _processExtraRedundancyBlocksOnReCommission_). I don't recall if the BlockIterator is guaranteed to be consistent if the FsNameSystem lock is released. It would have been better for the refreshNodes RPC to start the refresh on a worker thread and complete the RPC call immediately. I assume we cannot change the RPC behavior now for backwards compatibility. was (Author: arpitagarwal): I second [~kihwal]'s concern about releasing and reacquiring. We have been recommending that _ dfs.namenode.fslock.fair_ be set to false as it gives better overall RPC throughput. I am also not sure about releasing the lock during iteration (in _processExtraRedundancyBlocksOnReCommission_). I don't recall if the BlockIterator is guaranteed to be consistent if the FsNameSystem lock is released. It would have been better for the refreshNodes RPC to start the refresh on a worker thread and complete the RPC call immediately. I assume we cannot change the RPC behavior now for backwards compatibility. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362
[jira] [Comment Edited] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313282#comment-15313282 ] Benoy Antony edited comment on HDFS-10477 at 6/2/16 11:21 PM: -- If its possible to release lock per storage, then that's better. If not , I prefer the first version which releases the lock per each datanode without the additional processing. The logs show that the each node is processed in around 10 seconds. was (Author: benoyantony): If its possible to release lock per storage, then that's better. If not , I prefer the first version which does releases the lock per each datanode without the additional processing. The logs show that the each node is processed in around 10 seconds. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10477.002.patch, HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334