[jira] [Commented] (HDFS-15451) Restarting name node stuck in safe mode when using provided storage
[ https://issues.apache.org/jira/browse/HDFS-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152470#comment-17152470 ]

Xiaoqiao He commented on HDFS-15451:

Cherry-picked to branch-3.3, branch-3.2 and branch-3.1. Thanks [~shanyu].

> Restarting name node stuck in safe mode when using provided storage
> ---
>
>                 Key: HDFS-15451
>                 URL: https://issues.apache.org/jira/browse/HDFS-15451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>             Fix For: 3.2.2, 3.3.1, 3.4.0, 3.1.5
>
>
> When HDFS provided storage is used (dfs.namenode.provided.enabled=true),
> restarting the name node sometimes leaves it stuck in safe mode.
> The problem is that the data node sends its block report to the name node
> successfully, but the name node does not process the report properly, so
> HDFS remains in safe mode due to missing blocks.
> In the name node log, this is the sequence of entries for a specific data
> node:
> {code}
> 2020-07-01 19:46:41,997 INFO blockmanagement.BlockReportLeaseManager: Registered DN af19d9e0-7b9b-45e0-9aa6-b2f404098084 (10.244.6.131:9866).
> 2020-07-01 19:46:42,012 DEBUG blockmanagement.BlockReportLeaseManager: Created a new BR lease 0x476aaae689ebbc01 for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084. numPending = 4
> 2020-07-01 19:46:42,340 INFO BlockStateChange: BLOCK* processReport 0xcc610f42d0218cd9: discarded non-initial block report from DatanodeRegistration(10.244.6.131:9866, datanodeUuid=af19d9e0-7b9b-45e0-9aa6-b2f404098084, infoPort=0, infoSecurePort=9865, ipcPort=9867, storageInfo=lv=-57;cid=CID-f49d3421-e04f-40b9-89ef-cf4fee73ad6a;nsid=497894240;c=1572548424451) because namenode still in startup phase
> 2020-07-01 19:46:42,648 WARN blockmanagement.BlockReportLeaseManager: BR lease 0x476aaae689ebbc01 is not valid for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084, because the DN is not in the pending set.
> {code}
> The root cause is that when BlockManager processes a report, it skips
> processing when storageInfo.getBlockReportCount() > 0 and removes the lease:
> {code}
> blockReportLeaseManager.removeLease(node)
> {code}
> This is because every data node reports a DS-PROVIDED storage along with
> its other storages (such as DISK storage). All DS-PROVIDED storages actually
> point to the same storageInfo, so the second data node that sends a block
> report with DS-PROVIDED sees blockReportCount > 0. The lease is then removed
> for that data node, and processing of later block reports from this node
> fails at checkLease() with the message "BR lease is not valid".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
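The failure sequence described in the issue can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not the Hadoop code: the class ProvidedStorageLeaseSketch, its nested types, the register() and lookup() helpers, and the storage IDs "DS-disk-1" and "DS-disk-2" are hypothetical names introduced for illustration. Only the ideas of a shared storageInfo for DS-PROVIDED, a block report count, removeLease() and checkLease() come from the report.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the reported behaviour; all names here are hypothetical. */
final class ProvidedStorageLeaseSketch {

  /** Stand-in for the per-storage state the name node keeps. */
  static final class StorageInfo {
    int blockReportCount = 0;
  }

  /** Stand-in for the lease manager: data nodes that still hold a lease. */
  static final class LeaseManager {
    private final Set<String> pending = new HashSet<>();
    void grantLease(String dn) { pending.add(dn); }
    void removeLease(String dn) { pending.remove(dn); }
    boolean checkLease(String dn) { return pending.contains(dn); }
  }

  private final LeaseManager leases = new LeaseManager();
  private final Map<String, StorageInfo> storages = new HashMap<>();

  /** A data node registers and receives a block report lease. */
  void register(String dn) {
    leases.grantLease(dn);
  }

  /** DS-PROVIDED maps to one shared StorageInfo; other storages are per data node. */
  private StorageInfo lookup(String dn, String storageId) {
    String key = storageId.equals("DS-PROVIDED") ? storageId : dn + "/" + storageId;
    return storages.computeIfAbsent(key, k -> new StorageInfo());
  }

  /** Returns true if the report was processed, false if it was rejected or discarded. */
  boolean processReport(String dn, String storageId) {
    if (!leases.checkLease(dn)) {
      return false; // corresponds to "BR lease is not valid"
    }
    StorageInfo info = lookup(dn, storageId);
    if (info.blockReportCount > 0) {
      // Treated as a non-initial report: discard it and drop the lease.
      leases.removeLease(dn);
      return false;
    }
    info.blockReportCount++;
    return true;
  }

  public static void main(String[] args) {
    ProvidedStorageLeaseSketch nn = new ProvidedStorageLeaseSketch();
    nn.register("dn1");
    nn.register("dn2");
    System.out.println("dn1 DS-PROVIDED: " + nn.processReport("dn1", "DS-PROVIDED"));
    System.out.println("dn1 DS-disk-1:   " + nn.processReport("dn1", "DS-disk-1"));
    System.out.println("dn2 DS-PROVIDED: " + nn.processReport("dn2", "DS-PROVIDED"));
    System.out.println("dn2 DS-disk-2:   " + nn.processReport("dn2", "DS-disk-2"));
  }
}
```

In this model the second data node loses its lease as soon as it reports the shared DS-PROVIDED storage, so its DISK report is rejected as well. That matches the log sequence quoted in the issue, where the blocks on that node stay unreported and the name node cannot leave safe mode.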
[jira] [Commented] (HDFS-15451) Restarting name node stuck in safe mode when using provided storage
[ https://issues.apache.org/jira/browse/HDFS-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152251#comment-17152251 ]

shanyu zhao commented on HDFS-15451:

Thank you [~hexiaoqiao] and [~virajith]! Is it possible to also backport it to branch-3.1, branch-3.2 and branch-3.3?

> Restarting name node stuck in safe mode when using provided storage
> ---
>
>                 Key: HDFS-15451
>                 URL: https://issues.apache.org/jira/browse/HDFS-15451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.1, 3.1.3
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>             Fix For: 3.4.0
[jira] [Commented] (HDFS-15451) Restarting name node stuck in safe mode when using provided storage
[ https://issues.apache.org/jira/browse/HDFS-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152124#comment-17152124 ]

Hudson commented on HDFS-15451:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18412 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/18412/])
HDFS-15451. Do not discard non-initial block report for provided (github: rev 834372f4040f1e7a00720da5c40407f9b1423b6d)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManager.java

> Restarting name node stuck in safe mode when using provided storage
> ---
>
>                 Key: HDFS-15451
>                 URL: https://issues.apache.org/jira/browse/HDFS-15451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.1, 3.1.3
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>             Fix For: 3.4.0
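The commit title in the Hudson message, "Do not discard non-initial block report for provided", suggests that the discard check now makes an exception for provided storage. The following Java sketch shows what such a guard could look like. It is an illustration under assumptions, not the committed patch: the class DiscardGuardSketch, the local StorageType enum, and both method names are hypothetical, and the startup-phase condition is inferred from the "because namenode still in startup phase" log line rather than taken from the source.

```java
/** Hypothetical sketch of the discard decision before and after the change. */
final class DiscardGuardSketch {

  /** Local stand-in for the storage type; not the Hadoop enum. */
  enum StorageType { DISK, SSD, PROVIDED }

  /** Behaviour described in the issue: any non-initial report is discarded during startup. */
  static boolean discardsBeforeFix(boolean inStartupPhase, StorageType type,
                                   int blockReportCount) {
    return inStartupPhase && blockReportCount > 0;
  }

  /** Assumed behaviour after the change: provided storage is exempt from the discard. */
  static boolean discardsAfterFix(boolean inStartupPhase, StorageType type,
                                  int blockReportCount) {
    return inStartupPhase
        && type != StorageType.PROVIDED
        && blockReportCount > 0;
  }

  public static void main(String[] args) {
    // Second data node reporting the shared provided storage during startup.
    System.out.println("before: "
        + discardsBeforeFix(true, StorageType.PROVIDED, 1));
    System.out.println("after:  "
        + discardsAfterFix(true, StorageType.PROVIDED, 1));
  }
}
```

With a guard of this shape, a report for the shared provided storage would no longer trigger removeLease() for the reporting data node, while repeated reports for ordinary storages such as DISK would still be discarded during startup.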
[jira] [Commented] (HDFS-15451) Restarting name node stuck in safe mode when using provided storage
[ https://issues.apache.org/jira/browse/HDFS-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150312#comment-17150312 ]

Virajith Jalaparti commented on HDFS-15451:
---

Thanks for finding/fixing this [~shanyu].

> Restarting name node stuck in safe mode when using provided storage
> ---
>
>                 Key: HDFS-15451
>                 URL: https://issues.apache.org/jira/browse/HDFS-15451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.1, 3.1.3
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
[jira] [Commented] (HDFS-15451) Restarting name node stuck in safe mode when using provided storage
[ https://issues.apache.org/jira/browse/HDFS-15451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149861#comment-17149861 ]

shanyu zhao commented on HDFS-15451:

Pull request submitted: https://github.com/apache/hadoop/pull/2119

> Restarting name node stuck in safe mode when using provided storage
> ---
>
>                 Key: HDFS-15451
>                 URL: https://issues.apache.org/jira/browse/HDFS-15451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.2.1, 3.1.3
>            Reporter: shanyu zhao
>            Priority: Major