[ 
https://issues.apache.org/jira/browse/HDFS-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140597#comment-17140597
 ] 

Kihwal Lee commented on HDFS-15421:
-----------------------------------

This is an example of "stuck safe mode" from one of our small test clusters:
{noformat}
The reported blocks 3045352 needs additional 14058 blocks to reach the threshold 1.0000 of total blocks 3059410.
The minimum number of live datanodes is not required. Safe mode will be turned off automatically once the thresholds have been reached.
2020-06-11 18:35:19,863 [Block report processor] INFO hdfs.StateChange: STATE* Safe mode extension entered.
The reported blocks 3059410 has reached the threshold 1.0000 of total blocks 3059410. The minimum number of live datanodes is not required. In safe mode extension. Safe mode will be turned off automatically in 30 seconds.
2020-06-11 18:35:25,036 [Edit log tailer] INFO namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@259766e0 expecting start txid #3427451497
2020-06-11 18:35:25,036 [Edit log tailer] INFO namenode.FSImage: Start loading edits file xxx
2020-06-11 18:35:25,036 [Edit log tailer] INFO namenode.RedundantEditLogInputStream: Fast-forwarding stream 'xxx' to transaction ID 3427451497
2020-06-11 18:35:25,060 [Edit log tailer] INFO namenode.FSImage: Loaded 1 edits file(s) (the last named xxx of total size 19024.0, total edits 124.0, total load time 25.0 ms
2020-06-11 18:35:39,868 [org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerSafeMode$SafeModeMonitor@6d4a65c6] INFO hdfs.StateChange: STATE* Safe mode ON, in safe mode extension.
The reported blocks 3059416 needs additional 1 blocks to reach the threshold 1.0000 of total blocks 3059417.
The minimum number of live datanodes is not required. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
2020-06-11 18:35:59,873 [org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerSafeMode$SafeModeMonitor@6d4a65c6] INFO hdfs.StateChange: STATE* Safe mode ON, thresholds not met.
The reported blocks 3059416 needs additional 1 blocks to reach the threshold 1.0000 of total blocks 3059417.
The minimum number of live datanodes is not required. In safe mode extension. Safe mode will be turned off automatically in -10 seconds.
2020-06-11 18:36:19,880 [org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerSafeMode$SafeModeMonitor@6d4a65c6] INFO hdfs.StateChange: STATE* Safe mode ON, thresholds not met.
The reported blocks 3059416 needs additional 1 blocks to reach the threshold 1.0000 of total blocks 3059417.
The minimum number of live datanodes is not required. In safe mode extension. Safe mode will be turned off automatically in -30 seconds.
2020-06-11 18:36:39,888 [org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerSafeMode$SafeModeMonitor@6d4a65c6] INFO hdfs.StateChange: STATE* Safe mode ON, thresholds not met.
The reported blocks 3059416 needs additional 1 blocks to reach the threshold 1.0000 of total blocks 3059417.
{noformat}

The time remaining in the extension keeps going further negative, and the 
number of additionally required blocks increases as more IBRs leak.  You can 
force the namenode out of safe mode, but the leak continues until an HA 
transition.
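
As a rough aid for spotting this symptom in a namenode log, here is a minimal Python sketch. It relies only on the message wording visible in the excerpt above; the function name `looks_stuck` and the "countdown has gone negative while blocks are still needed" rule are assumptions made for illustration, not anything HDFS provides.

```python
import re

# Patterns follow the wording of the safe mode status text in the excerpt.
# \s+ is used between words so that hard-wrapped log text still matches.
NEEDED = re.compile(r"needs\s+additional\s+(\d+)\s+blocks")
REMAINING = re.compile(
    r"turned\s+off\s+automatically\s+in\s+(-?\d+)\s+seconds")


def looks_stuck(log_text):
    """Heuristic: True if the extension countdown went negative while the
    most recent status line still reports missing blocks."""
    needed = [int(n) for n in NEEDED.findall(log_text)]
    remaining = [int(s) for s in REMAINING.findall(log_text)]
    still_missing = bool(needed) and needed[-1] > 0
    countdown_negative = any(s < 0 for s in remaining)
    return still_missing and countdown_negative
```

Feeding it the text of the excerpt should flag the condition at the first status line that reports a negative countdown; a log where the threshold is reached and the countdown stays positive is not flagged.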

> IBR leak causes standby NN to be stuck in safe mode
> ---------------------------------------------------
>
>                 Key: HDFS-15421
>                 URL: https://issues.apache.org/jira/browse/HDFS-15421
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> After HDFS-14941, the update of the global gen stamp is delayed in certain 
> situations.  This makes the last set of incremental block reports (IBRs) from 
> an append appear to be "from the future", so they are simply re-queued to the 
> pending DN message queue rather than processed to complete the block.  The 
> last set of IBRs leaks and is never cleaned up until the namenode transitions 
> to active.  The size of {{pendingDNMessages}} grows constantly until then.
> If a leak happens during startup safe mode, the namenode will never be 
> able to come out of safe mode on its own.
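
The re-queueing behavior described in the issue can be pictured with a toy model. This is a simplified Python sketch, not the actual namenode code: the class `StandbyModel`, its methods, and the plain integer gen stamps are all hypothetical, and only the "report newer than the global gen stamp goes back on the queue" rule is taken from the description.

```python
from collections import deque


class StandbyModel:
    """Toy stand-in for a standby namenode (hypothetical, for illustration)."""

    def __init__(self, global_gen_stamp):
        self.global_gen_stamp = global_gen_stamp
        self.pending = deque()   # plays the role of pendingDNMessages
        self.completed = []      # blocks whose reports were processed

    def receive(self, block_id, gen_stamp):
        """Queue an incremental block report for later processing."""
        self.pending.append((block_id, gen_stamp))

    def process_pending(self):
        """Make one pass over the queue as it stood when the pass began."""
        for _ in range(len(self.pending)):
            block_id, gen_stamp = self.pending.popleft()
            if gen_stamp > self.global_gen_stamp:
                # Report looks like it is "from the future": re-queue it.
                self.pending.append((block_id, gen_stamp))
            else:
                self.completed.append(block_id)
```

In this model, as long as `global_gen_stamp` lags behind the gen stamp carried by the reports, every pass leaves them on the queue and each new report adds to it, which mirrors the growth of {{pendingDNMessages}} described above.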



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
