shanyu zhao created HDFS-15451:
----------------------------------
Summary: Restarting name node stuck in safe mode when using
provided storage
Key: HDFS-15451
URL: https://issues.apache.org/jira/browse/HDFS-15451
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.1.3, 3.2.1
Reporter: shanyu zhao
When HDFS provided storage is used (dfs.namenode.provided.enabled=true),
restarting the name node sometimes leaves it stuck in safe mode.
The data nodes send their block reports to the name node successfully, but the
name node does not process the reports properly, so HDFS remains in safe mode
due to missing blocks.
The name node log shows the following sequence for one specific data node:
{code}
2020-07-01 19:46:41,997 INFO blockmanagement.BlockReportLeaseManager:
Registered DN af19d9e0-7b9b-45e0-9aa6-b2f404098084 (10.244.6.131:9866).
2020-07-01 19:46:42,012 DEBUG blockmanagement.BlockReportLeaseManager: Created
a new BR lease 0x476aaae689ebbc01 for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084.
numPending = 4
2020-07-01 19:46:42,340 INFO BlockStateChange: BLOCK* processReport
0xcc610f42d0218cd9: discarded non-initial block report from
DatanodeRegistration(10.244.6.131:9866,
datanodeUuid=af19d9e0-7b9b-45e0-9aa6-b2f404098084, infoPort=0,
infoSecurePort=9865, ipcPort=9867,
storageInfo=lv=-57;cid=CID-f49d3421-e04f-40b9-89ef-cf4fee73ad6a;nsid=497894240;c=1572548424451)
because namenode still in startup phase
2020-07-01 19:46:42,648 WARN blockmanagement.BlockReportLeaseManager: BR lease
0x476aaae689ebbc01 is not valid for DN af19d9e0-7b9b-45e0-9aa6-b2f404098084,
because the DN is not in the pending set.
{code}
The root cause is that when BlockManager processes a block report, it skips
the report if storageInfo.getBlockReportCount() > 0 and removes the lease:
{code}
blockReportLeaseManager.removeLease(node)
{code}
This happens because every data node reports a DS-PROVIDED storage along with
its other storages (such as DISK storage). All DS-PROVIDED storages actually
point to the same storageInfo, so the second data node that sends a block
report containing DS-PROVIDED sees blockReportCount > 0. The lease for that
data node is then removed, and any later block report from the node fails at
checkLease() with the message "BR lease is not valid".
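
The interaction described above can be illustrated with a small, self-contained
toy model. This is only a sketch of the described behaviour, not the actual
Hadoop code: the classes ProvidedLeaseSketch, StorageInfo, LeaseManager and
NameNode, the data node names, and the order in which storages are reported are
all assumptions made for illustration. Only the shared DS-PROVIDED storageInfo,
the blockReportCount > 0 check, removeLease() and checkLease() come from the
description. The normal release of a lease after a fully processed report is
omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the behaviour described in this issue; NOT the real BlockManager.
final class ProvidedLeaseSketch {

    // Hypothetical stand-in for a storage info object with a report counter.
    static final class StorageInfo {
        int blockReportCount = 0;
    }

    // Hypothetical stand-in for the lease manager: the set of DNs holding a lease.
    static final class LeaseManager {
        private final Set<String> pending = new HashSet<>();
        void grantLease(String dn) { pending.add(dn); }
        void removeLease(String dn) { pending.remove(dn); }
        boolean checkLease(String dn) { return pending.contains(dn); }
    }

    static final class NameNode {
        final LeaseManager leases = new LeaseManager();
        // Per-data-node storages (for example DISK) each get their own StorageInfo.
        final Map<String, StorageInfo> localStorages = new HashMap<>();
        // Assumption from the issue text: every DS-PROVIDED storage maps to one object.
        final StorageInfo sharedProvided = new StorageInfo();

        StorageInfo lookup(String dn, String storageId) {
            if (storageId.equals("DS-PROVIDED")) {
                return sharedProvided;
            }
            return localStorages.computeIfAbsent(dn + "/" + storageId,
                    key -> new StorageInfo());
        }

        // Returns true if the report for this storage was actually processed.
        boolean processReport(String dn, String storageId) {
            if (!leases.checkLease(dn)) {
                return false; // "BR lease is not valid"
            }
            StorageInfo info = lookup(dn, storageId);
            if (info.blockReportCount > 0) {
                // Treated as a non-initial report: discard it and drop the lease.
                leases.removeLease(dn);
                return false;
            }
            info.blockReportCount++;
            return true;
        }
    }

    // Two data nodes each report DS-PROVIDED first, then a DISK storage.
    static List<Boolean> simulate() {
        NameNode nn = new NameNode();
        List<Boolean> processed = new ArrayList<>();
        for (String dn : new String[] {"dn1", "dn2"}) {
            nn.leases.grantLease(dn);
            processed.add(nn.processReport(dn, "DS-PROVIDED"));
            processed.add(nn.processReport(dn, "DISK"));
        }
        return processed;
    }

    public static void main(String[] args) {
        // Prints [true, true, false, false]
        System.out.println(simulate());
    }
}
```

In this model the first data node's reports are both processed, while the
second data node's DS-PROVIDED report is discarded because the shared counter
is already 1. Its lease is removed, so its DISK report is rejected too, which
mirrors the "discarded non-initial block report" and "BR lease ... is not
valid" lines in the log above.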
--
This message was sent by Atlassian Jira
(v8.3.4#803005)