HuangTao created HDFS-15274:
-------------------------------
Summary: NN doesn't remove the blocks from the failed
DatanodeStorageInfo
Key: HDFS-15274
URL: https://issues.apache.org/jira/browse/HDFS-15274
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Reporter: HuangTao
Assignee: HuangTao
Fix For: 3.4.0
In our federation cluster, we found that two namespaces disagreed about which
volumes had failed. The logs below come from the two namespaces (NS1 and NS2)
respectively.
NS1 received the failed storage info and removed the blocks associated with the
failed storage.
{code:java}
[INFO] [IPC Server handler 76 on 8021] : Number of failed storages changes from 0 to 1
[INFO] [IPC Server handler 76 on 8021] : [DISK]DS-298de29e-9104-48dd-a674-5443a6126969:NORMAL:X.X.X.X:50010:/data0/dfs failed.
[INFO] [org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager$Monitor@4fb57fb3] : Removed blocks associated with storage [DISK]DS-298de29e-9104-48dd-a674-5443a6126969:FAILED:X.X.X.X:50010:/data0/dfs from DataNode X.X.X.X:50010
[INFO] [IPC Server handler 73 on 8021] : Removed storage [DISK]DS-298de29e-9104-48dd-a674-5443a6126969:FAILED:X.X.X.X:50010:/data0/dfs from DataNode X.X.X.X:50010{code}
NS2 only logged the change in the failed storage count; it never marked the
storage as failed or removed its blocks.
{code:java}
[INFO] [IPC Server handler 87 on 8021] : Number of failed storages changes from 0 to 1{code}
After digging into the code and simulating a disk failure with
{code:bash}
echo offline > /sys/block/sda/device/state
echo 1 > /sys/block/sda/device/delete
# re-mount the failed disk
rescan-scsi-bus.sh -a
systemctl daemon-reload
mount /data0
{code}
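I found that the root cause is an inconsistency between the StorageReport array
and the VolumeFailureSummary in BPServiceActor#sendHeartBeat. The two are read
from the dataset at different moments, so a disk that fails in between can still
appear in the storage reports while already being counted as a failed volume.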
{code}
StorageReport[] reports =
    dn.getFSDataset().getStorageReports(bpos.getBlockPoolId());
......
// The disk may fail here, after the storage reports were collected but
// before the failure summary is read, so the two can disagree.
VolumeFailureSummary volumeFailureSummary = dn.getFSDataset()
    .getVolumeFailureSummary();
int numFailedVolumes = volumeFailureSummary != null ?
    volumeFailureSummary.getFailedStorageLocations().length : 0;
{code}
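To make the timing window concrete, here is a minimal standalone sketch of the
same pattern. It does not use any Hadoop classes: FakeDataset, Heartbeat and
buildHeartbeat are hypothetical stand-ins, and the Runnable only exists to
force a failure between the two reads.
{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Hadoop.
class HeartbeatRaceSketch {

  /** Tracks volume health, standing in for the DataNode's dataset. */
  static class FakeDataset {
    private final Map<String, Boolean> healthy = new LinkedHashMap<>();

    FakeDataset(String... storageIds) {
      for (String id : storageIds) {
        healthy.put(id, true);
      }
    }

    void failVolume(String id) {
      healthy.put(id, false);
    }

    /** Storage IDs of the volumes that are healthy at the time of the call. */
    List<String> getStorageReports() {
      List<String> reports = new ArrayList<>();
      for (Map.Entry<String, Boolean> e : healthy.entrySet()) {
        if (e.getValue()) {
          reports.add(e.getKey());
        }
      }
      return reports;
    }

    /** Number of volumes that have failed at the time of the call. */
    int getFailedVolumeCount() {
      int failed = 0;
      for (boolean ok : healthy.values()) {
        if (!ok) {
          failed++;
        }
      }
      return failed;
    }
  }

  /** The two values that travel together in one heartbeat. */
  static class Heartbeat {
    final List<String> reports;
    final int numFailedVolumes;

    Heartbeat(List<String> reports, int numFailedVolumes) {
      this.reports = reports;
      this.numFailedVolumes = numFailedVolumes;
    }
  }

  /** Reads the two values one after the other, as the excerpt above does. */
  static Heartbeat buildHeartbeat(FakeDataset ds, Runnable betweenReads) {
    List<String> reports = ds.getStorageReports();
    betweenReads.run(); // a disk failure can land exactly here
    int numFailedVolumes = ds.getFailedVolumeCount();
    return new Heartbeat(reports, numFailedVolumes);
  }

  public static void main(String[] args) {
    FakeDataset ds = new FakeDataset("DS-1", "DS-2");
    Heartbeat hb = buildHeartbeat(ds, () -> ds.failVolume("DS-1"));
    // DS-1 is still reported even though one volume is counted as failed.
    System.out.println("reports=" + hb.reports
        + " numFailedVolumes=" + hb.numFailedVolumes);
  }
}
{code}
In this sketch the heartbeat carries two storage reports together with a failed
volume count of 1, which is the kind of mismatch a receiver has to cope with.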
To solve this issue, I made DatanodeDescriptor#updateStorageStats on the NN
tolerant of this inconsistency.
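As a rough illustration of what such tolerance can look like, the sketch below
re-checks for failed storages not only when the failed volume count increases
but also whenever the receiver knows more storages than the heartbeat reports.
This is an assumption made for illustration, not the actual patch: the class
StorageFailureToleranceSketch, its fields and onHeartbeat are hypothetical, and
the real NN marks the storage FAILED and removes its blocks before pruning it
rather than dropping it immediately.
{code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not DatanodeDescriptor.
class StorageFailureToleranceSketch {
  private final Set<String> knownStorages = new TreeSet<>();
  private int lastFailedVolumes = 0;

  StorageFailureToleranceSketch(Collection<String> storageIds) {
    knownStorages.addAll(storageIds);
  }

  /**
   * Processes one heartbeat and returns the storage IDs that are now
   * treated as failed because they are known but no longer reported.
   */
  Set<String> onHeartbeat(Set<String> reportedStorages, int failedVolumes) {
    // Checking only on a count increase misses the case where the count
    // arrived one heartbeat before the storage dropped out of the reports.
    boolean countIncreased = failedVolumes > lastFailedVolumes;
    boolean moreKnownThanReported =
        knownStorages.size() > reportedStorages.size();
    lastFailedVolumes = failedVolumes;

    if (!countIncreased && !moreKnownThanReported) {
      return Collections.emptySet();
    }
    Set<String> failed = new TreeSet<>(knownStorages);
    failed.removeAll(reportedStorages);
    knownStorages.removeAll(failed);
    return failed;
  }

  public static void main(String[] args) {
    StorageFailureToleranceSketch nn =
        new StorageFailureToleranceSketch(Set.of("DS-1", "DS-2"));
    // Inconsistent heartbeat: count is already 1, DS-1 is still reported.
    System.out.println(nn.onHeartbeat(Set.of("DS-1", "DS-2"), 1));
    // Next heartbeat: count is unchanged, DS-1 is gone from the reports.
    System.out.println(nn.onHeartbeat(Set.of("DS-2"), 1));
  }
}
{code}
With only the count comparison, the second heartbeat would be skipped because
the count did not change, and DS-1 would never be detected as failed. The size
comparison lets the second heartbeat catch it.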