[jira] [Comment Edited] (HDFS-15274) NN doesn't remove the blocks from the failed DatanodeStorageInfo

Shilun Fan (Jira) Thu, 04 Jan 2024 00:14:18 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802420#comment-17802420
 ]


Shilun Fan edited comment on HDFS-15274 at 1/4/24 8:13 AM:
-----------------------------------------------------------

Bulk update: moved all 3.4.0 non-blocker issues; please move this one back if 
it is a blocker. Retargeted to 3.5.0.


was (Author: slfan1989):
Updated the target version in preparation for the 3.4.0 release.

> NN doesn't remove the blocks from the failed DatanodeStorageInfo
> ----------------------------------------------------------------
>
>                 Key: HDFS-15274
>                 URL: https://issues.apache.org/jira/browse/HDFS-15274
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.0
>            Reporter: HuangTao
>            Assignee: HuangTao
>            Priority: Major
>         Attachments: HDFS-15274.001.patch, HDFS-15274.002.patch
>
>
> In our federation cluster, we found that two namespaces disagreed about 
> which volumes had failed. The following logs come from the two namespaces 
> (NS), shown separately.
> NS1 received the failed storage info and removed the blocks associated with 
> the failed storage.
> {code:java}
> [INFO] [IPC Server handler 76 on 8021] : Number of failed storages changes 
> from 0 to 1
> [INFO] [IPC Server handler 76 on 8021] : 
> [DISK]DS-298de29e-9104-48dd-a674-5443a6126969:NORMAL:X.X.X.X:50010:/data0/dfs 
> failed.
> [INFO] 
> [org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager$Monitor@4fb57fb3]
>  : Removed blocks associated with storage 
> [DISK]DS-298de29e-9104-48dd-a674-5443a6126969:FAILED:X.X.X.X:50010:/data0/dfs 
> from DataNode X.X.X.X:50010
> [INFO] [IPC Server handler 73 on 8021] : Removed storage 
> [DISK]DS-298de29e-9104-48dd-a674-5443a6126969:FAILED:X.X.X.X:50010:/data0/dfs 
> from DataNode X.X.X.X:50010{code}
> NS2 only received the failed storage info; it did not remove the associated 
> blocks.
> {code:java}
> [INFO] [IPC Server handler 87 on 8021] : Number of failed storages changes 
> from 0 to 1  {code}
>  
> After digging into the code and simulating a disk failure with
> {code:java}
> echo offline > /sys/block/sda/device/state
> echo 1 > /sys/block/sda/device/delete
> # re-mount the failed disk
> rescan-scsi-bus.sh -a
> systemctl daemon-reload
> mount /data0
> {code}
> I found that the root cause is an inconsistency between StorageReport and 
> VolumeFailureSummary in BPServiceActor#sendHeartBeat.
> {code}
> StorageReport[] reports =
>         dn.getFSDataset().getStorageReports(bpos.getBlockPoolId());
>   ......
>   // the disk may fail before the next line executes
>     VolumeFailureSummary volumeFailureSummary = dn.getFSDataset()
>         .getVolumeFailureSummary();
>     int numFailedVolumes = volumeFailureSummary != null ?
>         volumeFailureSummary.getFailedStorageLocations().length : 0;
> {code} 
> To solve this issue, I made DatanodeDescriptor#updateStorageStats on the NN 
> more tolerant of this inconsistency.
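
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code from the attached patches: the class HeartbeatSketch, its fields, and the "tolerant" rule are all hypothetical, and the real DatanodeDescriptor#updateStorageStats has a different signature and more state. The model assumes the NN only looks for storages missing from the reports when the failed-volume count rises, so a heartbeat whose StorageReport list is stale but whose failure count is already 1 consumes that one chance.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of NN-side heartbeat handling; NOT the real DatanodeDescriptor. */
class HeartbeatSketch {
  final Set<String> healthy = new HashSet<>(); // storages the NN believes are usable
  final Set<String> failed = new HashSet<>();  // storages the NN has marked failed
  int volumeFailures = 0;                      // last failed-volume count seen
  final boolean tolerant;                      // hypothetical switch for the assumed fix

  HeartbeatSketch(boolean tolerant, String... storageIds) {
    this.tolerant = tolerant;
    healthy.addAll(List.of(storageIds));
  }

  /** Processes one heartbeat: reported storage IDs plus the failed-volume count. */
  void updateStorageStats(List<String> reportedIds, int volFailures) {
    // Assumed original rule: only look for failed storages when the count rises.
    boolean counterRose = volFailures > volumeFailures;
    volumeFailures = volFailures;

    // Storages the NN knows as healthy but that this heartbeat did not report.
    Set<String> missing = new HashSet<>(healthy);
    missing.removeAll(reportedIds);

    // Assumed tolerant rule: also act whenever a known storage is missing,
    // even if the failure count did not change in this heartbeat.
    if (counterRose || (tolerant && !missing.isEmpty())) {
      healthy.removeAll(missing);
      failed.addAll(missing);
    }
  }

  /** Replays the racy sequence and returns the storages the NN marked failed. */
  static Set<String> replayRacyHeartbeats(boolean tolerant) {
    HeartbeatSketch nn = new HeartbeatSketch(tolerant, "DS-1", "DS-2");
    // Heartbeat 1: the disk fails between the two DN calls, so the reports
    // still list DS-1 while the failure count is already 1.
    nn.updateStorageStats(List.of("DS-1", "DS-2"), 1);
    // Heartbeat 2: now consistent, but the count no longer rises.
    nn.updateStorageStats(List.of("DS-2"), 1);
    return nn.failed;
  }

  public static void main(String[] args) {
    System.out.println("strict:   " + replayRacyHeartbeats(false));
    System.out.println("tolerant: " + replayRacyHeartbeats(true));
  }
}
```

In this model the strict variant never marks DS-1 as failed, which matches the NS2 symptom, while the tolerant variant catches it on the second heartbeat. Checking for missing storages on every heartbeat has a cost, so a real fix would likely gate the check on a cheaper condition.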



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

