[
https://issues.apache.org/jira/browse/HDFS-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721707#comment-17721707
]
farmmamba commented on HDFS-17003:
----------------------------------
1. Destroy d1 and d2 manually.
2. Read this EC file to trigger reconstruction right away. After reconstruction,
the new data blocks are d1' and d2'.
3. The datanodes holding d1' and d2' send IBRs (incremental block reports) to the
NameNode. When the NameNode receives the last IBR, it executes the
invalidateCorruptReplicas method in addStoredBlock.
4. invalidateCorruptReplicas uses the block ID from the last IBR to
invalidate blocks. For example, if it uses the block ID of d1', it
sends an invalidate command for d1 to the datanodes holding d1 and d2. Because
d2 does not match the block ID of d1, the corrupt d2 is not deleted.
5. At the next FBR (full block report) from the datanode with d2, both d2 and d2'
exist. d2' may then be deleted by mistake.
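
To make step 4 concrete, here is a minimal, self-contained sketch of the mismatch. It is not the actual BlockManager code: the class InvalidateSketch, both method names, and the map from datanode to internal block ID are hypothetical and only model the behaviour described above. The block IDs and datanode names are taken from the logs in the issue description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Hadoop code: each corrupt replica is recorded as
// (datanode -> internal block ID it actually holds).
public class InvalidateSketch {

    // Behaviour described in step 4: every datanode with a corrupt replica is
    // told to delete the block ID carried by the last IBR.
    public static List<String> invalidateWithReportedId(
            Map<String, Long> corruptByNode, long reportedId) {
        List<String> commands = new ArrayList<>();
        for (String node : corruptByNode.keySet()) {
            commands.add(node + ":" + reportedId);
        }
        return commands;
    }

    // One possible correction (an assumption, not necessarily the merged fix):
    // each datanode is told to delete the internal block it actually holds.
    public static List<String> invalidatePerNode(Map<String, Long> corruptByNode) {
        List<String> commands = new ArrayList<>();
        for (Map.Entry<String, Long> e : corruptByNode.entrySet()) {
            commands.add(e.getKey() + ":" + e.getValue());
        }
        return commands;
    }

    public static void main(String[] args) {
        Map<String, Long> corrupt = new LinkedHashMap<>();
        corrupt.put("datanode1", -9223372036848404320L);
        corrupt.put("datanode2", -9223372036848404319L);

        // Assume the last IBR carried the internal block ID ending in ...319.
        long lastReported = -9223372036848404319L;

        System.out.println(invalidateWithReportedId(corrupt, lastReported));
        System.out.println(invalidatePerNode(corrupt));
    }
}
```

In the first variant datanode1 is asked to delete the block ending in ...319, which it does not hold. That is consistent with the "ReplicaInfo not found" message in the datanode1 log, while the corrupt block ending in ...320 stays on disk.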
> Erasure coding: invalidate wrong block after reporting bad blocks from
> datanode
> -------------------------------------------------------------------------------
>
> Key: HDFS-17003
> URL: https://issues.apache.org/jira/browse/HDFS-17003
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: farmmamba
> Priority: Critical
> Labels: pull-request-available
>
> After receiving a reportBadBlocks RPC from a datanode, the NameNode computes
> the wrong block to invalidate. This is dangerous behaviour and may cause data loss.
> Some logs from our production cluster are shown below:
>
> NameNode log:
> {code:java}
> 2023-05-08 21:23:49,112 INFO org.apache.hadoop.hdfs.StateChange: *DIR*
> reportBadBlocks for block:
> BP-932824627-xxxx-1680179358678:blk_-9223372036848404320_1471186 on datanode:
> datanode1:50010
> 2023-05-08 21:23:49,183 INFO org.apache.hadoop.hdfs.StateChange: *DIR*
> reportBadBlocks for block:
> BP-932824627-xxxx-1680179358678:blk_-9223372036848404319_1471186 on datanode:
> datanode2:50010{code}
> datanode1 log:
> {code:java}
> 2023-05-08 21:23:49,088 WARN
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad
> BP-932824627-xxxx-1680179358678:blk_-9223372036848404320_1471186 on
> /data7/hadoop/hdfs/datanode
> 2023-05-08 21:24:00,509 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed
> to delete replica blk_-9223372036848404319_1471186: ReplicaInfo not
> found.{code}
>
> This phenomenon can be reproduced.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)