[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

ASF GitHub Bot (Jira) Sun, 21 Jan 2024 23:32:04 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809270#comment-17809270
 ]


ASF GitHub Bot commented on HDFS-17342:
---------------------------------------

haiyang1987 commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1903406926

   > > This is a bug fix following #5564. Do you have time to help review this?
   > 
   > @smarthanwang I have a question about 
[HDFS-16985](https://issues.apache.org/jira/browse/HDFS-16985). Normally, a 
FileNotFoundException means that the meta file or data file may be lost, so the 
replica on this DataNode may be corrupt, right? In your business situation (AWS 
EC2 + EBS), you don't expect the DataNode to delete this replica directly, so 
[HDFS-16985](https://issues.apache.org/jira/browse/HDFS-16985) just removes the 
replica from the memory of the DN.
   > 
   > But I would like the DN to delete this corrupt replica directly if 
it can confirm that the replica is corrupt, for example when the meta file or 
data file is lost. So we can add a configuration option to control whether the 
DN deletes this replica from disk directly, such as 
`fs.datanode.delete.corrupt.replica.from.disk` with a default value of true.
   > 
   > If `fs.datanode.delete.corrupt.replica.from.disk` is true, the DN deletes 
this corrupt replica from disk directly. If 
`fs.datanode.delete.corrupt.replica.from.disk` is false, the DN only removes 
this corrupt replica from memory.
   > 
   > @smarthanwang @zhangshuyan0 looking forward to your good ideas.
   
   Thanks @ZanderXu for your comment.
   I agree with adding a new parameter to control whether this scenario 
requires deleting the replica from disk.
   From the DataNode side, if it is confirmed that the replica does not exist 
(the meta file or data file is lost), it seems reasonable that the residual 
meta file or data file should also be deleted from disk.
   Thanks~
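
   For illustration only, the proposed switch might look like the following 
Java sketch. The key name comes from the comment above and is a proposal, not 
an existing Hadoop setting. `CorruptReplicaHandler`, its in-memory map, and the 
plain `Map` standing in for the configuration are hypothetical and are not 
Hadoop classes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these types are Hadoop classes. */
final class CorruptReplicaHandler {
    // Key name as proposed in the thread; not an existing Hadoop configuration key.
    static final String DELETE_FROM_DISK_KEY =
        "fs.datanode.delete.corrupt.replica.from.disk";
    static final boolean DELETE_FROM_DISK_DEFAULT = true;

    // Stand-in for the DataNode's in-memory replica map: block id -> {data file, meta file}.
    private final Map<Long, Path[]> replicaMap = new ConcurrentHashMap<>();
    private final boolean deleteFromDisk;

    CorruptReplicaHandler(Map<String, String> conf) {
        String value = conf.get(DELETE_FROM_DISK_KEY);
        this.deleteFromDisk =
            value == null ? DELETE_FROM_DISK_DEFAULT : Boolean.parseBoolean(value);
    }

    void addReplica(long blockId, Path dataFile, Path metaFile) {
        replicaMap.put(blockId, new Path[] {dataFile, metaFile});
    }

    /** Drops a replica that the caller has already confirmed to be corrupt. */
    void removeCorruptReplica(long blockId) {
        Path[] files = replicaMap.remove(blockId); // always removed from memory
        if (files == null || !deleteFromDisk) {
            return; // flag is false: leave any residual files on disk
        }
        for (Path file : files) {
            try {
                Files.deleteIfExists(file); // flag is true: clean up residual files
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Demo: data file present, meta file lost. Returns whether the data file survives. */
    static boolean residualFileSurvives(boolean deleteFromDisk) {
        try {
            Path dir = Files.createTempDirectory("replica");
            Path data = Files.write(dir.resolve("blk_1"), new byte[] {1});
            Path meta = dir.resolve("blk_1_1001.meta"); // never created: simulates a lost meta file
            CorruptReplicaHandler handler = new CorruptReplicaHandler(
                Collections.singletonMap(DELETE_FROM_DISK_KEY, String.valueOf(deleteFromDisk)));
            handler.addReplica(1L, data, meta);
            handler.removeCorruptReplica(1L);
            return Files.exists(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

   With the flag set to true the residual data file is removed along with the 
in-memory entry; with false only the in-memory entry is dropped.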
   
   




> Fix DataNode may invalidates normal block causing missing block
> ---------------------------------------------------------------
>
>                 Key: HDFS-17342
>                 URL: https://issues.apache.org/jira/browse/HDFS-17342
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> When users read a file that is being appended, occasional exceptions may 
> occur, such as 
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is 
> finalizing it at the same time.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], and the data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZE directory. The data file is moved from 
> /XXX/rbw/blk_xxx to /XXX/finalize/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx no longer exists at this moment.
> # The reader thread treats this block as corrupt and removes the replica 
> from the volume map, and the DataNode reports the deleted block to the 
> NameNode.
> # The NameNode removes this replica for the block.
> # If the file's replication factor is 1, the file will have a missing block 
> until this DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected here, because the file has been moved rather than lost.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file exists, to avoid similar cases.
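
The race and the proposed double check can be sketched as follows. This is a 
minimal, sequential replay of steps 1 to 3 in plain Java, not the code from the 
pull request: `InvalidateDoubleCheck`, `Replica`, the static `volumeMap`, and 
the file names are assumptions made for illustration, and only the method name 
`invalidateMissingBlock` is taken from the description.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; these are not Hadoop classes. */
final class InvalidateDoubleCheck {
    /** Current on-disk location of one replica's data and meta files. */
    static final class Replica {
        final Path data;
        final Path meta;
        Replica(Path data, Path meta) { this.data = data; this.meta = meta; }
    }

    // Stand-in for the DataNode volume map: block id -> replica.
    static final Map<Long, Replica> volumeMap = new ConcurrentHashMap<>();

    /** Writer: move both files into finalizedDir, then publish the new location. */
    static void finalizeBlock(long id, Path finalizedDir) throws IOException {
        Replica r = volumeMap.get(id);
        Path data = Files.move(r.data, finalizedDir.resolve(r.data.getFileName()));
        Path meta = Files.move(r.meta, finalizedDir.resolve(r.meta.getFileName()));
        volumeMap.put(id, new Replica(data, meta));
    }

    /** Reader: returns true if opening a previously obtained path fails. */
    static boolean openFails(Path dataFile) throws IOException {
        try (FileInputStream in = new FileInputStream(dataFile.toFile())) {
            return false;
        } catch (FileNotFoundException e) {
            return true;
        }
    }

    /** Double check: drop the replica only if a file is missing where the map now says it is. */
    static boolean invalidateMissingBlock(long id) {
        Replica r = volumeMap.get(id);
        if (r != null && Files.exists(r.data) && Files.exists(r.meta)) {
            return false; // files were moved, not lost: keep the replica
        }
        volumeMap.remove(id);
        return true;
    }

    /** Replays steps 1 to 3 sequentially, then applies the double check. */
    static String simulate() {
        try {
            Path root = Files.createTempDirectory("dn");
            Path rbw = Files.createDirectory(root.resolve("rbw"));
            Path fin = Files.createDirectory(root.resolve("finalized"));
            Path data = Files.write(rbw.resolve("blk_1"), new byte[] {1});
            Path meta = Files.write(rbw.resolve("blk_1_1001.meta"), new byte[] {2});
            volumeMap.put(1L, new Replica(data, meta));

            Path stale = volumeMap.get(1L).data;   // step 1: reader looks up the RBW replica
            finalizeBlock(1L, fin);                // step 2: writer finalizes the block
            boolean failed = openFails(stale);     // step 3: reader hits FileNotFoundException
            boolean invalidated = invalidateMissingBlock(1L);
            return "openFailed=" + failed + ", invalidated=" + invalidated;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints "openFailed=true, invalidated=false"
    }
}
```

The reader's open fails on the stale RBW path, but the check against the 
replica's current location finds both files, so the replica is kept and no 
deletion is reported.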



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
