[ https://issues.apache.org/jira/browse/HDFS-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905950#action_12905950 ]
Tsz Wo (Nicholas), SZE commented on HDFS-1371:
----------------------------------------------
I have checked the code and discussed it with Koji.
(a) When DFSClient detects a corrupt replica, it reports the replica to NN.
NN then blindly marks the replica as corrupt in the BlocksMap.
(b) When NN receives a ClientProtocol.getBlockLocations(..) rpc call, it gets
all the replicas from the BlocksMap. If there are one or more good replicas,
NN returns the good replicas only. If all replicas are corrupt, it returns
all (corrupt) replicas and sets LocatedBlock.corrupt = true.
(c) When DFSClient gets a LocatedBlock from NN, it does not care whether
LocatedBlock.corrupt is true or false.
The flaws are in (a) and (c). The problem is that if the DFSClient in (a)
is bad (e.g. running on a bad machine), NN may incorrectly mark good replicas
as corrupt. Then, when another DFSClient tries to read the block, it receives
a LocatedBlock with LocatedBlock.corrupt = true but still uses the returned
replicas because of (c). Luckily, the two negatives cancel out, so the read
succeeds. However, the information in the NN BlocksMap is incorrect and will
not be fixed until NN restarts.
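
To make the interaction of (a), (b) and (c) concrete, here is a toy model in
Java. It is only a sketch of the behaviour described above, not the actual
Hadoop code: CorruptReplicaSketch, NameNodeModel, LocatedBlockModel,
reportBadReplica and chooseReplicaIgnoringFlag are hypothetical names, and
datanodes are represented by plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: every class and method here is hypothetical, not Hadoop source.
public class CorruptReplicaSketch {

    /** Simplified stand-in for LocatedBlock: replica locations plus the corrupt flag. */
    static final class LocatedBlockModel {
        final List<String> locations;
        final boolean corrupt;

        LocatedBlockModel(List<String> locations, boolean corrupt) {
            this.locations = locations;
            this.corrupt = corrupt;
        }
    }

    /** Simplified stand-in for the NN's BlocksMap entry for a single block. */
    static final class NameNodeModel {
        private final List<String> replicas;
        private final Set<String> markedCorrupt = new HashSet<>();

        NameNodeModel(List<String> replicas) {
            this.replicas = new ArrayList<>(replicas);
        }

        // (a) The report is trusted blindly; nothing checks whether the
        // reporting client itself is the faulty party.
        void reportBadReplica(String datanode) {
            if (replicas.contains(datanode)) {
                markedCorrupt.add(datanode);
            }
        }

        // (b) Return only good replicas if any exist; otherwise return all
        // replicas with the corrupt flag set.
        LocatedBlockModel getBlockLocations() {
            List<String> good = new ArrayList<>();
            for (String replica : replicas) {
                if (!markedCorrupt.contains(replica)) {
                    good.add(replica);
                }
            }
            if (!good.isEmpty()) {
                return new LocatedBlockModel(good, false);
            }
            return new LocatedBlockModel(new ArrayList<>(replicas), true);
        }
    }

    // (c) The client picks a location without looking at the corrupt flag.
    static String chooseReplicaIgnoringFlag(LocatedBlockModel block) {
        return block.locations.get(0);
    }

    public static void main(String[] args) {
        NameNodeModel nn = new NameNodeModel(Arrays.asList("dn1", "dn2", "dn3"));

        // A single bad client reports every (actually healthy) replica.
        for (String dn : Arrays.asList("dn1", "dn2", "dn3")) {
            nn.reportBadReplica(dn);
        }

        LocatedBlockModel block = nn.getBlockLocations();
        System.out.println("corrupt flag: " + block.corrupt);
        System.out.println("locations returned: " + block.locations.size());
        System.out.println("client reads from: " + chooseReplicaIgnoringFlag(block));
    }
}
```

In this model the block ends up flagged as corrupt, yet the reader is still
handed all three replicas and reads from one of them, which matches the
symptom of fsck reporting corruption while -get/-cat keep working.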
> One bad node can incorrectly flag many files as corrupt
> -------------------------------------------------------
>
> Key: HDFS-1371
> URL: https://issues.apache.org/jira/browse/HDFS-1371
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs client, name-node
> Affects Versions: 0.20.1
> Environment: yahoo internal version
> [knogu...@gwgd4003 ~]$ hadoop version
> Hadoop 0.20.104.3.1007030707
> Reporter: Koji Noguchi
>
> On our cluster, 12 files were reported as corrupt by fsck even though the
> replicas on the datanodes were healthy.
> Turns out that all the replicas (12 files x 3 replicas per file) were
> reported corrupt from one node.
> Surprisingly, these files were still readable/accessible from dfsclient
> (-get/-cat) without any problems.