[
https://issues.apache.org/jira/browse/HDFS-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026085#comment-14026085
]
Gordon Wang commented on HDFS-6505:
-----------------------------------
This issue causes the last block to be reported as missing and the file to be
treated as corrupt, even though the data on the DataNode is actually correct.
I went through the code, and I think a safety check is missing when the NameNode
receives a bad block report from a DataNode.
See the following code snippet in the NameNode's BlockManager:
{code}
  public void findAndMarkBlockAsCorrupt(final ExtendedBlock blk,
      final DatanodeInfo dn, String storageID, String reason) throws IOException {
    assert namesystem.hasWriteLock();
    final BlockInfo storedBlock = getStoredBlock(blk.getLocalBlock());
    if (storedBlock == null) {
      // Check if the replica is in the blockMap, if not
      // ignore the request for now. This could happen when BlockScanner
      // thread of Datanode reports bad block before Block reports are sent
      // by the Datanode on startup
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: "
          + blk + " not found");
      return;
    }
    markBlockAsCorrupt(new BlockToMarkCorrupt(storedBlock, reason,
        Reason.CORRUPTION_REPORTED), dn, storageID);
  }
{code}
We should compare the generation stamp of the reported block with that of the
stored block. If the reported block carries a smaller (older) generation stamp,
the block should not be marked as corrupt. The reported block can carry an older
generation stamp when the client has already recovered the write pipeline, which
bumps the generation stamp of the stored block.
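A rough sketch of the kind of guard I have in mind, placed right after the null
check in findAndMarkBlockAsCorrupt (the exact comparison and log message are my
assumption, not a tested patch):
{code}
    // Sketch of the proposed check: a pipeline recovery bumps the generation
    // stamp of the stored block, so a corrupt-replica report that carries an
    // older generation stamp describes a stale replica state and should be
    // ignored instead of corrupting the live block.
    if (blk.getGenerationStamp() < storedBlock.getGenerationStamp()) {
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk
          + " has an older generation stamp than stored block " + storedBlock
          + ", ignoring the corrupt replica report");
      return;
    }
{code}
If the report in the logs below still referred to the pre-recovery generation
stamp 13808 while the stored block was already at 13948, such a guard would
ignore it instead of marking the only replica as corrupt.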
> Can not close file due to last block is marked as corrupt
> ---------------------------------------------------------
>
> Key: HDFS-6505
> URL: https://issues.apache.org/jira/browse/HDFS-6505
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.2.0
> Reporter: Gordon Wang
>
> After appending to a file, the client could not close it, because the NameNode
> could not complete the last block of the file. The under-construction state of
> the last block remained COMMITTED and never changed.
> The NameNode log looked like this:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK*
> checkFileProgress: blk_1073741920_13948{blockUCState=COMMITTED,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.28.1.2:50010|RBW]]} has not reached
> minimal replication 1
> {code}
> After going through the NameNode log, I found an entry like this:
> {code}
> INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap:
> blk_1073741920 added as corrupt on 172.28.1.2:50010 by sdw3/172.28.1.3
> because client machine reported it
> {code}
> But the last block was actually finalized successfully on the DataNode,
> because I could find this log entry on the DataNode:
> {code}
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer:
> Transmitted BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13808
> (numBytes=50120352) to /172.28.1.3:50010
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /172.28.1.2:36860, dest: /172.28.1.2:50010, bytes: 51686616, op: HDFS_WRITE,
> cliID:
> libhdfs3_client_random_741511239_count_1_pid_215802_tid_140085714196576,
> offset: 0, srvID: DS-2074102060-172.28.1.2-50010-1401432768690, blockid:
> BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, duration:
> 189226453336
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder:
> BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> {code}