Rushabh S Shah created HDFS-10348:
-------------------------------------
Summary: Namenode report bad block method doesn't check whether
the block belongs to datanode before adding it to corrupt replicas map.
Key: HDFS-10348
URL: https://issues.apache.org/jira/browse/HDFS-10348
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.7.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
The namenode (via the report bad block method) doesn't check whether the block
belongs to the datanode before adding it to the corrupt replicas map.
In one of our clusters we found 3 lingering corrupt blocks.
It happened in the following order.
1. Two clients called getBlockLocations for a particular file.
2. Client C1 tried to open the file, encountered a checksum error from node
N3, and reported the bad block (blk1) to the namenode.
3. The namenode added node N3 and block blk1 to the corrupt replicas map and
asked one of the good nodes (one of the 2 remaining nodes) to replicate the
block to another node, N4.
4. After receiving the block, N4 sent an IBR (with RECEIVED_BLOCK) to the
namenode.
5. The namenode removed the block and node N3 from the corrupt replicas map.
It also removed N3's storage from the triplets and queued an invalidate request
for N3.
6. In the meantime, client C2 tried to open the file and the request went to
node N3.
C2 also encountered the checksum exception and reported the bad block to the
namenode.
7. The namenode added the corrupt block blk1 and node N3 to the corrupt replicas
map without confirming whether node N3 still had the block.
After deleting the block, N3 sent an IBR (with DELETED) and the namenode
simply ignored the report, since N3's storage was no longer in the
triplets (from step 5).
We took the node out of rotation, but the block was still present, and only
in the corruptReplicasMap.
This is because, when a node is removed, we only go through the blocks that
are present in the triplets for that datanode.
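
The sequence above can be replayed against a toy model of the namenode's bookkeeping. This is only a sketch, not Hadoop code: the class CorruptReplicaRace and all of its methods are made up for illustration, the "triplets" are reduced to a plain map from block to node names, and step 5 is simplified to clear the corrupt entry as soon as one new replica is received.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the bookkeeping in this report. Hypothetical names, not Hadoop code. */
class CorruptReplicaRace {
    // block -> nodes holding a replica (stands in for the triplets)
    final Map<String, Set<String>> storages = new HashMap<>();
    // block -> nodes whose replica was reported corrupt
    final Map<String, Set<String>> corruptReplicas = new HashMap<>();

    void addReplica(String blk, String node) {
        storages.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
    }

    /** Client report: recorded without checking that the node still holds the block. */
    void reportBadBlock(String blk, String node) {
        corruptReplicas.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
    }

    /** IBR with RECEIVED_BLOCK: record the new replica and clear the corrupt ones (step 5). */
    void blockReceived(String blk, String node) {
        addReplica(blk, node);
        Set<String> corrupt = corruptReplicas.remove(blk);
        if (corrupt != null) {
            // The corrupt nodes lose their storage entry; invalidation is queued for them.
            storages.get(blk).removeAll(corrupt);
        }
    }

    /** IBR with DELETED: ignored when the node is no longer listed for the block. */
    void blockDeleted(String blk, String node) {
        Set<String> nodes = storages.get(blk);
        if (nodes == null || !nodes.contains(node)) {
            return; // report ignored, corruptReplicas is left untouched
        }
        nodes.remove(node);
        removeCorrupt(blk, node);
    }

    /** Node removal: only walks blocks that still list the node in storages. */
    void removeNode(String node) {
        for (Map.Entry<String, Set<String>> e : storages.entrySet()) {
            if (e.getValue().remove(node)) {
                removeCorrupt(e.getKey(), node);
            }
        }
    }

    private void removeCorrupt(String blk, String node) {
        Set<String> corrupt = corruptReplicas.get(blk);
        if (corrupt != null && corrupt.remove(node) && corrupt.isEmpty()) {
            corruptReplicas.remove(blk);
        }
    }

    /** Replays steps 2 to 7 and the later node removal. */
    static CorruptReplicaRace replay() {
        CorruptReplicaRace nn = new CorruptReplicaRace();
        nn.addReplica("blk1", "N1");
        nn.addReplica("blk1", "N2");
        nn.addReplica("blk1", "N3");
        nn.reportBadBlock("blk1", "N3"); // steps 2-3: C1 reports N3
        nn.blockReceived("blk1", "N4");  // steps 4-5: N4 has the block, N3 is cleared
        nn.reportBadBlock("blk1", "N3"); // steps 6-7: C2 reports N3 again
        nn.blockDeleted("blk1", "N3");   // N3's DELETED report is ignored
        nn.removeNode("N3");             // taking N3 out of rotation does not help
        return nn;
    }

    public static void main(String[] args) {
        System.out.println(replay().corruptReplicas); // prints {blk1=[N3]}
    }
}
```

In this model the entry for blk1 and N3 survives both the DELETED report and the node removal, because both code paths are driven by the storage listing and N3 is no longer in it.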
Kuhu's patch fixed this bug via https://issues.apache.org/jira/browse/HDFS-9958.
But I think the following check should be made in the
BlockManager#markBlockAsCorrupt instead of
BlockManager#findAndMarkBlockAsCorrupt.
{noformat}
if (storage == null) {
  storage = storedBlock.findStorageInfo(node);
}
if (storage == null) {
  blockLog.debug("BLOCK* findAndMarkBlockAsCorrupt: {} not found on {}",
      blk, dn);
  return;
}
{noformat}
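
As a rough illustration of the proposed placement, the sketch below puts the guard in the lower-level method, so that any caller reaching it is covered and not only the client-report path. It is a simplified stand-in, not the BlockManager code: the class GuardedCorruptMarker, the string-based maps, and the boolean return values are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for the proposed guard placement. Hypothetical, not Hadoop code. */
class GuardedCorruptMarker {
    // block -> nodes holding a replica (stands in for the triplets)
    final Map<String, Set<String>> storages = new HashMap<>();
    // block -> nodes whose replica was reported corrupt
    final Map<String, Set<String>> corruptReplicas = new HashMap<>();

    void addReplica(String blk, String node) {
        storages.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
    }

    /**
     * Lower-level routine. The guard lives here, mirroring the
     * "storage == null" early return in the snippet above.
     * Returns true only if the replica was recorded as corrupt.
     */
    boolean markBlockAsCorrupt(String blk, String node) {
        Set<String> nodes = storages.get(blk);
        if (nodes == null || !nodes.contains(node)) {
            return false; // block not found on this node, nothing is recorded
        }
        corruptReplicas.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
        return true;
    }

    /** Client-report entry point: looks up the block, then delegates. */
    boolean findAndMarkBlockAsCorrupt(String blk, String node) {
        if (!storages.containsKey(blk)) {
            return false; // unknown block
        }
        return markBlockAsCorrupt(blk, node);
    }

    public static void main(String[] args) {
        GuardedCorruptMarker nn = new GuardedCorruptMarker();
        nn.addReplica("blk1", "N1");
        nn.addReplica("blk1", "N2");
        nn.addReplica("blk1", "N4");
        // N3 no longer holds blk1, so a late report against it is dropped.
        System.out.println(nn.findAndMarkBlockAsCorrupt("blk1", "N3")); // prints false
        System.out.println(nn.corruptReplicas.isEmpty());               // prints true
    }
}
```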