[ https://issues.apache.org/jira/browse/HDFS-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantin Shvachko updated HDFS-10348:
---------------------------------------
    Target Version/s:   (was: 2.7.6)

> Namenode report bad block method doesn't check whether the block belongs to
> datanode before adding it to corrupt replicas map.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10348
>                 URL: https://issues.apache.org/jira/browse/HDFS-10348
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Major
>         Attachments: HDFS-10348-1.patch, HDFS-10348.patch
>
>
> Namenode (via the report bad block method) doesn't check whether the block
> belongs to the datanode before it adds it to the corrupt replicas map.
> In one of our clusters we found that there were 3 lingering corrupt blocks.
> It happened in the following order.
> 1. Two clients called getBlockLocations for a particular file.
> 2. Client C1 tried to open the file, encountered a checksum error from
> node N3, and reported a bad block (blk1) to the namenode.
> 3. Namenode added node N3 and block blk1 to the corrupt replicas map and
> asked one of the good nodes (one of the 2 nodes) to replicate the block to
> another node N4.
> 4. After receiving the block, N4 sent an IBR (with RECEIVED_BLOCK) to the
> namenode.
> 5. Namenode removed the block and node N3 from the corrupt replicas map.
>    It also removed N3's storage from the triplets and queued an invalidate
> request for N3.
> 6. In the meantime, client C2 tried to open the file and the request went to
> node N3.
>    C2 also encountered the checksum exception and reported the bad block to
> the namenode.
> 7. Namenode added the corrupt block blk1 and node N3 to the corrupt replicas
> map without confirming whether node N3 has the block or not.
> After deleting the block, N3 sends an IBR (with DELETED) and the namenode
> simply ignores the report since N3's storage is no longer in the
> triplets (from step 5).
> We took the node out of rotation, but the block was still present, and only
> in the corruptReplicasMap.
> This is because, on removing a node, we only go through the blocks that are
> present in the triplets for the given datanode.
> [~kshukla]'s patch fixed this bug via
> https://issues.apache.org/jira/browse/HDFS-9958.
> But I think the following check should be made in
> BlockManager#markBlockAsCorrupt instead of
> BlockManager#findAndMarkBlockAsCorrupt.
> {noformat}
> if (storage == null) {
>   storage = storedBlock.findStorageInfo(node);
> }
> if (storage == null) {
>   blockLog.debug("BLOCK* findAndMarkBlockAsCorrupt: {} not found on {}",
>       blk, dn);
>   return;
> }
> {noformat}
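
To make the race in steps 1-7 easier to follow, here is a minimal, self-contained Java sketch that replays it against a toy model. It is not HDFS code: the class `CorruptReplicaSketch`, its two maps, and all method names are hypothetical stand-ins for the namenode's triplets and corrupt replicas map, and the `checkStorage` flag only imitates the spirit of the proposed `storage == null` check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the scenario in this issue; hypothetical names, not real HDFS classes. */
class CorruptReplicaSketch {
    // block -> datanodes whose storage holds the block (stand-in for the triplets)
    private final Map<String, Set<String>> storages = new HashMap<>();
    // block -> datanodes reported as holding a corrupt replica
    private final Map<String, Set<String>> corruptReplicas = new HashMap<>();

    void addStoredReplica(String block, String node) {
        storages.computeIfAbsent(block, b -> new HashSet<>()).add(node);
    }

    boolean hasStoredReplica(String block, String node) {
        Set<String> nodes = storages.get(block);
        return nodes != null && nodes.contains(node);
    }

    /** Step 5: drop the node's storage and its corrupt-replica entry for the block. */
    void removeReplica(String block, String node) {
        Set<String> stored = storages.get(block);
        if (stored != null) {
            stored.remove(node);
        }
        Set<String> corrupt = corruptReplicas.get(block);
        if (corrupt != null) {
            corrupt.remove(node);
            if (corrupt.isEmpty()) {
                corruptReplicas.remove(block);
            }
        }
    }

    /** Records a bad-block report; with checkStorage, unknown replicas are ignored. */
    boolean reportBadBlock(String block, String node, boolean checkStorage) {
        if (checkStorage && !hasStoredReplica(block, node)) {
            return false; // the node no longer owns this block, so skip the report
        }
        corruptReplicas.computeIfAbsent(block, b -> new HashSet<>()).add(node);
        return true;
    }

    boolean isMarkedCorrupt(String block, String node) {
        Set<String> nodes = corruptReplicas.get(block);
        return nodes != null && nodes.contains(node);
    }

    /** Replays steps 1-7 and returns true if a stale corrupt entry lingers for N3. */
    static boolean replay(boolean checkStorage) {
        CorruptReplicaSketch nn = new CorruptReplicaSketch();
        nn.addStoredReplica("blk1", "N1");
        nn.addStoredReplica("blk1", "N2");
        nn.addStoredReplica("blk1", "N3");
        nn.reportBadBlock("blk1", "N3", checkStorage); // step 2-3: C1 reports N3
        nn.addStoredReplica("blk1", "N4");             // step 4: N4 received the block
        nn.removeReplica("blk1", "N3");                // step 5: N3 is cleaned up
        nn.reportBadBlock("blk1", "N3", checkStorage); // step 6-7: late report from C2
        return nn.isMarkedCorrupt("blk1", "N3");
    }

    public static void main(String[] args) {
        System.out.println("without check: lingering=" + replay(false));
        System.out.println("with check:    lingering=" + replay(true));
    }
}
```

In this toy model the late report from C2 leaves a stale entry for N3 when the ownership check is skipped, and leaves none when the check is applied. Whether the real check belongs in BlockManager#markBlockAsCorrupt or BlockManager#findAndMarkBlockAsCorrupt is the question raised above; the sketch does not settle it.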