[ https://issues.apache.org/jira/browse/HDFS-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044755#comment-16044755 ]
Kihwal Lee commented on HDFS-11960:
-----------------------------------

Details of step 6). {{processIncrementalBlockReport()}} calls {{addBlock()}} for the received IBR with the old gen stamp. {{addBlock()}} unconditionally decrements the pending count for the block.

{code:java}
void addBlock(DatanodeStorageInfo storageInfo, Block block, String delHint)
    throws IOException {
  ...
  //
  // Modify the blocks->datanode map and node's map.
  //
  pendingReplications.decrement(block, node);
  processAndHandleReportedBlock(storageInfo, block, ReplicaState.FINALIZED,
      delHintNode);
}
{code}

In {{processAndHandleReportedBlock()}}, the replica is identified as corrupt, so {{markBlockAsCorrupt()}} is called.

{code:java}
private void markBlockAsCorrupt(BlockToMarkCorrupt b,
    DatanodeStorageInfo storageInfo,
    DatanodeDescriptor node) throws IOException {
  ...
  boolean corruptedDuringWrite = minReplicationSatisfied &&
      (b.stored.getGenerationStamp() > b.corrupted.getGenerationStamp());
  // case 1: have enough number of live replicas
  // case 2: corrupted replicas + live replicas > Replication factor
  // case 3: Block is marked corrupt due to failure while writing. In this
  // case genstamp will be different than that of valid block.
  // In all these cases we can delete the replica.
  // In case of 3, rbw block will be deleted and valid block can be replicated
  if (hasEnoughLiveReplicas || hasMoreCorruptReplicas || corruptedDuringWrite) {
    // the block is over-replicated so invalidate the replicas immediately
    invalidateBlock(b, node);
  } else if (namesystem.isPopulatingReplQueues()) {
    // add the block to neededReplication
    updateNeededReplications(b.stored, -1, 0);
  }
}
{code}

As shown above, the report falls into "case 3", which causes immediate invalidation of the corrupt replica. No further check on replication is done.

> Successfully closed files can stay under-replicated.
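The branch selection in {{markBlockAsCorrupt()}} can be sketched with a minimal model ({{GenStampDecision}} is a hypothetical class for illustration, not HDFS code): because close recovery bumped the stored block's gen stamp above the reported replica's, the {{corruptedDuringWrite}} test passes and the replica is invalidated without the block ever being re-added to {{neededReplications}}.

```java
// Minimal model of the markBlockAsCorrupt() branch selection.
// Hypothetical class for illustration; not actual HDFS code.
public class GenStampDecision {

    enum Action { INVALIDATE, UPDATE_NEEDED_REPLICATIONS }

    static Action decide(long storedGenStamp, long corruptedGenStamp,
                         boolean hasEnoughLiveReplicas,
                         boolean hasMoreCorruptReplicas,
                         boolean minReplicationSatisfied) {
        // "case 3": the gen stamps differ because the replica was finalized
        // before close recovery bumped the stored block's gen stamp.
        boolean corruptedDuringWrite = minReplicationSatisfied
                && storedGenStamp > corruptedGenStamp;
        if (hasEnoughLiveReplicas || hasMoreCorruptReplicas
                || corruptedDuringWrite) {
            return Action.INVALIDATE;             // delete replica; no re-check
        }
        return Action.UPDATE_NEEDED_REPLICATIONS; // re-queue for replication
    }

    public static void main(String[] args) {
        // Stored gen stamp 1002 (after close recovery) vs. reported 1001:
        // even with too few live replicas, the branch invalidates, and the
        // block is never put back in the replication queue.
        System.out.println(decide(1002L, 1001L, false, false, true));
        // prints INVALIDATE
    }
}
```

Note that the same inputs with equal gen stamps would instead fall through to {{updateNeededReplications()}}, which is why only the close-recovery case hides the block.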
> ----------------------------------------------------
>
>                 Key: HDFS-11960
>                 URL: https://issues.apache.org/jira/browse/HDFS-11960
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> If a certain set of conditions holds at the time of a file creation, a block
> of the file can stay under-replicated. This is because the block is
> mistakenly taken out of the under-replicated block queue and never gets
> reevaluated.
> Re-evaluation can be triggered if
> - a node containing a replica dies.
> - setrep is called
> - NN replication queues are reinitialized (NN failover or restart)
> If none of these happens, the block stays under-replicated.
> Here is how it happens.
> 1) A replica is finalized, but the ACK does not reach the upstream in time.
> The IBR is also delayed.
> 2) A close recovery happens, which updates the gen stamp of the "healthy"
> replicas.
> 3) The file is closed with the healthy replicas. The block is added to the
> replication queue.
> 4) A replication is scheduled, so the block is added to the pending
> replication list. The replication target picked is the failed node from 1).
> 5) The old IBR is finally received from the failed/excluded node. In the
> meantime, the replication fails, because there is already a finalized
> replica (with an older gen stamp) on that node.
> 6) The IBR processing removes the block from the pending list, adds it to
> the corrupt replicas list, and then issues an invalidation. Since the block
> is in neither the replication queue nor the pending list, it stays
> under-replicated.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
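The six-step sequence above boils down to an accounting bug: the block leaves {{neededReplications}} when replication is scheduled, leaves {{pendingReplications}} when the stale IBR is processed, and "case 3" never puts it back. A toy simulation (hypothetical class and block ID, not HDFS code) of that bookkeeping:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the NameNode's replication accounting, walking steps 3)
// through 6) above. Hypothetical class for illustration; not HDFS code.
public class UnderReplicationRace {
    // Simplified stand-ins for neededReplications and pendingReplications.
    static Set<String> needed = new HashSet<>();
    static Map<String, Integer> pending = new HashMap<>();

    // Step 4: scheduling moves the block from the needed queue to pending.
    static void scheduleReplication(String block) {
        needed.remove(block);
        pending.merge(block, 1, Integer::sum);
    }

    // Step 6: the stale IBR decrements the pending count unconditionally,
    // and the "case 3" branch invalidates the replica without re-queuing.
    static void processStaleIbr(String block) {
        pending.merge(block, -1, Integer::sum);
        if (pending.get(block) <= 0) pending.remove(block);
        // markBlockAsCorrupt(): corruptedDuringWrite == true, so
        // invalidateBlock() runs and updateNeededReplications() never does.
    }

    static boolean isTracked(String block) {
        return needed.contains(block) || pending.containsKey(block);
    }

    public static void main(String[] args) {
        String block = "blk_1073741825";      // hypothetical block ID
        needed.add(block);                    // step 3: closed under-replicated
        scheduleReplication(block);           // step 4
        processStaleIbr(block);               // steps 5-6
        // The block now sits in neither structure; nothing re-checks it
        // until a replica-holding node dies, setrep runs, or the queues
        // are reinitialized.
        System.out.println(isTracked(block)); // false
    }
}
```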