[ https://issues.apache.org/jira/browse/HDFS-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247796#comment-17247796 ]
Tsz-wo Sze commented on HDFS-15725:
-----------------------------------

{code}
-            ". Committed blocks are waiting to be minimally replicated." +
-            " Try again later.";
+            ". Committed blocks are waiting to be minimally replicated.";
         NameNode.stateChangeLog.warn(message);
-        throw new AlreadyBeingCreatedException(message);
+        if (!penultimateBlockMinStorage) {
+          throw new AlreadyBeingCreatedException(message);
+        }
+        // Intentionally fall through to UNDER_RECOVERY so BLOCK_RECOVERY is
+        // attempted
{code}
Would it be better to try a few times before throwing AlreadyBeingCreatedException? The datanodes may just be slow so that they have not yet reported the last block.

> Lease Recovery never completes for a committed block which the DNs never
> finalize
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-15725
>                 URL: https://issues.apache.org/jira/browse/HDFS-15725
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-15725.001.patch, HDFS-15725.002.patch,
> HDFS-15725.003.patch, lease_recovery_2_10.patch
>
>
> In a very rare condition, the HDFS client process can get killed right at the
> time it is completing a block / file.
> The client sends the "complete" call to the namenode, moving the block into a
> committed state, but it dies before it can send the final packet to the
> Datanodes telling them to finalize the block.
> This means the blocks are stuck on the datanodes in RBW state and nothing
> will ever tell them to move out of that state.
> The namenode / lease manager will retry forever to close the file, but it
> will always complain it is waiting for blocks to reach minimal replication.
> I have a simple test and patch to fix this, but I think it warrants some
> discussion on whether this is the correct thing to do, or if I need to put
> the fix behind a config switch.
> My idea is that if lease recovery occurs, and the block is still waiting on
> "minimal replication", just put the file back to UNDER_CONSTRUCTION so that
> on the next lease recovery attempt, BLOCK RECOVERY will happen, close the
> file and move the replicas to FINALIZED.
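
One possible reading of the "try a few times" suggestion in the comment is to give slow datanodes a bounded number of lease recovery attempts to report the last block, and only then fall through to block recovery. The sketch below illustrates that decision in isolation. It is not taken from any of the attached patches: the class, the enum, the method name, and the per-path counter are all hypothetical and do not exist in HDFS.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch only. None of these names exist in HDFS; they just
 * model "retry a few times, then attempt block recovery".
 */
final class CommittedBlockRetrySketch {
  /** Outcome of one lease recovery attempt on a file (assumed names). */
  enum Action { CLOSE_FILE, RETRY_LATER, START_BLOCK_RECOVERY }

  private final int maxAttempts;
  // Number of attempts that found the last block committed but not yet
  // minimally replicated, keyed by file path.
  private final Map<String, Integer> attempts = new HashMap<>();

  CommittedBlockRetrySketch(int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    this.maxAttempts = maxAttempts;
  }

  Action onLeaseRecovery(String path, boolean lastBlockMinimallyReplicated) {
    if (lastBlockMinimallyReplicated) {
      // The datanodes caught up, so the file can simply be closed.
      attempts.remove(path);
      return Action.CLOSE_FILE;
    }
    int seen = attempts.merge(path, 1, Integer::sum);
    if (seen < maxAttempts) {
      // The datanodes may only be slow; wait for the next attempt.
      return Action.RETRY_LATER;
    }
    // Retry budget used up: stop waiting and attempt block recovery.
    attempts.remove(path);
    return Action.START_BLOCK_RECOVERY;
  }

  public static void main(String[] args) {
    CommittedBlockRetrySketch policy = new CommittedBlockRetrySketch(3);
    for (int i = 1; i <= 3; i++) {
      System.out.println("attempt " + i + ": "
          + policy.onLeaseRecovery("/tmp/example", false));
    }
  }
}
```

With a budget of 3, the first two attempts return RETRY_LATER and the third returns START_BLOCK_RECOVERY. Whether such a counter is worth the extra state, compared with falling through on the first attempt as the patch does, is the open question in the comment.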