[ https://issues.apache.org/jira/browse/HDFS-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248242#comment-17248242 ]
Stephen O'Donnell commented on HDFS-15725:
------------------------------------------

I have committed this to trunk and branch-3.3.

Branch-3.2 had two very minor conflicts in the tests. This call (which appears twice in the tests):

{code}
HdfsFileStatus stat = client.getNamenode()
    .create(file, new FsPermission("777"), client.clientName,
        new EnumSetWritable<CreateFlag>(EnumSet.of(CreateFlag.CREATE)),
        true, (short) repFactor, 1024 * 1024 * 128L,
        new CryptoProtocolVersion[0], null, null);
{code}

has to have the final parameter removed:

{code}
HdfsFileStatus stat = client.getNamenode()
    .create(file, new FsPermission("777"), client.clientName,
        new EnumSetWritable<CreateFlag>(EnumSet.of(CreateFlag.CREATE)),
        true, (short) repFactor, 1024 * 1024 * 128L,
        new CryptoProtocolVersion[0], null);
{code}

The previously failing tests pass locally.

The branch-3.2 patch also applies cleanly to branch-3.1, so I think we need just one more patch for 2.10.

> Lease Recovery never completes for a committed block which the DNs never finalize
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-15725
>                 URL: https://issues.apache.org/jira/browse/HDFS-15725
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-15725.001.patch, HDFS-15725.002.patch,
> HDFS-15725.003.patch, HDFS-15725.branch-3.2.001.patch,
> lease_recovery_2_10.patch
>
>
> In a very rare condition, the HDFS client process can get killed right at the
> time it is completing a block / file.
> The client sends the "complete" call to the namenode, moving the block into a
> committed state, but it dies before it can send the final packet to the
> Datanodes telling them to finalize the block.
> This means the blocks are stuck on the datanodes in RBW state and nothing
> will ever tell them to move out of that state.
> The namenode / lease manager will retry forever to close the file, but it
> will always complain it is waiting for blocks to reach minimal replication.
> I have a simple test and patch to fix this, but I think it warrants some
> discussion on whether this is the correct thing to do, or if I need to put
> the fix behind a config switch.
> My idea is that if lease recovery occurs and the block is still waiting on
> "minimal replication", just put the file back to UNDER_CONSTRUCTION so that
> on the next lease recovery attempt, BLOCK RECOVERY will happen, close the
> file and move the replicas to FINALIZED.
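
The idea in the description above can be illustrated with a small toy model. This is only a sketch under stated assumptions: none of the types below are Hadoop classes, and the names (`ToyFile`, `recoverLease`, `revertCommittedBlock`, `MIN_REPLICATION`) are hypothetical and chosen for illustration. It models a last block that is COMMITTED on the namenode while every replica is still RBW, and compares retrying forever with reverting the file to UNDER_CONSTRUCTION so that the next attempt performs block recovery.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposed behaviour; these are NOT Hadoop classes. */
public class LeaseRecoverySketch {

    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }
    enum ReplicaState { RBW, FINALIZED }

    // Hypothetical minimal replication needed before a committed block can complete.
    static final int MIN_REPLICATION = 1;

    static class ToyFile {
        BlockState lastBlock = BlockState.COMMITTED;
        final List<ReplicaState> replicas = new ArrayList<>();
        boolean closed = false;

        long finalizedReplicas() {
            return replicas.stream().filter(r -> r == ReplicaState.FINALIZED).count();
        }
    }

    /** One lease recovery attempt. Returns true once the file is closed. */
    static boolean recoverLease(ToyFile f, boolean revertCommittedBlock) {
        if (f.closed) {
            return true;
        }
        if (f.lastBlock == BlockState.UNDER_CONSTRUCTION) {
            // Block recovery: replicas are finalized and the file is closed.
            f.replicas.replaceAll(r -> ReplicaState.FINALIZED);
            f.lastBlock = BlockState.COMPLETE;
            f.closed = true;
        } else if (f.lastBlock == BlockState.COMMITTED) {
            if (f.finalizedReplicas() >= MIN_REPLICATION) {
                f.lastBlock = BlockState.COMPLETE;
                f.closed = true;
            } else if (revertCommittedBlock) {
                // Proposed idea: stop waiting for minimal replication and put the
                // file back to UNDER_CONSTRUCTION for the next attempt.
                f.lastBlock = BlockState.UNDER_CONSTRUCTION;
            }
            // Otherwise: keep waiting, which never ends if the replicas stay RBW.
        }
        return f.closed;
    }

    /** Returns the attempt number on which the file closed, or -1 if it never did. */
    static int attemptsUntilClosed(boolean revertCommittedBlock, int maxAttempts) {
        ToyFile f = new ToyFile();
        for (int i = 0; i < 3; i++) {
            f.replicas.add(ReplicaState.RBW); // client died before the final packet
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (recoverLease(f, revertCommittedBlock)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("without revert: " + attemptsUntilClosed(false, 10)); // -1
        System.out.println("with revert:    " + attemptsUntilClosed(true, 10));  // 2
    }
}
```

In this model the file closes on the second attempt when the revert is enabled, matching the "next lease recovery attempt" wording in the description. Whether the real change needs to sit behind a config switch is the open question raised there and is not addressed by the sketch.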