[ https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Chen updated HDFS-12369:
-----------------------------
    Summary: Edit log corruption due to hard lease recovery of not-closed file which has snapshots  (was: Edit log corruption due to hard lease recovery of not-closed file)

> Edit log corruption due to hard lease recovery of not-closed file which has snapshots
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-12369
>                 URL: https://issues.apache.org/jira/browse/HDFS-12369
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>         Attachments: HDFS-12369.01.patch, HDFS-12369.02.patch, HDFS-12369.03.patch, HDFS-12369.test.patch
>
>
> HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.
> Recently, we observed the NN being unable to start, with the following exception:
> {noformat}
> 2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
> {noformat}
> Quoting a nice analysis of the edits:
> {quote}
> In the edits logged about 1 hour later, we see this failing OP_CLOSE. The sequence in the edits shows the file going through:
> OPEN
> ADD_BLOCK
> CLOSE
> ADD_BLOCK  # perhaps this was an append
> DELETE
> (about 1 hour later) CLOSE
> It is interesting that there was no CLOSE logged before the delete.
> {quote}
> Grepping for that file name, it turns out the close was triggered by {{LeaseManager}} when the lease reached the hard limit.
> {noformat}
> 2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
> 2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
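[Editor's note] The failure mode the quoted analysis describes can be sketched in miniature. The following is a hypothetical, heavily simplified model (not actual HDFS code: block ops are elided, the inode tree is just a map, and `EditReplaySketch`, `replayFails`, and the placeholder path are invented for illustration). It shows why replaying the late OP_CLOSE aborts startup: the earlier OP_DELETE removed the inode, so the path lookup (analogous to {{INodeFile.valueOf}}) throws {{FileNotFoundException}}.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of edit-log replay for one file (NOT HDFS internals).
public class EditReplaySketch {
    // path -> under construction? (stand-in for the NN's inode tree)
    private final Map<String, Boolean> inodes = new HashMap<>();

    void open(String path)   { inodes.put(path, true); }   // OP_ADD
    void delete(String path) { inodes.remove(path); }      // OP_DELETE

    // OP_CLOSE: resolve the path or fail replay, like INodeFile.valueOf.
    void close(String path) throws FileNotFoundException {
        if (!inodes.containsKey(path)) {
            throw new FileNotFoundException("File does not exist: " + path);
        }
        inodes.put(path, false);
    }

    /** Replays the sequence from the quoted analysis; true if replay aborts. */
    public static boolean replayFails() {
        EditReplaySketch fs = new EditReplaySketch();
        String src = "/tmp/example-file";  // placeholder, not the real path
        try {
            fs.open(src);    // OPEN
            fs.close(src);   // CLOSE
            fs.open(src);    // re-opened (perhaps an append)
            fs.delete(src);  // DELETE -- note: no CLOSE logged before it
            fs.close(src);   // ~1h later: OP_CLOSE from hard-limit lease recovery
            return false;
        } catch (FileNotFoundException e) {
            return true;     // replay aborts here, so the NN fails to start
        }
    }

    public static void main(String[] args) {
        System.out.println("replay fails: " + replayFails());
    }
}
```

In the real NameNode the DELETE only removes the file from the namespace view; with snapshots involved the inode can linger, which is why the {{LeaseManager}}'s hard-limit recovery still found a lease to close and logged an OP_CLOSE that is unreplayable against the edit stream.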