[jira] [Comment Edited] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file which has snapshots

2017-09-07 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154913#comment-16154913
 ] 

Xiao Chen edited comment on HDFS-12369 at 9/7/17 10:57 PM:
---

Thanks for the review Yongjun.

bq. does this issue only occur when the file has a snapshot? 
Good question! Yes, because only with snapshots (and when the inode is in the 
latest snapshot) does the delete go through 
{{FSDirDeleteOp#unprotectedDelete}} -> {{INodeFile#cleanSubtree}}, which 
eventually ends up not calling {{clearFile}}, as seen in:
{code:title=FileWithSnapshotFeature.java}
  public void collectBlocksAndClear(
      INode.ReclaimContext reclaimContext, final INodeFile file) {
    // check if everything is deleted.
    if (isCurrentFileDeleted() && getDiffs().asList().isEmpty()) {
      file.clearFile(reclaimContext);
      return;
    }
{code}
Added a few comments in the test about this, and also updated the jira title.

Good catch on the extra ';'. Attached patch 3 to address that and the previous 
checkstyle issue, and to add the {{addBlock}} call in the test.
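
For reference, here is a minimal sketch of how this scenario can be driven in a 
test. This is an illustration only, not the attached patch: the lease periods, 
sleep, and sizes are assumptions, and it leans on the standard hadoop-hdfs test 
helpers ({{MiniDFSCluster#setLeasePeriod}}, {{MiniDFSCluster#restartNameNode}}).
{code:java}
// Repro sketch (illustrative): delete an open file that lives in the latest
// snapshot, then let hard-limit lease recovery log an OP_CLOSE for it.
@Test
public void testHardLeaseRecoveryOfDeletedFileWithSnapshot() throws Exception {
  MiniDFSCluster cluster =
      new MiniDFSCluster.Builder(new HdfsConfiguration()).numDataNodes(1).build();
  try {
    DistributedFileSystem dfs = cluster.getFileSystem();
    Path dir = new Path("/dir");
    Path file = new Path(dir, "file");
    dfs.mkdirs(dir);
    dfs.allowSnapshot(dir);

    FSDataOutputStream out = dfs.create(file);
    out.write(new byte[1024]);
    out.hsync();                    // data reaches the DN; file stays open

    dfs.createSnapshot(dir, "s1");  // the inode is now in the latest snapshot
    dfs.delete(file, false);        // the snapshot keeps the INodeFile alive

    // Shrink the lease limits so the LeaseManager performs hard-limit
    // recovery, which logs an OP_CLOSE for the already-deleted path.
    cluster.setLeasePeriod(500, 500);
    Thread.sleep(2000);             // crude wait for the lease monitor

    // Replaying that OP_CLOSE is what fails with FileNotFoundException.
    cluster.restartNameNode();
  } finally {
    cluster.shutdown();
  }
}
{code}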


was (Author: xiaochen):
Thanks for the review Yongjun.

bq. does this issue only occur when the file has a snapshot? 
Good question! Yes, because only with snapshots does the delete go through 
{{FSDirDeleteOp#unprotectedDelete}} -> {{INodeFile#cleanSubtree}}, which 
eventually ends up not calling {{clearFile}}, as seen in:
{code:title=FileWithSnapshotFeature.java}
  public void collectBlocksAndClear(
      INode.ReclaimContext reclaimContext, final INodeFile file) {
    // check if everything is deleted.
    if (isCurrentFileDeleted() && getDiffs().asList().isEmpty()) {
      file.clearFile(reclaimContext);
      return;
    }
{code}
Added a few comments in the test about this, and also updated the jira title.

Good catch on the extra ';'. Attached patch 3 to address that and the previous 
checkstyle issue, and to add the {{addBlock}} call in the test.

> Edit log corruption due to hard lease recovery of not-closed file which has 
> snapshots
> ----------------------------------------------------------------------------
>
> Key: HDFS-12369
> URL: https://issues.apache.org/jira/browse/HDFS-12369
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiao Chen
>Assignee: Xiao Chen
> Attachments: HDFS-12369.01.patch, HDFS-12369.02.patch, 
> HDFS-12369.03.patch, HDFS-12369.test.patch
>
>
> HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations 
> of client operations.
> Recently, we have observed an NN unable to start, with the following exception:
> {noformat}
> 2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-0.9nlJ3M
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
> {noformat}
> Quoting a nice analysis of the edits:
> {quote}
> In the edits logged about 1 hour later, we see this failing OP_CLOSE. The 
> sequence in the edits shows the file going through:
>   OPEN
>   ADD_BLOCK
>   CLOSE
>   ADD_BLOCK # perhaps this was an append
>   DELETE
>   (about 1 hour later) CLOSE
> It is interesting that there was no CLOSE logged before the delete.
> {quote}
> Grepping that file name, it turns out the close was triggered by 
> {{LeaseManager}}, when the lease reaches hard limit.
> {noformat}
> 2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
>   Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], 
>   src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-0.9nlJ3M
> 2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
>   internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
>   /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-0.9nlJ3M closed.
> {noformat}

[jira] [Comment Edited] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

2017-08-28 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144529#comment-16144529
 ] 

Xiao Chen edited comment on HDFS-12369 at 8/28/17 11:21 PM:


Attaching a unit test that sorta reproduces this - it ends up with an NPE when 
loading {{ReassignLeaseOp}}, instead of the FNFE when loading the {{CloseOp}}.

I'm still looking into this because I thought deleting a file would always 
remove the lease, but wanted to post here for early discussion.
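
That assumption can be sanity-checked along these lines (a sketch reusing the 
cluster setup from the repro above, not the attached test; {{NameNodeAdapter}} 
is the test-only helper in the hadoop-hdfs test tree):
{code:java}
// Sketch: does deleting a still-open file remove its lease?
FSDataOutputStream out = dfs.create(file);  // file left open, lease held
out.hflush();
dfs.delete(file, false);
LeaseManager lm = NameNodeAdapter.getLeaseManager(cluster.getNamesystem());
assertEquals("expected the delete to remove the lease", 0, lm.countLease());
{code}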




was (Author: xiaochen):
Attaching a unit test that sorta reproduces this - it ends up with an NPE when 
loading {{ReassignLeaseOp}}, instead of the FNFE when loading the {{CloseOp}}.

I'm still looking into this, but wanted to post here for early discussion.





