[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310628#comment-14310628 ]

Abhishek Rai commented on HDFS-6908:

Thanks Harsh, that sounds reasonable. It gives us a way to avoid living with the FSImageFormatPBSnapshot hack in the longer term, once the proper fix for this bug is applied. Thanks.

incorrect snapshot directory diff generated by snapshot deletion

Key: HDFS-6908
URL: https://issues.apache.org/jira/browse/HDFS-6908
Project: Hadoop HDFS
Issue Type: Bug
Components: snapshots
Reporter: Juan Yu
Assignee: Juan Yu
Priority: Critical
Fix For: 2.6.0
Attachments: HDFS-6908.001.patch, HDFS-6908.002.patch, HDFS-6908.003.patch

In the following scenario, deleting a snapshot can generate an incorrect snapshot directory diff and a corrupted fsimage. If you restart the NN after that, you will get a NullPointerException.
# create a directory and create a file under it
# take a snapshot
# create another file under that directory
# take a second snapshot
# delete both files and the directory
# delete the second snapshot

An incorrect directory diff will be generated.
Restarting the NN then throws an NPE:
{code}
java.lang.NullPointerException
	at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.addToDeletedList(FSImageFormatPBSnapshot.java:246)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.loadDeletedList(FSImageFormatPBSnapshot.java:265)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.loadDirectoryDiffList(FSImageFormatPBSnapshot.java:328)
	at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.loadSnapshotDiffSection(FSImageFormatPBSnapshot.java:192)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:208)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:906)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:892)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:715)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:653)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:276)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:882)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:629)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:498)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:554)
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
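To make the scenario concrete, here is a minimal Java sketch of how one directory's snapshot diffs could be merged when the later snapshot is deleted. This is a hypothetical toy model, not HDFS code. `SimpleDiff` and `combinePosterior` are illustrative names. Children are keyed by file name rather than by inode. The deletion of the directory itself, which would be recorded in its parent's diff, is left out. The sketch does not reproduce the bug. It only shows what a consistent result looks like for the steps in the description: `file2` was created after the first snapshot and deleted after the second. Once the second snapshot is gone, no snapshot references `file2`, so it should be destroyed rather than left behind in a deleted list. A deleted-list entry that points at a destroyed inode is one plausible way for the loader's lookup in the stack trace to come back null.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Toy model of one directory's snapshot diff: the children created and
 * deleted after the snapshot was taken. Hypothetical; not the HDFS
 * DirectoryDiff class. Children are identified by name only.
 */
final class SimpleDiff {
  final Set<String> created = new LinkedHashSet<>();
  final Set<String> deleted = new LinkedHashSet<>();

  /**
   * Folds the posterior (later) diff into this prior diff, as when the later
   * snapshot is deleted. Returns the children that no snapshot references
   * any more, i.e. the ones that should be destroyed.
   */
  Set<String> combinePosterior(SimpleDiff posterior) {
    Set<String> destroyed = new LinkedHashSet<>();
    // Handle deletions first so that a child replaced (deleted and re-created
    // under the same name) after the later snapshot keeps its new entry.
    for (String name : posterior.deleted) {
      if (created.remove(name)) {
        // Created after the prior snapshot and deleted after the later one:
        // only the deleted snapshot saw it, so it cancels out entirely.
        destroyed.add(name);
      } else {
        // Existed at the prior snapshot, so the prior diff must still record it.
        deleted.add(name);
      }
    }
    created.addAll(posterior.created);
    return destroyed;
  }
}

final class DiffCombineDemo {
  public static void main(String[] args) {
    // Steps 1-4: file1 exists at s1; file2 is created between s1 and s2.
    SimpleDiff s1 = new SimpleDiff();
    s1.created.add("file2");
    // Step 5: both files are deleted after s2.
    SimpleDiff s2 = new SimpleDiff();
    s2.deleted.add("file1");
    s2.deleted.add("file2");
    // Step 6: delete s2, folding its diff into the diff of s1.
    Set<String> destroyed = s1.combinePosterior(s2);
    System.out.println("destroyed=" + destroyed
        + " s1.created=" + s1.created + " s1.deleted=" + s1.deleted);
  }
}
```

Running `DiffCombineDemo` prints `destroyed=[file2] s1.created=[] s1.deleted=[file1]`. After the merge, the first snapshot's diff refers only to `file1`, which that snapshot still needs.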
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310489#comment-14310489 ]

Harsh J commented on HDFS-6908:

bq. and since it doesn't fix the corrupt fsimage on disk, the hack will always be needed to make the namenode work.

When the NN comes up (ensure you also have this bugfix patched on), you can then invoke a {{dfsadmin -saveNamespace}} to recreate a new, good image.
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310377#comment-14310377 ]

Abhishek Rai commented on HDFS-6908:

We have an HDFS installation in production where we ran into this problem. Because the fsimage is corrupt, the namenode fails to come up, which leaves the system unusable. We suspect that in our case the problem was triggered by deleting one of the existing snapshots of a large directory that contains several sub-directories and files.

The proposed fix above is definitely useful for preventing this issue going forward. Is there any recommendation for how to fix an fsimage that has already been corrupted by this bug?

We temporarily put the following hack into FSImageFormatPBSnapshot.java to mask this problem. The risk with this hack is that it can mask other bugs/corruptions, and since it doesn't fix the corrupt fsimage on disk, the hack will always be needed to make the namenode work.
{noformat}
+++ b/hadoop/hadoop-2.5.1-src/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/FSImageFormatPBSnapshot.java
@@ -34,6 +34,8 @@
 import java.util.List;
 import java.util.Map;
 
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.classification.InterfaceAudience;
 import org.apache.hadoop.fs.permission.PermissionStatus;
 import org.apache.hadoop.hdfs.server.namenode.AclFeature;
@@ -73,6 +75,9 @@
 @InterfaceAudience.Private
 public class FSImageFormatPBSnapshot {
+  public static final Log LOG = LogFactory.getLog(FSImageFormatPBSnapshot.class);
+
+
   /**
    * Loading snapshot related information from protobuf based FSImage
    */
@@ -267,8 +272,12 @@ private void addToDeletedList(INode dnode, INodeDirectory parent) {
       // load non-reference inodes
       for (long deletedId : deletedNodes) {
         INode deleted = fsDir.getInode(deletedId);
-        dlist.add(deleted);
-        addToDeletedList(deleted, dir);
+        if (deleted != null) {
+          dlist.add(deleted);
+          addToDeletedList(deleted, dir);
+        } else {
+          LOG.error("Could not find inode " + deletedId + " from deleted-list of directory: " + dir.toDetailString());
+        }
{noformat}
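The workaround above amounts to tolerating ids in a deleted list that no longer resolve to an inode. The following self-contained Java sketch contrasts the unguarded and guarded behaviour under stated assumptions. `ToyINode`, `DeletedListLoader`, `loadStrict` and `loadLenient` are hypothetical stand-ins, not HDFS classes. An id-to-inode map plays the role of `fsDir.getInode`, and `java.util.logging` replaces the commons-logging `LOG` used in the patch. The lenient path only hides the dangling reference and repairs nothing on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical stand-in for an inode; only the fields this sketch needs. */
final class ToyINode {
  final long id;
  ToyINode parent;

  ToyINode(long id) {
    this.id = id;
  }
}

/** Sketch of resolving a directory's deleted list, stored as inode ids. */
final class DeletedListLoader {
  private static final Logger LOG = Logger.getLogger(DeletedListLoader.class.getName());

  /** Unguarded: a dangling id resolves to null and the parent update throws an NPE. */
  static List<ToyINode> loadStrict(Map<Long, ToyINode> inodes, long[] deletedIds, ToyINode dir) {
    List<ToyINode> dlist = new ArrayList<>();
    for (long id : deletedIds) {
      ToyINode deleted = inodes.get(id);
      dlist.add(deleted);
      deleted.parent = dir; // NullPointerException if id is not in the map
    }
    return dlist;
  }

  /** Guarded, like the hack: skip ids that do not resolve and report them. */
  static List<ToyINode> loadLenient(Map<Long, ToyINode> inodes, long[] deletedIds,
      ToyINode dir, List<Long> missing) {
    List<ToyINode> dlist = new ArrayList<>();
    for (long id : deletedIds) {
      ToyINode deleted = inodes.get(id);
      if (deleted != null) {
        dlist.add(deleted);
        deleted.parent = dir;
      } else {
        missing.add(id);
        LOG.severe("Could not find inode " + id + " from deleted-list of directory " + dir.id);
      }
    }
    return dlist;
  }
}
```

Returning the `missing` ids lets a caller decide whether a non-empty list should block startup. Silently continuing is exactly the masking risk described in the comment.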
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110927#comment-14110927 ]

Juan Yu commented on HDFS-6908:

[~jingzhao], thanks for reviewing the patch and for the discussion.
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109386#comment-14109386 ]

Jing Zhao commented on HDFS-6908:

+1 for the latest patch. Thanks for the fix, [~j...@cloudera.com]! I will commit it later today.
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108592#comment-14108592 ]

Hadoop QA commented on HDFS-6908:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12663979/HDFS-6908.003.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestFileAppend
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS
org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7744//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7744//console

This message is automatically generated.
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108735#comment-14108735 ]

Juan Yu commented on HDFS-6908:

I don't think the failed tests are related to this patch.
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107908#comment-14107908 ]

Hadoop QA commented on HDFS-6908:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12663603/HDFS-6908.002.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestEncryptionZones
org.apache.hadoop.hdfs.TestDistributedFileSystem
org.apache.hadoop.hdfs.TestFileAppend4
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7735//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7735//console

This message is automatically generated.
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108096#comment-14108096 ]

Jing Zhao commented on HDFS-6908:

Actually, when deleting a directory from the current namespace, it may be better to call {{destroyCreated}} before calling {{cleanSubtreeRecursively}}. This is because:
# The current bug only exists in the snapshot deletion scenario (when deleting a directory, no snapshot diff combination logic is involved).
# {{cleanSubtreeRecursively}} goes through the complete children list of the current directory. That list includes the children in the created list, which should be completely removed anyway. If we call {{destroyCreated}} first, we avoid processing these files/directories in {{cleanSubtreeRecursively}}.
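The ordering argument can be illustrated with a small Java model. Everything here is hypothetical. `DeleteOrderSketch` is an illustrative name, and `destroyCreated` and `cleanSubtreeRecursively` borrow their names from the comment but do not reproduce the HDFS methods. A directory is represented as a list of child names, and the "recursive cleanup" only records which children it visits. The model shows that both orders leave the directory in the same state. Destroying the created list first simply means the cleanup never visits children that are about to be removed anyway.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Toy model of deleting a directory from the current namespace. "created"
 * holds children added since the latest snapshot; no snapshot references
 * them, so they can be destroyed outright. Hypothetical; not HDFS code.
 */
final class DeleteOrderSketch {
  /** Removes every child named in the created list and records it as destroyed. */
  static void destroyCreated(List<String> children, Set<String> created, List<String> destroyed) {
    for (String name : created) {
      if (children.remove(name)) {
        destroyed.add(name);
      }
    }
    created.clear();
  }

  /** Stand-in for per-child snapshot cleanup; returns the children it visited. */
  static List<String> cleanSubtreeRecursively(List<String> children) {
    return new ArrayList<>(children);
  }

  /** Cleanup first: also visits created children that are then destroyed anyway. */
  static List<String> cleanThenDestroy(List<String> children, Set<String> created,
      List<String> destroyed) {
    List<String> visited = cleanSubtreeRecursively(children);
    destroyCreated(children, created, destroyed);
    return visited;
  }

  /** Suggested order: destroy the created children, then clean up what is left. */
  static List<String> destroyThenClean(List<String> children, Set<String> created,
      List<String> destroyed) {
    destroyCreated(children, created, destroyed);
    return cleanSubtreeRecursively(children);
  }
}
```

With children `a, b, c, d`, of which `c` and `d` are in the created list, both orders end with `a, b` remaining and `c, d` destroyed. The cleanup visits four children in the first order and only two in the second.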
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106516#comment-14106516 ] Jing Zhao commented on HDFS-6908: - Yeah, I think that is necessary when deleting a snapshot. But when deleting a dir/file from the current fsdir, I guess it should be ok to place {{cleanSubtreeRecursively}} in the end. incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Reporter: Juan Yu Assignee: Juan Yu Priority: Critical Attachments: HDFS-6908.001.patch In the following scenario, delete snapshot could generate incorrect snapshot directory diff and corrupted fsimage, if you restart NN after that, you will get NullPointerException. 1. create a directory and create a file under it 2. take a snapshot 3. create another file under that directory 4. take second snapshot 5. delete both files and the directory 6. delete second snapshot incorrect directory diff will be generated. 
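The NPE in the stack trace is thrown while the loader rebuilds a directory's deleted list, which is why a bad diff only surfaces at restart. A minimal sketch of that failure mode, assuming (as the addToDeletedList frame suggests, but not confirmed here) that each deleted-list entry is stored as an inode ID and resolved against the inodes actually saved in the image. DanglingDiffCheck and loadDeletedList are hypothetical names, not the FSImageFormatPBSnapshot code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model with hypothetical names; not the HDFS FSImageFormatPBSnapshot loader.
// A snapshot diff's deleted list is kept as inode IDs; loading it means
// resolving each ID against the inodes that were saved in the image.
class DanglingDiffCheck {

    // Resolve deleted-list IDs. Assumption: in the reported bug an ID had no
    // saved inode and the null result was dereferenced (the NPE); this sketch
    // turns the same situation into an explicit error instead.
    static List<String> loadDeletedList(Map<Long, String> savedInodes,
                                        List<Long> deletedIds) {
        List<String> resolved = new ArrayList<>();
        for (long id : deletedIds) {
            String inode = savedInodes.get(id);
            if (inode == null) {
                throw new IllegalStateException(
                    "deleted list refers to inode " + id + ", which is not in the image");
            }
            resolved.add(inode);
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<Long, String> saved = new HashMap<>();
        saved.put(1001L, "dir1");
        saved.put(1002L, "file1");

        System.out.println(loadDeletedList(saved, List.of(1001L, 1002L))); // [dir1, file1]
        try {
            // 1003 stands for an inode that was destroyed but is still listed in a diff.
            loadDeletedList(saved, List.of(1001L, 1003L));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A clearer error would not repair the image; it only makes the dangling reference easier to diagnose. The real fix has to stop the diff from keeping the reference in the first place.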
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107809#comment-14107809 ] Hadoop QA commented on HDFS-6908:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12663603/HDFS-6908.002.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.security.TestRefreshUserMappings
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7732//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7732//console

This message is automatically generated.
incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Reporter: Juan Yu Assignee: Juan Yu Priority: Critical Attachments: HDFS-6908.001.patch, HDFS-6908.002.patch
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106127#comment-14106127 ] Jing Zhao commented on HDFS-6908:
Thanks for working on this, [~j...@cloudera.com]! Actually, this is a case the current code fails to cover, and your analysis makes sense to me. However, if the fix only calls dir.removeChild, the inodes that were created between the prior snapshot and the snapshot being deleted will still be kept in the created list, which can cause a leak. Maybe a better fix is to call {{cleanSubtreeRecursively}} before {{cleanDeletedINode}}:
{code}
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectoryWithSnapshotFeature.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hado
index 9893bba..a4f69f0 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectoryWithSnapshotFeature.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectoryWithSnapshotFeature.java
@@ -722,6 +722,8 @@ boolean computeDiffBetweenSnapshots(Snapshot fromSnapshot,
         counts.add(lastDiff.diff.destroyCreatedList(currentINode,
             collectedBlocks, removedINodes));
       }
+      counts.add(currentINode.cleanSubtreeRecursively(snapshot, prior,
+          collectedBlocks, removedINodes, priorDeleted, countDiffChange));
     } else {
       // update prior
       prior = getDiffs().updatePrior(snapshot, prior);
@@ -739,7 +741,10 @@ boolean computeDiffBetweenSnapshots(Snapshot fromSnapshot,
 
       counts.add(getDiffs().deleteSnapshotDiff(snapshot, prior,
           currentINode, collectedBlocks, removedINodes, countDiffChange));
-
+
+      counts.add(currentINode.cleanSubtreeRecursively(snapshot, prior,
+          collectedBlocks, removedINodes, priorDeleted, countDiffChange));
+
       // check priorDiff again since it may be created during the diff deletion
       if (prior != Snapshot.NO_SNAPSHOT_ID) {
         DirectoryDiff priorDiff = this.getDiffs().getDiffById(prior);
@@ -778,9 +783,7 @@ boolean computeDiffBetweenSnapshots(Snapshot fromSnapshot,
         }
       }
     }
-    counts.add(currentINode.cleanSubtreeRecursively(snapshot, prior,
-        collectedBlocks, removedINodes, priorDeleted, countDiffChange));
-
+
     if (currentINode.isQuotaSet()) {
       currentINode.getDirectoryWithQuotaFeature().addSpaceConsumed2Cache(
           -counts.get(Quota.NAMESPACE), -counts.get(Quota.DISKSPACE));
{code}

incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Reporter: Juan Yu Assignee: Juan Yu Attachments: HDFS-6908.001.patch
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106133#comment-14106133 ] Jing Zhao commented on HDFS-6908:
Another comment on the current patch: we can move the new unit test to TestSnapshotDeletion.java and call {{hdfs.delete(file1, true);}} instead of {{hdfs.delete(file1);}}.

incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Reporter: Juan Yu Assignee: Juan Yu Attachments: HDFS-6908.001.patch
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106231#comment-14106231 ] Juan Yu commented on HDFS-6908:
Thanks [~jingzhao]. Because the directory is deleted, any file created between the prior snapshot and the one being deleted must have been deleted as well, so there are create/delete operation pairs for those files. The file diff processing code adds the file to the removedINodes list. When I debugged the fix, I saw that the inode for the file was deleted correctly, with no leak, and the intermediate create/delete change was also cleaned up after combining the diff with the prior one.
{code}
} else if (topNode.isFile() && topNode.asFile().isWithSnapshot()) {
  INodeFile file = topNode.asFile();
  counts.add(file.getDiffs().deleteSnapshotDiff(post, prior, file,
      collectedBlocks, removedINodes, countDiffChange));
{code}

incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Reporter: Juan Yu Assignee: Juan Yu Priority: Critical Attachments: HDFS-6908.001.patch
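Juan's argument relies on create/delete pairs cancelling when the diffs on either side of the deleted snapshot are combined. A minimal toy model of that combination, using hypothetical names (ToyDirDiff, mergeLaterDiff) that do not correspond to the HDFS DirectoryDiff classes, and assuming that dir1's own diff after s2 records the deletion of both files:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model with hypothetical names; not the HDFS DirectoryDiff/Diff classes.
// The changes one directory saw during one snapshot interval.
class ToyDirDiff {
    final Set<String> created = new LinkedHashSet<>();
    final Set<String> deleted = new LinkedHashSet<>();

    // Fold the following interval's diff into this one, as when the snapshot
    // separating the two intervals is deleted. Returns the children that form
    // a create/delete pair and can therefore be destroyed.
    List<String> mergeLaterDiff(ToyDirDiff later) {
        List<String> destroyed = new ArrayList<>();
        created.addAll(later.created);
        for (String child : later.deleted) {
            if (created.remove(child)) {
                // Created after the earlier snapshot and deleted since:
                // no remaining snapshot can see it.
                destroyed.add(child);
            } else {
                // Existed in the earlier snapshot, so keep the deletion record.
                deleted.add(child);
            }
        }
        return destroyed;
    }

    public static void main(String[] args) {
        ToyDirDiff s1 = new ToyDirDiff();   // dir1's changes between s1 and s2
        s1.created.add("file2");
        ToyDirDiff s2 = new ToyDirDiff();   // dir1's changes after s2 (assumed)
        s2.deleted.add("file1");
        s2.deleted.add("file2");

        List<String> destroyed = s1.mergeLaterDiff(s2);  // delete snapshot s2
        System.out.println("destroyed=" + destroyed);    // destroyed=[file2]
        System.out.println("created=" + s1.created);     // created=[]
        System.out.println("deleted=" + s1.deleted);     // deleted=[file1]
    }
}
```

The cancellation depends entirely on the later diff containing the deletion. If s2's diff for dir1 is empty, which the next comment argues is the case when only the parent records the directory deletion, mergeLaterDiff destroys nothing and file2 stays in the created list.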
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106272#comment-14106272 ] Jing Zhao commented on HDFS-6908:
Thanks for the response, [~j...@cloudera.com].
bq. so there are create/delete pair operations for those files.
The challenge here is that we cannot guarantee that we always have the create/delete pair. Imagine the deletion happens on the directory while the creation happens on a file under the directory. Then we cannot depend on the snapshot diff combination to clean the file. The following unit test (based on your original test case) demonstrates the scenario (but with your patch, the test fails before it reaches the leak check):
{code}
@Test (timeout=6)
public void testDeleteSnapshot() throws Exception {
  final Path root = new Path("/");

  Path dir = new Path("/dir1");
  Path file1 = new Path(dir, "file1");
  DFSTestUtil.createFile(hdfs, file1, BLOCKSIZE, REPLICATION, seed);

  hdfs.allowSnapshot(root);
  hdfs.createSnapshot(root, "s1");

  Path file2 = new Path(dir, "file2");
  DFSTestUtil.createFile(hdfs, file2, BLOCKSIZE, REPLICATION, seed);
  INodeFile file2Node = fsdir.getINode(file2.toString()).asFile();
  long file2NodeId = file2Node.getId();

  hdfs.createSnapshot(root, "s2");

  // delete directory
  assertTrue(hdfs.delete(dir, true));
  assertNotNull(fsdir.getInode(file2NodeId));

  // delete second snapshot
  hdfs.deleteSnapshot(root, "s2");
  assertTrue(fsdir.getInode(file2NodeId) == null);

  NameNodeAdapter.enterSafeMode(cluster.getNameNode(), false);
  NameNodeAdapter.saveNamespace(cluster.getNameNode());

  // restart NN
  cluster.restartNameNodes();
}
{code}

incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Reporter: Juan Yu Assignee: Juan Yu Priority: Critical Attachments: HDFS-6908.001.patch
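The counterexample turns on where the deletion is recorded: root's diff says only that dir1 was deleted, while dir1's own diff still lists file2 as created. A toy sketch of why a recursive pass is needed, under two loudly stated assumptions: the snapshot just deleted is the latest one, and an entry left in the created list of a directory that is gone from the current tree is therefore visible nowhere. ToySnapDir and cleanSubtree are hypothetical; this illustrates the idea behind calling cleanSubtreeRecursively, not how that method is implemented:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model with hypothetical names; not the HDFS INodeDirectory or
// DirectoryWithSnapshotFeature API.
class ToySnapDir {
    final String name;
    // Children created after the prior (oldest remaining) snapshot.
    final Set<String> createdList = new LinkedHashSet<>();
    // Subdirectories still referenced from a snapshot or the current tree.
    final List<ToySnapDir> subdirs = new ArrayList<>();
    // False once the directory has been deleted from the current tree.
    boolean inCurrentTree = true;

    ToySnapDir(String name) {
        this.name = name;
    }

    // Called after the latest snapshot has been deleted (toy assumption).
    // In a directory that is gone from the current tree, a created-list
    // entry is then visible nowhere, so it is collected for removal. The
    // walk recurses because a parent only records "subdirectory deleted",
    // never the deletions of that subdirectory's children.
    void cleanSubtree(List<String> removed) {
        if (!inCurrentTree) {
            removed.addAll(createdList);
            createdList.clear();
        }
        for (ToySnapDir sub : subdirs) {
            sub.cleanSubtree(removed);
        }
    }

    public static void main(String[] args) {
        ToySnapDir root = new ToySnapDir("/");
        ToySnapDir dir1 = new ToySnapDir("dir1");
        dir1.inCurrentTree = false;     // hdfs.delete(dir, true) after s2
        dir1.createdList.add("file2");  // file2 was created between s1 and s2
        root.subdirs.add(dir1);

        List<String> removed = new ArrayList<>();
        root.cleanSubtree(removed);     // after deleting s2
        System.out.println(removed);    // [file2]
    }
}
```

Looking only at root's own records would find nothing to collect; file2 is reached only by descending into dir1, which mirrors what the test's final getInode(file2NodeId) == null check expects.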
[jira] [Commented] (HDFS-6908) incorrect snapshot directory diff generated by snapshot deletion
[ https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106507#comment-14106507 ] Juan Yu commented on HDFS-6908:
[~jingzhao] Thanks for the new unit test and for explaining the difference. I assumed that when a directory is deleted recursively, all of its children are added to the diff list, but that's not how it is implemented: the snapshot diff only records the directory deletion. So the fix you suggested is better. One more question: I think what's really needed is to call {{cleanSubtreeRecursively}} before {{destroyCreatedList}}, isn't it?
{code}
+      counts.add(currentINode.cleanSubtreeRecursively(snapshot, prior,
+          collectedBlocks, removedINodes, priorDeleted, countDiffChange));
       // delete everything in created list
       DirectoryDiff lastDiff = diffs.getLast();
       if (lastDiff != null) {
         counts.add(lastDiff.diff.destroyCreatedList(currentINode,
             collectedBlocks, removedINodes));
       }
     } else {
       // update prior
       prior = getDiffs().updatePrior(snapshot, prior);
@@ -739,7 +741,10 @@ boolean computeDiffBetweenSnapshots(Snapshot fromSnapshot,
 
       counts.add(getDiffs().deleteSnapshotDiff(snapshot, prior,
           currentINode, collectedBlocks, removedINodes, countDiffChange));
-
+
+      counts.add(currentINode.cleanSubtreeRecursively(snapshot, prior,
+          collectedBlocks, removedINodes, priorDeleted, countDiffChange));
+
       // check priorDiff again since it may be created during the diff deletion
       if (prior != Snapshot.NO_SNAPSHOT_ID) {
         DirectoryDiff priorDiff = this.getDiffs().getDiffById(prior);
@@ -778,9 +783,7 @@ boolean computeDiffBetweenSnapshots(Snapshot fromSnapshot,
         }
       }
     }
-    counts.add(currentINode.cleanSubtreeRecursively(snapshot, prior,
-        collectedBlocks, removedINodes, priorDeleted, countDiffChange));
-
+
     if (currentINode.isQuotaSet()) {
       currentINode.getDirectoryWithQuotaFeature().addSpaceConsumed2Cache(
           -counts.get(Quota.NAMESPACE), -counts.get(Quota.DISKSPACE));
{code}

incorrect snapshot directory diff generated by snapshot deletion Key: HDFS-6908 URL: https://issues.apache.org/jira/browse/HDFS-6908 Project: Hadoop HDFS Issue Type: Bug Components: snapshots Reporter: Juan Yu Assignee: Juan Yu Priority: Critical Attachments: HDFS-6908.001.patch