[ https://issues.apache.org/jira/browse/HDFS-9406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123987#comment-15123987 ]
Jing Zhao commented on HDFS-9406:
---------------------------------

Thanks for the patch, Yongjun! The patch looks good to me. But it looks like we need to fix {{TestINodeFile#testClearBlocks}} because of the new {{clearBlocks}} logic. In the meantime, can we also add a test for the case you mentioned in this jira? Although this one may not cause fsimage corruption, we can check whether the file has finally been deleted from the inodeMap.
{code}
  @Test
  public void testRenameAndDelete() throws IOException {
    final Path foo = new Path("/foo");
    final Path x = new Path(foo, "x");
    final Path y = new Path(foo, "y");
    final Path trash = new Path("/trash");
    fs.mkdirs(x);
    fs.mkdirs(y);
    fs.mkdirs(trash);
    fs.allowSnapshot(foo);
    // 1. create snapshot s0
    fs.createSnapshot(foo, "s0");

    // 2. create file /foo/x/bar
    final Path file = new Path(x, "bar");
    DFSTestUtil.createFile(fs, file, BLOCKSIZE, (short) 1, 0L);
    final long fileId = fsdir.getINode4Write(file.toString()).getId();

    // 3. move file into /foo/y
    final Path newFile = new Path(y, "bar");
    fs.rename(file, newFile);

    // 4. create snapshot s1
    fs.createSnapshot(foo, "s1");

    // 5. move /foo/y to /trash
    final Path deletedY = new Path(trash, "y");
    fs.rename(y, deletedY);

    // 6. create snapshot s2
    fs.createSnapshot(foo, "s2");

    // 7. delete /trash/y
    fs.delete(deletedY, true);

    // 8. delete snapshot s1
    fs.deleteSnapshot(foo, "s1");

    // make sure bar has been cleaned
    Assert.assertNull(fsdir.getInode(fileId));
  }
{code}
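For reference, here is a minimal scaffolding sketch for running the snippet above in a {{MiniDFSCluster}}-based test class. The {{fs}}, {{fsdir}}, and {{BLOCKSIZE}} members are hypothetical stand-ins that simply mirror the names used in the test; the actual enclosing test class in the patch may differ.
{code}
  // Hypothetical scaffolding only (not part of the patch): provides the
  // fs / fsdir / BLOCKSIZE members that testRenameAndDelete() refers to.
  // (imports assumed: org.apache.hadoop.conf.Configuration,
  //  org.apache.hadoop.hdfs.*, org.apache.hadoop.hdfs.server.namenode.FSDirectory,
  //  org.junit.*)
  private static final long BLOCKSIZE = 1024;
  private MiniDFSCluster cluster;
  private DistributedFileSystem fs;
  private FSDirectory fsdir;

  @Before
  public void setUp() throws Exception {
    final Configuration conf = new Configuration();
    conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, BLOCKSIZE);
    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    cluster.waitActive();
    fs = cluster.getFileSystem();
    fsdir = cluster.getNamesystem().getFSDirectory();
  }

  @After
  public void tearDown() throws Exception {
    if (cluster != null) {
      cluster.shutdown();
    }
  }
{code}
This follows the usual MiniDFSCluster setup pattern in the existing snapshot tests, so it should slot into whichever test class the new case ends up in.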
> FSImage corruption after taking snapshot
> ----------------------------------------
>
>                 Key: HDFS-9406
>                 URL: https://issues.apache.org/jira/browse/HDFS-9406
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: CentOS 6 amd64, CDH 5.4.4-1
> 2xCPU: Intel(R) Xeon(R) CPU E5-2640 v3
> Memory: 32GB
> Namenode blocks: ~700_000 blocks, no HA setup
>            Reporter: Stanislav Antic
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-9406.001.patch, HDFS-9406.002.patch
>
>
> FSImage corruption happened after HDFS snapshots were taken. The cluster was not in use at that time.
> When the namenode restarted it reported a NullPointerException:
> {code}
> 15/11/07 10:01:15 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/fsimage_checker_5857/fsimage/current
> 15/11/07 10:01:15 INFO namenode.FSImage: No edit log streams selected.
> 15/11/07 10:01:18 INFO namenode.FSImageFormatPBINode: Loading 1370277 INodes.
> 15/11/07 10:01:27 ERROR namenode.NameNode: Failed to start namenode.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addChild(INodeDirectory.java:531)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.addToParent(FSImageFormatPBINode.java:252)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:202)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:261)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
>         at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:929)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:913)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:732)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:668)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:643)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
> 15/11/07 10:01:27 INFO util.ExitUtil: Exiting with status 1
> {code}
> The corruption happened after "07.11.2015 00:15", and after that time ~9300 blocks were invalidated that shouldn't have been.
> After recovering the FSImage I discovered that ~9300 blocks were missing.
> -I also attached the namenode log from before and after the corruption happened.-

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)