[ 
https://issues.apache.org/jira/browse/HDFS-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310377#comment-14310377
 ] 

Abhishek Rai commented on HDFS-6908:
------------------------------------

We have an HDFS installation in production where we ran into this problem.  
Because the fsimage is corrupt, the namenode fails to start, leaving the system 
unusable.  We suspect that in our case the problem was triggered by deleting 
one of the existing snapshots of a large directory containing several 
sub-directories and files.

While the proposed fix above is definitely useful to prevent this issue going 
forward, is there any recommendation for how to fix an fsimage which was 
already corrupted by this bug?

We temporarily put in the following hack in FSImageFormatPBSnapshot.java to 
mask this problem.  The risk with this hack is that it can mask other 
bugs/corruptions, and since it doesn't fix the corrupt fsimage on disk, the 
hack will always be needed to make the namenode work.

{noformat}
+++ b/hadoop/hadoop-2.5.1-src/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/FSImageFormatPBSnapshot.java
@@ -34,6 +34,8 @@
 import java.util.List;
 import java.util.Map;
 
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.classification.InterfaceAudience;
 import org.apache.hadoop.fs.permission.PermissionStatus;
 import org.apache.hadoop.hdfs.server.namenode.AclFeature;
@@ -73,6 +75,9 @@
 
 @InterfaceAudience.Private
 public class FSImageFormatPBSnapshot {
+  public static final Log LOG = LogFactory.getLog(FSImageFormatPBSnapshot.class);
+
+
   /**
    * Loading snapshot related information from protobuf based FSImage
    */
@@ -267,8 +272,12 @@ private void addToDeletedList(INode dnode, INodeDirectory parent) {
       // load non-reference inodes
       for (long deletedId : deletedNodes) {
         INode deleted = fsDir.getInode(deletedId);
-        dlist.add(deleted);
-        addToDeletedList(deleted, dir);
+        if (deleted != null) {
+          dlist.add(deleted);
+          addToDeletedList(deleted, dir);
+        } else {
+          LOG.error("Could not find inode " + deletedId + " from deleted-list of directory: " + dir.toDetailString());
+        }
{noformat}
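
To show the idea of the guard in isolation, here is a minimal, self-contained sketch. It is not Hadoop code: `DeletedListGuardSketch`, `Resolved`, and `resolveDeleted` are hypothetical names, and a plain `Map<Long, String>` stands in for the namenode's inode lookup. It assumes, as the hack does, that a deleted-list id with no matching inode should be skipped and reported rather than added as null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types stand in for the namenode's inode map and a
// snapshot diff's deleted list. None of them are Hadoop classes.
class DeletedListGuardSketch {

  /** Result of resolving a deleted list: inodes found, and ids with no inode. */
  static final class Resolved {
    final List<String> found = new ArrayList<>();
    final List<Long> missingIds = new ArrayList<>();
  }

  /**
   * Resolves each id in deletedIds against inodeMap. An id with no entry is
   * recorded in missingIds instead of being added to the result as null,
   * which is the case that would otherwise surface later as an NPE.
   */
  static Resolved resolveDeleted(Map<Long, String> inodeMap, List<Long> deletedIds) {
    Resolved result = new Resolved();
    for (long id : deletedIds) {
      String inode = inodeMap.get(id);
      if (inode != null) {
        result.found.add(inode);
      } else {
        result.missingIds.add(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Long, String> inodeMap = new HashMap<>();
    inodeMap.put(1001L, "file-a");
    inodeMap.put(1002L, "file-b");

    // 1003 is referenced by the deleted list but has no inode entry,
    // mimicking a dangling reference in a corrupt image.
    Resolved r = resolveDeleted(inodeMap, Arrays.asList(1001L, 1003L, 1002L));
    System.out.println("resolved: " + r.found);
    System.out.println("missing ids: " + r.missingIds);
  }
}
```

Collecting the missing ids, rather than only logging them, is a choice made for this sketch; like the hack itself, it only tolerates the dangling references and does not repair the image on disk.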

> incorrect snapshot directory diff generated by snapshot deletion
> ----------------------------------------------------------------
>
>                 Key: HDFS-6908
>                 URL: https://issues.apache.org/jira/browse/HDFS-6908
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: snapshots
>            Reporter: Juan Yu
>            Assignee: Juan Yu
>            Priority: Critical
>             Fix For: 2.6.0
>
>         Attachments: HDFS-6908.001.patch, HDFS-6908.002.patch, 
> HDFS-6908.003.patch
>
>
> In the following scenario, deleting a snapshot can generate an incorrect 
> snapshot directory diff and a corrupted fsimage. If you restart the NN after 
> that, you will get a NullPointerException.
> 1. create a directory and create a file under it
> 2. take a snapshot
> 3. create another file under that directory
> 4. take second snapshot
> 5. delete both files and the directory
> 6. delete second snapshot
> An incorrect directory diff will be generated.
> Restarting the NN will then throw an NPE:
> {code}
> java.lang.NullPointerException
>       at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.addToDeletedList(FSImageFormatPBSnapshot.java:246)
>       at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.loadDeletedList(FSImageFormatPBSnapshot.java:265)
>       at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.loadDirectoryDiffList(FSImageFormatPBSnapshot.java:328)
>       at org.apache.hadoop.hdfs.server.namenode.snapshot.FSImageFormatPBSnapshot$Loader.loadSnapshotDiffSection(FSImageFormatPBSnapshot.java:192)
>       at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:254)
>       at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:168)
>       at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:208)
>       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:906)
>       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:892)
>       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:715)
>       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:653)
>       at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:276)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:882)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:629)
>       at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:498)
>       at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:554)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
