Hello,

I am trying to recover a namenode that failed, possibly using the checkpoint node. When I start dfs, I get the errors shown in the logs at the end of this email. I think my metadata is corrupt, probably because Hadoop was checkpointing when the machine shut down. Note that this is a pseudo-distributed installation. The content of namedir is also listed at the end of this email.

I tried replacing the current fsimage with the checkpoint fsimage, removing edits.new and using an empty edits file. This gives me a working HDFS, but its state is too old.

Do you have any suggestions for recovering the most recent fsimage, maybe by fiddling with edits and edits.new?

Thanks very much in advance,
Juan

-------------------------------------
content of namedir

ls -l -R /scratch/namedir/
/scratch/namedir/:
total 12
drwxr-xr-x 2 hadoop hadoop 4096 2012-03-22 22:06 current
drwxr-xr-x 2 hadoop hadoop 4096 2012-03-20 16:18 image
drwxr-xr-x 2 hadoop hadoop 4096 2012-03-20 17:28 previous.checkpoint

/scratch/namedir/current:
total 2168
-rw-r--r-- 1 hadoop hadoop    6417 2012-03-20 19:28 edits
-rw-r--r-- 1 hadoop hadoop 2094127 2012-03-22 17:25 edits.new
-rw-r--r-- 1 hadoop hadoop  105538 2012-03-20 18:28 fsimage
-rw-r--r-- 1 hadoop hadoop       8 2012-03-22 22:06 fstime
-rw-r--r-- 1 hadoop hadoop     101 2012-03-20 18:28 VERSION

/scratch/namedir/image:
total 4
-rw-r--r-- 1 hadoop hadoop 157 2012-03-20 18:28 fsimage

/scratch/namedir/previous.checkpoint:
total 160
-rw-r--r-- 1 hadoop hadoop 85345 2012-03-20 18:28 edits
-rw-r--r-- 1 hadoop hadoop 67295 2012-03-20 17:28 fsimage
-rw-r--r-- 1 hadoop hadoop     8 2012-03-20 17:28 fstime
-rw-r--r-- 1 hadoop hadoop   101 2012-03-20 17:28 VERSION

-------------------------------------
logs when starting dfs

hadoop-hadoop-secondarynamenode-mymachine.log
2012-03-23 10:58:50,617 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 7 time(s).
2012-03-23 10:58:51,618 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 8 time(s).

hadoop-hadoop-secondarynamenode-mymachine.log
2012-03-23 10:59:19,434 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s).
2012-03-23 10:59:20,434 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 2 time(s).

hadoop-hadoop-namenode-mymachine.log
2012-03-23 10:58:40,988 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException: Panic: parent does not exist
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1508)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1522)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1407)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:216)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:526)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:1019)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:483)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:270)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:433)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:421)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)
2012-03-23 10:58:40,989 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at air099/127.0.1.1
************************************************************/
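
-------------------------------------
sketch of the rollback described above

For concreteness, the rollback to the checkpoint fsimage mentioned at the top of this email could be sketched roughly as below. This is only an illustration and not the exact commands that were run: the function name rollback_to_checkpoint and the idea of working on a separate copy are assumptions, and so is the use of a zero-length file as the "empty" edits file (some NameNode versions may expect a header in it).

```shell
#!/bin/sh
# Sketch only: roll a COPY of the name directory back to the previous
# checkpoint. rollback_to_checkpoint is a hypothetical helper name.
rollback_to_checkpoint() {
    src=$1    # original name directory, e.g. /scratch/namedir
    work=$2   # scratch copy to experiment on; must not exist yet

    [ -d "$src/current" ] || { echo "no current/ in $src" >&2; return 1; }
    [ -f "$src/previous.checkpoint/fsimage" ] || {
        echo "no checkpoint fsimage in $src" >&2; return 1; }
    [ ! -e "$work" ] || { echo "$work already exists" >&2; return 1; }

    # Copy the whole directory, preserving timestamps and permissions.
    # The original is never modified.
    cp -Rp "$src" "$work" || return 1

    # Replace the current fsimage with the checkpoint fsimage.
    cp -p "$work/previous.checkpoint/fsimage" "$work/current/fsimage" || return 1

    # Drop the edit log that was being written during the checkpoint.
    rm -f "$work/current/edits.new"

    # Assumption: a zero-length file stands in for an "empty" edits file.
    : > "$work/current/edits"
}
```

It might be called as `rollback_to_checkpoint /scratch/namedir /scratch/namedir.copy` (the second path is hypothetical), after which dfs.name.dir would have to point at the copy. Working on a copy keeps the original edits and edits.new intact for any later recovery attempt.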