We had almost the exact same problem: the namenode's filesystem filled up and the namenode failed at this exact same point. Since you have created space now, you can copy over the edits.new, fsimage and the other 2 files from your /mnt/namesecondarynode/current and try restarting the namenode. I believe you will lose some edits and probably some blocks of some files, but we were able to recover most of our files.

-Ayon
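A minimal sketch of the restore step Ayon describes. The paths come from the thread; the exact file list in a 0.20 checkpoint directory is an assumption here (fstime and VERSION are my guess at "the other 2 files"), so check what actually exists in your fs.checkpoint.dir before copying. Stop the namenode and back up the broken dfs.name.dir first.

```python
# Hedged sketch: restore namenode metadata from the secondary namenode's
# checkpoint directory. Paths are from the thread; adjust to your own
# dfs.name.dir and fs.checkpoint.dir. The file list below is an assumption
# ("fstime" and "VERSION" are guesses at "the other 2 files").
import os
import shutil

CHECKPOINT_DIR = "/mnt/namesecondarynode/current"  # fs.checkpoint.dir
NAME_DIR = "/mnt/name/current"                     # dfs.name.dir


def restore_from_checkpoint(src=CHECKPOINT_DIR, dst=NAME_DIR):
    """Copy checkpoint metadata files into the namenode's image directory.

    Files that do not exist in the checkpoint dir (e.g. edits.new) are
    silently skipped.
    """
    for name in ("fsimage", "edits", "edits.new", "fstime", "VERSION"):
        src_path = os.path.join(src, name)
        if os.path.exists(src_path):
            shutil.copy2(src_path, os.path.join(dst, name))
```

After copying, restart the namenode and run fsck; as noted above, expect to lose the edits made after the last checkpoint.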
________________________________
From: Matthew LeMieux <m...@mlogiciels.com>
To: hdfs-user@hadoop.apache.org
Sent: Tue, October 5, 2010 8:16:15 AM
Subject: NameNode crash - cannot start dfs - need help

The namenode on an otherwise very stable HDFS cluster crashed recently. The filesystem filled up on the namenode, which I assume is what caused the crash. The disk space problem has been fixed, but I cannot get the namenode to restart. I am using version CDH3b2 (hadoop-0.20.2+320).

The error is this:

  2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
  2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
          at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
          at java.lang.Long.parseLong(Long.java:419)
          at java.lang.Long.parseLong(Long.java:468)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
          ...

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. The exception described there is different, but it seemed to have a similar cause: the edits file. I tried removing a line at a time, but the error continues, only with a smaller size and edits #:

  2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
  2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
          at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
          at java.lang.Long.parseLong(Long.java:419)
          at java.lang.Long.parseLong(Long.java:468)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
          at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
          ...

I tried removing the edits file altogether, but that failed with:

  java.io.IOException: Edits file is not found

I tried with a zero-length edits file, so it would at least have a file there, but that results in an NPE:

  2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
  2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
          at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)

Most if not all of the files I noticed in the edits file are temporary files that will be deleted once this thing gets back up and running anyway.

There is a closed ticket that might be related: https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm using already seems to include the HDFS-686 fix (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).

What do I have to do to get back up and running?

Thank you for your help,

Matthew
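One hedged alternative to hand hex-editing: the "^@" characters in the NumberFormatException input string are NUL bytes, which suggests the tail of the edits file was zero-filled when the disk filled up. The sketch below finds the first long run of NUL bytes and truncates the file just before it. The function names and the run_len=32 threshold are mine, not from the thread; the assumption that a long NUL run only occurs in the corrupt tail may not hold for every edits file, so run this only on a copy of /mnt/name/current/edits, never the original.

```python
# Hedged sketch: truncate a corrupt HDFS edits file just before the first
# long run of NUL bytes. The "^@^@..." in the NumberFormatException suggests
# the tail was zero-filled when the disk ran out of space. run_len and the
# function names are hypothetical, not from the thread. Work on a COPY.
def find_nul_run(data, run_len=32):
    """Return the offset of the first run of run_len NUL bytes, or -1."""
    return data.find(b"\x00" * run_len)


def truncate_before_nuls(path, run_len=32):
    """Truncate the file at the first long NUL run; return the new size.

    If no NUL run is found, the file is left untouched.
    """
    with open(path, "rb") as f:
        data = f.read()
    offset = find_nul_run(data, run_len)
    if offset < 0:
        return len(data)  # no NUL run; leave the file alone
    with open(path, "r+b") as f:
        f.truncate(offset)
    return offset
```

Usage would be something like truncate_before_nuls("/tmp/edits.copy"), then swapping the truncated copy into dfs.name.dir and attempting a restart. Even if the truncation point is slightly wrong, it may land close enough to a record boundary to narrow down where a hex editor is needed.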