The namenode on an otherwise very stable HDFS cluster crashed recently. The filesystem on the namenode host filled up, which I assume is what caused the crash. The disk-space problem has been fixed, but I cannot get the namenode to restart. I am running CDH3b2 (hadoop-0.20.2+320).
The error is this:

    2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
    2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. It describes a different exception, but the cause seemed similar: a corrupt edits file. I tried removing one line at a time, but the error persists, just with a smaller size and edits count:

    2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
    2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...
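Since the garbage in the input string looks like runs of NUL bytes (`^@` is NUL), my guess is that the tail of the edits file is zero padding written when the disk filled up. If that's right, trimming the trailing NULs from a copy of the file might be enough. A quick sketch of what I have in mind (my own script, not a Hadoop tool; paths are placeholders, and it would over-trim if a valid final record legitimately ended in NUL bytes):

```python
import shutil

def strip_trailing_nuls(path):
    """Truncate trailing NUL (0x00) bytes from a file in place.

    Assumes the corruption is zero padding appended when the disk
    filled up; everything before the padding is left untouched.
    Returns the number of bytes removed.
    """
    with open(path, "rb") as f:
        data = f.read()
    trimmed = data.rstrip(b"\x00")
    with open(path, "wb") as f:
        f.write(trimmed)
    return len(data) - len(trimmed)

# Always work on a copy, never the original (paths are placeholders):
# shutil.copy2("/mnt/name/current/edits", "/tmp/edits.trimmed")
# removed = strip_trailing_nuls("/tmp/edits.trimmed")
# print("stripped %d trailing NUL bytes" % removed)
```

Even then, if the last edit record was only half-written (real data followed by padding), the file might still need a few more bytes cut by hand at the last record boundary.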
I tried removing the edits file altogether, but that failed with:

    java.io.IOException: Edits file is not found

I then tried a zero-length edits file, so there would at least be a file present, but that results in an NPE:

    2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
    2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)

Most if not all of the files I noticed in the edits file are temporary files that will be deleted once this thing gets back up and running anyway. There is a closed ticket that might be related, https://issues.apache.org/jira/browse/HDFS-686, but the version I'm using appears to already include that fix (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).

What do I have to do to get back up and running?

Thank you for your help,
Matthew