The namenode on an otherwise very stable HDFS cluster crashed recently.  The 
filesystem on the namenode host filled up, which I assume is what caused the 
crash.  The disk-space problem has been fixed, but I cannot get the namenode to 
restart.  I am using version CDH3b2 (hadoop-0.20.2+320). 

The error is this: 

2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...
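The ^@ characters in that input string are how NUL bytes (0x00) are rendered, which makes me suspect the tail of the edits file got zero-filled when the disk ran out of space.  A quick illustration of the failure mode (my own Python sketch, not from the namenode; the "128" prefix is just a stand-in for whatever digits precede the NULs):

```python
# Parsing a decimal prefix followed by NUL bytes fails the same way
# Long.parseLong fails in the namenode log. "^@" is the terminal
# rendering of the NUL byte (0x00).
corrupt = "128" + "\x00" * 8   # hypothetical zero-filled tail of a record
try:
    int(corrupt)
except ValueError as e:
    print("parse failed:", e)
```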

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing 
the edits file with a hex editor, but does not explain where the record 
boundaries are.  The exception described there is different, but the cause 
seemed similar: a corrupt edits file.  I tried removing one line at a time, but 
the error persists, just with a smaller size and edit count: 

2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        ...
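Rather than removing lines by eye, it might help to find exactly where the zero padding begins, on the assumption that everything before the first run of trailing NULs is intact.  Here's the sketch I've been working from (shown on a synthetic file; I'd run it against a copy of /mnt/name/current/edits, never the original):

```python
# Sketch: find the offset just past the last non-NUL byte, i.e. where
# trailing zero padding starts. Assumes the disk-full crash zero-filled
# the tail of the edits file, which may not be the whole story.
import os
import tempfile

def last_nonzero_offset(path):
    """Return the offset just past the last non-NUL byte (truncate point)."""
    with open(path, "rb") as f:
        data = f.read()
    end = len(data)
    while end > 0 and data[end - 1:end] == b"\x00":
        end -= 1
    return end

# Synthetic stand-in for the real edits file: records + zero padding.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"REAL-EDIT-RECORDS" + b"\x00" * 1024)
tmp.close()

cut = last_nonzero_offset(tmp.name)
print("truncate at byte", cut)  # 17 for this synthetic file
os.remove(tmp.name)
```

Truncating at that offset could still leave a partial final record, so this only narrows down where the hex editing has to happen.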

I tried removing the edits file altogether, but that failed with: 
java.io.IOException: Edits file is not found

I then tried a zero-length edits file, so that at least a file would be there, 
but that results in an NPE: 

2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
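If I understand the format correctly, the edits file begins with a 4-byte layout-version header, so a "logically empty" edits file is 4 bytes, not 0.  One thing I'm considering (a sketch only, and I may have the format wrong) is keeping just the header from a copy of the original file:

```python
# Sketch: produce an edits file with only the 4-byte layout-version
# header and no records, instead of a zero-length file. Assumption
# (possibly wrong): the 0.20 edits format starts with a big-endian int
# layout version.
import struct

def write_header_only(src_path, dst_path):
    """Copy only the first 4 bytes (the header) from an existing edits file."""
    with open(src_path, "rb") as src:
        header = src.read(4)
    with open(dst_path, "wb") as dst:
        dst.write(header)

# Demo on a synthetic "edits" file: -18 is a made-up version value here,
# followed by fake record bytes.
with open("edits.orig", "wb") as f:
    f.write(struct.pack(">i", -18) + b"\x01fake-records")
write_header_only("edits.orig", "edits")
with open("edits", "rb") as f:
    print(struct.unpack(">i", f.read())[0])  # -18
```

In practice the header bytes would come from the real corrupt file, so whatever version int it carries is preserved.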


Most, if not all, of the files I noticed in the edits file are temporary files 
that will be deleted once the cluster is back up and running anyway.  There is 
a closed ticket that might be related, 
https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm using 
already seems to include the HDFS-686 fix (according to 
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html). 

What do I have to do to get back up and running?

Thank you for your help, 

Matthew

