Thank you Ayon, Allen and Todd for your suggestions. I was tempted to try to find the offending records in edits.new, but opted for simply moving the file instead. I kept the recently edited edits file in place.
The namenode started up this time with no exceptions and appears to be running well; hadoop fsck / reports a healthy filesystem.

Thank you,

Matthew

On Oct 5, 2010, at 10:09 AM, Todd Lipcon wrote:

> On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
> Thank you Todd.
>
> It does indeed seem like a challenge to find a record boundary, but if I wanted to do it... here is how I did it, in case others are interested in doing the same.
>
> It looks like that value (0xFF) is referenced as OP_INVALID in the source file:
> [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
>
> Every record begins with an op code that describes the record. The op codes are in the range [0,14] (inclusive), except for OP_INVALID. Each record type (based on op code) appears to have a different format. Additionally, it seems that the code for each record type has several code paths to support different versions of HDFS.
>
> I looked in the error messages and found the line number of the exception within the switch statement in the code (in this case, line 563). That told me that I was looking for an op code of either 0x00 or 0x09. I noticed that this particular code path had a record type that looked like this ([# bytes: name]):
>
> [1: op code][4: int length][2: file system path length][?: file system path text]
>
> All I had to do was find a filesystem path and look 7 bytes before it started. If the op code was a 0x00 or 0x09, then this was a candidate record.
>
> It would have been easier to just search for something from the error message (i.e. "12862" for me) to find candidate records, but in my case that was in almost every record. Additionally, it would also have been easier to just search for instances of the op code, but in my case one of the op codes (0x00) appears too often in the data to make that useful.
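The scan described above can be sketched in a few lines of Python. This is a rough sketch, not a recovery tool: it assumes the simplified record layout quoted above (1-byte op code, 4-byte length, 2-byte big-endian path length, then the path text), and `PATH_PREFIX` is a hypothetical placeholder for a path you know appears in your own log.

```python
# Sketch of the boundary scan from the thread: find a known filesystem path
# prefix, then check whether the byte 7 positions earlier (1 op code +
# 4-byte length + 2-byte path length) is one of the candidate op codes.
import struct

PATH_PREFIX = b"/user/"           # hypothetical: a path known to be in the log
CANDIDATE_OPCODES = {0x00, 0x09}  # from the line-563 code path discussed above

def find_candidate_records(data):
    """Yield (offset_of_opcode, opcode) for positions that look like records."""
    pos = data.find(PATH_PREFIX)
    while pos != -1:
        if pos >= 7:
            opcode = data[pos - 7]
            # 2-byte big-endian length, as written by Java's writeShort
            (path_len,) = struct.unpack(">H", data[pos - 2:pos])
            # sanity check: stored length must at least cover the prefix
            if opcode in CANDIDATE_OPCODES and path_len >= len(PATH_PREFIX):
                yield pos - 7, opcode
        pos = data.find(PATH_PREFIX, pos + 1)

# Against a real log you would read /mnt/name/current/edits; here a tiny
# synthetic record demonstrates the scan:
sample = b"\x09\x00\x00\x00\x0d\x00\x06/user/"
print(list(find_candidate_records(sample)))  # → [(0, 9)]
```

A match only marks a candidate boundary; as the thread notes, false positives are likely when the op code byte (especially 0x00) also occurs inside record payloads, so each hit still needs to be eyeballed in a hex editor.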
> If your op code is 0x03, for example, you will probably have a much easier time of it than I did.
>
> I was able to successfully and quickly find record boundaries and replace the op code with 0xFF. After a few records I was back to the NPE that I was getting with a zero-length edits file:
>
> 2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0 seconds.
> 2010-10-05 16:47:39,671 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
>
> One hurdle down; how do I get past the next one?
>
> It's unclear whether you're getting the error in "edits" or "edits.new". From the above, I'm guessing maybe "edits" is corrupt, so when you fixed the error there (by truncating a few edits from the end), the later edits in edits.new failed, because they depended on a path that should have been created by "edits".
>
> (BTW, what if I didn't want to keep my recent edits, and just wanted to start up the namenode?
> This is currently expensive downtime; I'd rather lose a small amount of data and be up and running than continue the downtime.)
>
> If you really want to do this, you can remove "edits.new" and replace "edits" with a file containing hex 0xffffffeeff, I believe (edits header plus OP_INVALID).
>
> -Todd
>
> On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:
>
>> Hi Matt,
>>
>> If you want to keep your recent edits, you'll have to place an 0xFF at the beginning of the most recent edit entry in the edit log. It's a bit tough to find these boundaries, but you can try applying this patch and rebuilding:
>>
>> https://issues.apache.org/jira/browse/hdfs-1378
>>
>> This will tell you the offset of the broken entry ("recent opcodes") and you can put an 0xFF there to tie off the file before the corrupt entry.
>>
>> -Todd
>>
>>
>> On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
>> The namenode on an otherwise very stable HDFS cluster crashed recently. The filesystem filled up on the namenode, which I assume is what caused the crash. The problem has been fixed, but I cannot get the namenode to restart. I am using version CDH3b2 (hadoop-0.20.2+320).
>>
>> The error is this:
>>
>> 2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
>> 2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. It is a different exception, but it seemed like a similar cause: the edits file. I tried removing a line at a time, but the error continues, only with a smaller size and edits #:
>>
>> 2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
>> 2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> I tried removing the edits file altogether, but that failed with:
>> java.io.IOException: Edits file is not found
>>
>> I tried with a zero-length edits file, so it would at least have a file there, but that results in an NPE:
>>
>> 2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
>> 2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>>
>> Most, if not all, of the files I noticed in the edits file are temporary files that will be deleted once this thing gets back up and running anyway. There is a closed ticket that might be related: https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm using seems to already have HDFS-686 (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).
>>
>> What do I have to do to get back up and running?
>>
>> Thank you for your help,
>>
>> Matthew
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
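For reference, Todd's suggested "throw away the recent edits" replacement file (the bytes 0xffffffeeff he gives above: a 4-byte edits header followed by a single OP_INVALID byte) can be produced with a short script. This is a sketch of his suggestion only; 0xFFFFFFEE is -18 as a big-endian int, which I take to be the edit log layout version for this 0.20-era release, so verify the header value against your own cluster before relying on it.

```python
# Sketch of Todd's suggestion: an "empty" edits file consisting of the
# 4-byte layout version header (0xFFFFFFEE, i.e. -18; assumed to match
# this 0.20-era HDFS) followed by a single OP_INVALID (0xFF) byte.
EMPTY_EDITS = bytes([0xFF, 0xFF, 0xFF, 0xEE, 0xFF])

# In a real recovery this path would be the namenode's current/ directory,
# e.g. /mnt/name/current/edits; "edits" here is a local placeholder.
with open("edits", "wb") as f:
    f.write(EMPTY_EDITS)
```

As the thread shows, a truly zero-length edits file triggers an NPE at load time, so the header-plus-OP_INVALID form is the minimal file the namenode will accept; back up the original `edits` and `edits.new` before replacing anything.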