On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
> Thank you Todd.
>
> It does indeed seem like a challenge to find a record boundary, but if I
> wanted to do it... here is how I did it, in case others are interested in
> doing the same.
>
> It looks like that value (0xFF) is referenced as OP_INVALID in the source
> file:
> [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
>
> Every record begins with an op code that describes the record. The op
> codes are in the range [0, 14] (inclusive), except for OP_INVALID. Each
> record type (based on op code) appears to have a different format.
> Additionally, it seems that the code for each record type has several
> code paths to support different versions of HDFS.
>
> I looked in the error message and found the line number of the exception
> within the switch statement in the code (in this case, line 563). That
> told me that I was looking for an op code of either 0x00 or 0x09. I
> noticed that this particular code path handled a record type that looked
> like this ([# bytes: name]):
>
> [1: op code][4: int length][2: file system path length][?: file system path text]
>
> All I had to do was find a filesystem path and look 7 bytes before it
> started. If the op code was 0x00 or 0x09, then this was a candidate
> record.
>
> It would have been easier to just search for something from the error
> message (e.g., "12862" for me) to find candidate records, but in my case
> that string appeared in almost every record. It would also have been
> easier to just search for instances of the op code, but in my case one of
> the op codes (0x00) appears too often in the data to make that useful.
> If your op code is 0x03, for example, you will probably have a much
> easier time of it than I did.
>
> I was able to successfully and quickly find record boundaries and
> replace the op code with 0xFF. After a few records, I was back to the
> NPE that I was getting with a zero-length edits file:
>
> 2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0 seconds.
> 2010-10-05 16:47:39,671 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
>
> One hurdle down; how do I get past the next one?

It's unclear whether you're getting the error in "edits" or "edits.new".
From the above, I'm guessing "edits" is corrupt, so when you fixed the
error there (by truncating a few edits from the end), the later edits in
"edits.new" failed because they depended on a path that should have been
created by "edits".
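As a rough illustration of the boundary hunt described above, here is a
minimal, unofficial sketch in Java. It assumes the record layout Matthew
gives ([1: op code][4: int length][2: path length][path bytes]) and uses a
printable-ASCII, leading-slash heuristic to spot path strings; the class
name and the filter are made up for this example and are not part of Hadoop:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Heuristic scanner for candidate OP_ADD (0x00) / OP_CLOSE (0x09)
    // records in a 0.20-era edits file. Layout assumed from the
    // description above: [1: op code][4: int length][2: path length][path bytes]
    public class EditsBoundaryScanner {
        public static void main(String[] args) throws Exception {
            byte[] b = Files.readAllBytes(Paths.get(args[0]));
            for (int i = 0; i + 7 < b.length; i++) {
                int op = b[i] & 0xFF;
                if (op != 0x00 && op != 0x09) continue; // only the op codes from line 563
                // 2-byte big-endian path length; the path text starts 7 bytes
                // after the op code
                int pathLen = ((b[i + 5] & 0xFF) << 8) | (b[i + 6] & 0xFF);
                if (pathLen < 1 || i + 7 + pathLen > b.length) continue;
                if (b[i + 7] != '/') continue;          // HDFS paths are absolute
                boolean looksLikePath = true;
                for (int j = i + 7; j < i + 7 + pathLen; j++) {
                    int c = b[j] & 0xFF;
                    if (c < 0x20 || c > 0x7E) { looksLikePath = false; break; }
                }
                if (looksLikePath) {
                    System.out.printf("candidate at offset %d: op=0x%02x path=%s%n",
                            i, op, new String(b, i + 7, pathLen, "US-ASCII"));
                }
            }
        }
    }

Run against a copy of the edits file, this prints candidate offsets to
inspect in a hex editor; it will produce false positives, so each hit still
needs the manual check described above.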
> (BTW, what if I didn't want to keep my recent edits, and just wanted to
> start up the namenode? This is currently expensive downtime; I'd rather
> lose a small amount of data and be up and running than continue the
> downtime.)

If you really want to do this, you can remove "edits.new" and replace
"edits" with a file containing hex 0xffffffeeff, I believe (edits header
plus OP_INVALID).

-Todd
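A minimal sketch of the file Todd describes, assuming his bytes are exact
for this release: 0xFFFFFFEE would be the 4-byte edits-log layout version
(-18), followed by 0xFF for OP_INVALID. The bytes are version-specific and
the class name is hypothetical:

    import java.io.FileOutputStream;

    // Writes a minimal "edits" file: the 4-byte layout version header
    // (0xFFFFFFEE, i.e. -18, version-specific) followed by OP_INVALID
    // (0xFF), per the suggestion above. Back up the originals first.
    public class MinimalEditsFile {
        public static void main(String[] args) throws Exception {
            try (FileOutputStream out = new FileOutputStream(args[0])) {
                out.write(new byte[] {
                    (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xEE, // header
                    (byte) 0xFF                                         // OP_INVALID
                });
            }
        }
    }

Note that dropping "edits" and "edits.new" this way discards every
namespace change since the last checkpoint, so it trades data for uptime,
exactly as the question anticipates.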
On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:

> Hi Matt,
>
> If you want to keep your recent edits, you'll have to place an 0xFF at
> the beginning of the most recent edit entry in the edit log. It's a bit
> tough to find these boundaries, but you can try applying this patch and
> rebuilding:
>
> https://issues.apache.org/jira/browse/HDFS-1378
>
> This will tell you the offset of the broken entry ("recent opcodes") and
> you can put an 0xFF there to tie off the file before the corrupt entry.
>
> -Todd
>
> On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
>
>> The namenode on an otherwise very stable HDFS cluster crashed recently.
>> The filesystem filled up on the namenode, which I assume is what caused
>> the crash. The problem has been fixed, but I cannot get the namenode to
>> restart. I am using version CDH3b2 (hadoop-0.20.2+320).
>>
>> The error is this:
>>
>> 2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
>> 2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends
>> editing the edits file with a hex editor, but does not explain where the
>> record boundaries are. It is a different exception, but it seemed to
>> have a similar cause: the edits file. I tried removing a line at a time,
>> but the error continues, only with a smaller size and edits #:
>>
>> 2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
>> 2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> I tried removing the edits file altogether, but that failed with:
>> java.io.IOException: Edits file is not found
>>
>> I tried with a zero-length edits file, so it would at least have a file
>> there, but that results in an NPE:
>>
>> 2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
>> 2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>>
>> Most if not all of the files I noticed in the edits file are temporary
>> files that will be deleted once this thing gets back up and running
>> anyway. There is a closed ticket that might be related:
>> https://issues.apache.org/jira/browse/HDFS-686, but the version I'm
>> using already seems to include the HDFS-686 fix (according to
>> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).
>>
>> What do I have to do to get back up and running?
>>
>> Thank you for your help,
>>
>> Matthew
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
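For completeness, here is what the tie-off step from Todd's earlier message
might look like once HDFS-1378 (or a scan like the sketch above) has
reported the offset of the broken entry. The class name is made up, and it
should only ever be run against a copy of the edits file:

    import java.io.RandomAccessFile;

    // Stamps OP_INVALID (0xFF) at a given offset so the loader stops
    // before the corrupt entry. Usage: java TieOffEdits <edits-file> <offset>
    public class TieOffEdits {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile f = new RandomAccessFile(args[0], "rw")) {
                f.seek(Long.parseLong(args[1])); // offset of the first bad entry
                f.write(0xFF);                   // everything after is ignored
            }
        }
    }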