Re: NameNode crash - cannot start dfs - need help

Ayon Sinha Tue, 05 Oct 2010 10:18:08 -0700

Hi Matthew,
"(BTW, what if I didn't want to keep my recent edits, and just wanted to start 
up the namenode? This is currently expensive downtime; I'd rather lose a small
amount of data and be up and running than continue the down time)."
This was exactly my use-case as well. I chose small data loss over spending 
hours on end trying to get past the exceptions. 
Try this:
1. Rename the 4 files under /mnt/name/current to something like *.corrupt.
2. Copy over the 4 files from /mnt/namesecondarynode/current. Make sure you
   have enough space on the namenode box.
3. Try starting the namenode.
It worked for me. I was at the same place as you only a week ago.
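
The steps above can be scripted. This is only a minimal sketch, assuming the
namenode is stopped and the two directories are the ones named in this
message. The function name and the ".corrupt" suffix handling are illustrative
and not part of Hadoop, and the sketch moves every regular file it finds
rather than checking for exactly 4.

```python
import shutil
from pathlib import Path


def swap_in_secondary_copy(name_dir, secondary_dir, suffix=".corrupt"):
    """Set aside the files in name_dir and copy in those from secondary_dir.

    Hypothetical helper, not a Hadoop tool. Returns the copied file names.
    """
    name_dir = Path(name_dir)
    secondary_dir = Path(secondary_dir)

    # Keep the damaged files under a new name instead of deleting them.
    for path in sorted(name_dir.iterdir()):
        if path.is_file() and not path.name.endswith(suffix):
            path.rename(path.with_name(path.name + suffix))

    # Copy the checkpoint files from the secondary namenode directory.
    copied = []
    for path in sorted(secondary_dir.iterdir()):
        if path.is_file():
            shutil.copy2(path, name_dir / path.name)
            copied.append(path.name)
    return copied
```

It would be called as swap_in_secondary_copy("/mnt/name/current",
"/mnt/namesecondarynode/current"). Renaming rather than deleting keeps the
old files around in case the copy turns out to be worse.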
 -Ayon





________________________________
From: Matthew LeMieux <m...@mlogiciels.com>
To: hdfs-user@hadoop.apache.org
Sent: Tue, October 5, 2010 9:58:53 AM
Subject: Re: NameNode crash - cannot start dfs - need help

Thank you Todd. 

It does indeed seem like a challenge to find a record boundary, but it can be
done. Here is how I did it, in case others are interested in doing the same.


It looks like that value (0xFF) is referenced as OP_INVALID in the source file: 
[hadoop-dist]/src//hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java. 

Every record begins with an op code that describes the record. The op codes are
in the range [0,14] (inclusive), except for OP_INVALID. Each record type (based
on op code) appears to have a different format. Additionally, it seems that the
code for each record type has several code paths to support different versions
of the hdfs.

I looked in the error messages, and found the line number of the exception
within the switch statement in the code (in this case, line 563). That told me
that I was looking for an op code of either 0x00 or 0x09. I noticed that this
particular code path had a record type that looked like this, written as
[# bytes: name]:

[1:op code][4:int length][2:file system path length][?:file system path text]

All I had to do was find a filesystem path, and look 7 bytes before it started.
If the byte there was a 0x00 or 0x09, then this was a candidate record.
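
The search described above (find a path, then look 7 bytes back for the op
code) could be sketched roughly as follows. This assumes the record layout
given in this message, which applies only to this code path and version. The
function name and the path prefix passed in are illustrative, not anything
from the Hadoop source.

```python
def find_candidate_records(data, path_prefix, opcodes=(0x00, 0x09)):
    """Return offsets of bytes that could be the op code of a record.

    Assumes the layout from the message:
    [1: op code][4: int length][2: path length][?: path text]
    so the op code sits 7 bytes before the first byte of the path.
    """
    header_len = 1 + 4 + 2
    candidates = []
    pos = data.find(path_prefix)
    while pos != -1:
        op_offset = pos - header_len
        # Only keep hits where the byte 7 positions back is a wanted op code.
        if op_offset >= 0 and data[op_offset] in opcodes:
            candidates.append(op_offset)
        pos = data.find(path_prefix, pos + 1)
    return candidates
```

With the edits file read as bytes, a call such as
find_candidate_records(data, b"/tmp/") returns offsets to inspect in a hex
editor. These are only candidates: the same byte pattern can occur inside
other data, so each one still needs checking by eye.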

It would have been easier to just search for something from the error message
(i.e. "12862" for me) to find candidate records, but in my case that was in
almost every record. Additionally, it would have also been easier to just
search for instances of the op code, but in my case one of the op codes (0x00)
appears too often in the data to make that useful. If your op code is 0x03 for
example, you will probably have a much easier time of it than I did.
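
To judge whether searching for the op code byte directly is practical, one
can simply count how often each value occurs in the file. A minimal sketch,
with a made-up function name:

```python
def opcode_byte_counts(data, opcodes=(0x00, 0x09)):
    """Count how often each op code value occurs as a raw byte in data."""
    return {op: data.count(bytes([op])) for op in opcodes}
```

A count in the thousands for a file with under a thousand edits suggests the
value is too common to search for on its own.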

I was able to successfully and quickly find record boundaries and replace the
op code with 0xff. After a few records I was back to the NPE that I was
getting with a zero length edits file:

2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0 seconds.
2010-10-05 16:47:39,671 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
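
The tie-off step mentioned before this log (overwriting a record's op code
with 0xFF) can also be done without a hex editor. The following is a sketch
under assumptions: the offset has already been confirmed as a record
boundary, the namenode is stopped, and the function name and backup suffix
are made up for illustration.

```python
import shutil

OP_INVALID = 0xFF  # value named OP_INVALID in FSEditLog.java, per this message


def tie_off_edits(edits_path, offset, backup_suffix=".bak"):
    """Overwrite the byte at offset with OP_INVALID, keeping a backup copy.

    Hypothetical helper. Returns the byte value that was replaced.
    """
    # Always keep an untouched copy before editing the file in place.
    shutil.copy2(edits_path, str(edits_path) + backup_suffix)
    with open(edits_path, "r+b") as f:
        f.seek(offset)
        old = f.read(1)
        if len(old) != 1:
            raise ValueError("offset is past the end of the file")
        f.seek(offset)
        f.write(bytes([OP_INVALID]))
    return old[0]
```

The returned value makes it easy to confirm that the byte replaced really was
the expected op code (0x00 or 0x09 in my case) before starting the namenode.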


One hurdle down, how do I get past the next one?

(BTW, what if I didn't want to keep my recent edits, and just wanted to start
up the namenode? This is currently expensive downtime; I'd rather lose a small
amount of data and be up and running than continue the down time).

Thank you for your help, 

Matthew

On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:

>Hi Matt,
>
>If you want to keep your recent edits, you'll have to place an 0xFF at the
>beginning of the most recent edit entry in the edit log. It's a bit tough to
>find these boundaries, but you can try applying this patch and rebuilding:
>
>https://issues.apache.org/jira/browse/hdfs-1378
>
>This will tell you the offset of the broken entry ("recent opcodes") and you
>can put an 0xff there to tie off the file before the corrupt entry.
>
>
>-Todd
>
>
>
>
>On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
>
>>The namenode on an otherwise very stable HDFS cluster crashed recently. The
>>filesystem filled up on the name node, which I assume is what caused the
>>crash. The problem has been fixed, but I cannot get the namenode to restart.
>>I am using version CDH3b2 (hadoop-0.20.2+320).
>>
>>
>>The error is this: 
>>
>>
>>2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
>>2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>        at java.lang.Long.parseLong(Long.java:419)
>>        at java.lang.Long.parseLong(Long.java:468)
>>        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>        ...
>>
>>
>>This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing
>>the edits file with a hex editor, but does not explain where the record
>>boundaries are. The exception described there is different, but the cause
>>seemed similar: the edits file. I tried removing a line at a time, but the
>>error continues, only with a smaller size and edits #:
>>
>>
>>2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
>>2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>        at java.lang.Long.parseLong(Long.java:419)
>>        at java.lang.Long.parseLong(Long.java:468)
>>        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>        ...
>>
>>
>>I tried removing the edits file altogether, but that failed with:
>>java.io.IOException: Edits file is not found
>>
>>I tried with a zero length edits file, so it would at least have a file
>>there, but that results in an NPE:
>>
>>
>>2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
>>2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>>        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>>        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>>        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>>
>>
>>
>>
>>
>>Most if not all the files I noticed in the edits file are temporary files
>>that will be deleted once this thing gets back up and running anyway. There
>>is a closed ticket that might be related:
>>https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm using
>>seems to already have HDFS-686 (according to
>>http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).
>>
>>
>>What do I have to do to get back up and running?
>>
>>
>>Thank you for your help, 
>>
>>Matthew
>>
>>
>>
>>
>
>
>-- 
>Todd Lipcon
>Software Engineer, Cloudera
>
