Thank you Ayon, Allen and Todd for your suggestions. I was tempted to try to find the offending records in edits.new, but opted for simply moving the file instead. I kept the recently edited edits file in place.
The namenode started up this time with no exceptions and appears to be running well; hadoop fsck / reports a healthy filesystem.

Thank you,

Matthew

On Oct 5, 2010, at 10:09 AM, Todd Lipcon wrote:

> On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
> Thank you Todd.
>
> It does indeed seem like a challenge to find a record boundary, but if I wanted to do it... here is how I did it, in case others are interested in doing the same.
>
> It looks like that value (0xFF) is referenced as OP_INVALID in the source file:
> [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
>
> Every record begins with an op code that describes the record. The op codes are in the range [0,14] (inclusive), except for OP_INVALID. Each record type (based on op code) appears to have a different format. Additionally, it seems that the code for each record type has several code paths to support different versions of HDFS.
>
> I looked in the error messages and found the line number of the exception within the switch statement in the code (in this case, line 563). That told me that I was looking for an op code of either 0x00 or 0x09. I noticed that this particular code path had a record type that looked like this ([# bytes: name]):
>
> [1: op code][4: int length][2: file system path length][?: file system path text]
>
> All I had to do was find a filesystem path and look 7 bytes before it started. If the op code was a 0x00 or 0x09, then this was a candidate record.
>
> It would have been easier to just search for something from the error message (i.e. "12862" for me) to find candidate records, but in my case that was in almost every record. Additionally, it would also have been easier to just search for instances of the op code, but in my case one of the op codes (0x00) appears too often in the data to make that useful.
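The scan described above can be sketched in a few lines of Python. This is a rough sketch, not a recovery tool: it assumes the simplified record layout quoted above (1-byte op code, 4-byte length, 2-byte big-endian path length, then the path text), and `PATH_PREFIX` is a hypothetical placeholder for a path you know appears in your own log.

```python
# Sketch of the boundary scan from the thread: find a known filesystem path
# prefix, then check whether the byte 7 positions earlier (1 op code +
# 4-byte length + 2-byte path length) is one of the candidate op codes.
import struct

PATH_PREFIX = b"/user/"           # hypothetical: a path known to be in the log
CANDIDATE_OPCODES = {0x00, 0x09}  # from the line-563 code path discussed above

def find_candidate_records(data):
    """Yield (offset_of_opcode, opcode) for positions that look like records."""
    pos = data.find(PATH_PREFIX)
    while pos != -1:
        if pos >= 7:
            opcode = data[pos - 7]
            # 2-byte big-endian length, as written by Java's writeShort
            (path_len,) = struct.unpack(">H", data[pos - 2:pos])
            # sanity check: stored length must at least cover the prefix
            if opcode in CANDIDATE_OPCODES and path_len >= len(PATH_PREFIX):
                yield pos - 7, opcode
        pos = data.find(PATH_PREFIX, pos + 1)

# Against a real log you would read /mnt/name/current/edits; here a tiny
# synthetic record demonstrates the scan:
sample = b"\x09\x00\x00\x00\x0d\x00\x06/user/"
print(list(find_candidate_records(sample)))  # → [(0, 9)]
```

A match only marks a candidate boundary; as the thread notes, false positives are likely when the op code byte (especially 0x00) also occurs inside record payloads, so each hit still needs to be eyeballed in a hex editor.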
> If your op code is 0x03, for example, you will probably have a much easier time of it than I did.
>
> I was able to successfully and quickly find record boundaries and replace the op code with 0xFF. After a few records I was back to the NPE that I was getting with a zero-length edits file:
>
> 2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0 seconds.
> 2010-10-05 16:47:39,671 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)
>
> One hurdle down; how do I get past the next one?
>
> It's unclear whether you're getting the error in "edits" or "edits.new". From the above, I'm guessing maybe "edits" is corrupt, so when you fixed the error there (by truncating a few edits from the end), the later edits in edits.new failed, because they depended on a path that should have been created by "edits".
>
> (BTW, what if I didn't want to keep my recent edits, and just wanted to start up the namenode?
> This is currently expensive downtime; I'd rather lose a small amount of data and be up and running than continue the downtime.)
>
> If you really want to do this, you can remove "edits.new" and replace "edits" with a file containing hex 0xffffffeeff, I believe (edits header plus OP_INVALID).
>
> -Todd
>
> On Oct 5, 2010, at 8:42 AM, Todd Lipcon wrote:
>
>> Hi Matt,
>>
>> If you want to keep your recent edits, you'll have to place an 0xFF at the beginning of the most recent edit entry in the edit log. It's a bit tough to find these boundaries, but you can try applying this patch and rebuilding:
>>
>> https://issues.apache.org/jira/browse/hdfs-1378
>>
>> This will tell you the offset of the broken entry ("recent opcodes") and you can put an 0xFF there to tie off the file before the corrupt entry.
>>
>> -Todd
>>
>>
>> On Tue, Oct 5, 2010 at 8:16 AM, Matthew LeMieux <m...@mlogiciels.com> wrote:
>> The namenode on an otherwise very stable HDFS cluster crashed recently. The filesystem filled up on the namenode, which I assume is what caused the crash. The problem has been fixed, but I cannot get the namenode to restart. I am using version CDH3b2 (hadoop-0.20.2+320).
>>
>> The error is this:
>>
>> 2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
>> 2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. It is a different exception, but it seemed like a similar cause: the edits file. I tried removing a line at a time, but the error continues, only with a smaller size and edits #:
>>
>> 2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
>> 2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "128...@^@^...@^@^...@^@^...@^@"
>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>         at java.lang.Long.parseLong(Long.java:419)
>>         at java.lang.Long.parseLong(Long.java:468)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
>>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
>>         ...
>>
>> I tried removing the edits file altogether, but that failed with:
>> java.io.IOException: Edits file is not found
>>
>> I tried with a zero-length edits file, so it would at least have a file there, but that results in an NPE:
>>
>> 2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
>> 2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
>>
>> Most, if not all, of the files I noticed in the edits file are temporary files that will be deleted once this thing gets back up and running anyway. There is a closed ticket that might be related: https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm using seems to already have HDFS-686 (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).
>>
>> What do I have to do to get back up and running?
>>
>> Thank you for your help,
>>
>> Matthew
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
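For reference, Todd's suggested "throw away the recent edits" replacement file (the bytes 0xffffffeeff he gives above: a 4-byte edits header followed by a single OP_INVALID byte) can be produced with a short script. This is a sketch of his suggestion only; 0xFFFFFFEE is -18 as a big-endian int, which I take to be the edit log layout version for this 0.20-era release, so verify the header value against your own cluster before relying on it.

```python
# Sketch of Todd's suggestion: an "empty" edits file consisting of the
# 4-byte layout version header (0xFFFFFFEE, i.e. -18; assumed to match
# this 0.20-era HDFS) followed by a single OP_INVALID (0xFF) byte.
EMPTY_EDITS = bytes([0xFF, 0xFF, 0xFF, 0xEE, 0xFF])

# In a real recovery this path would be the namenode's current/ directory,
# e.g. /mnt/name/current/edits; "edits" here is a local placeholder.
with open("edits", "wb") as f:
    f.write(EMPTY_EDITS)
```

As the thread shows, a truly zero-length edits file triggers an NPE at load time, so the header-plus-OP_INVALID form is the minimal file the namenode will accept; back up the original `edits` and `edits.new` before replacing anything.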