Sadly, I am not writing it to multiple files. I will be from now on. Do you have a link to information on best practices in this regard?
The upside is, the jobs I was running were all expendable, so I can afford to lose what was written out. Removing the edits file should only impact data I was writing, correct?

Thank you,
Allen

-----Original Message-----
From: Allen Wittenauer [mailto:[email protected]]
Sent: Monday, October 05, 2009 14:52
To: [email protected]; [email protected]
Subject: Re: Recovering Corrupt FS Image on Amazon EBS

On 10/5/09 11:41 AM, "Malcolm Matalka" <[email protected]> wrote:
> In the event of an error, we bring all the instances down. I then tried
> to rerun the job (bringing all the instances back up and then attaching
> to EBS volumes) and the namenode will not come up. The logfile gives
> the error at the bottom. What are my options here to recover the file
> system?

Your edits file is corrupt. You have some choices:

A) If you ran a secondary namenode and ran it frequently, hacking the edits off at the point of corruption will set the HDFS pretty close to the state at the last run.

B) If you didn't run the secondary that often, or you don't make that many changes, you may just want to ignore the edits file and bring up the HDFS without it.

C) Check your other directory--you -are- writing fsimage and edits to two different dirs, right? The other edits file may be healthier.

But I suspect you're looking at data loss. :(

> 2009-10-05 14:20:07,451 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NumberFormatException: For input string: ""
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>     at java.lang.Integer.parseInt(Integer.java:468)
>     at java.lang.Short.parseShort(Short.java:120)
>     at java.lang.Short.parseShort(Short.java:78)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readShort(FSEditLog.java:1261)
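On the best-practices question above: option C works only if the namenode was already configured to write its metadata to more than one directory. In this Hadoop generation that is the dfs.name.dir property in hdfs-site.xml, which takes a comma-separated list; a rough sketch, with example paths (ideally one local disk plus one independent volume such as EBS or NFS):

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- comma-separated list; each directory gets a full copy of
       fsimage and edits. Paths here are examples only. -->
  <value>/mnt/local-disk/hdfs/name,/mnt/ebs-volume/hdfs/name</value>
</property>
```

With two (or more) directories listed, a corrupt edits file in one location can often be recovered from the healthier copy in the other.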
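For option A, "hacking the edits off" amounts to truncating the edits file just before the corrupt tail. The sketch below only illustrates the mechanics on a throwaway file: the real byte offset has to be found by inspecting your actual edits file (e.g. with od -c), the paths are stand-ins, and you should copy the entire name directory somewhere safe before touching anything.

```shell
# Illustration of option A on a dummy file, NOT a real edits log.
NAME_DIR=$(mktemp -d)                        # stand-in for dfs.name.dir/current
printf 'GOODDATAGOODDATA_CORRUPT_TAIL' > "$NAME_DIR/edits"
cp "$NAME_DIR/edits" "$NAME_DIR/edits.bak"   # always preserve the original
truncate -s 16 "$NAME_DIR/edits"             # 16 = bytes of known-good data
cat "$NAME_DIR/edits"                        # -> GOODDATAGOODDATA
```

Because a truncation at the wrong offset makes things worse, work only on a copy and let the namenode replay the truncated log against the backup first.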
