Hi,

I've hit a namenode corruption issue on one of my operational clusters and am 
unable to start the namenode process.  I have a potential recovery strategy but 
as the consequences of screwing up are quite severe I thought I'd get some more 
opinions before doing anything drastic.

The issue and proposed recovery are described below.  Could people please let 
me know whether this has been seen before, whether my recovery strategy looks 
sane or there is a flaw in my understanding or logic, and whether you can think 
of a better solution?

I'm using Hadoop 0.20.10-Yahoo.

The corruption also affects the previous checkpoint produced by the secondary 
namenode, so reverting to that checkpoint isn't an option.

The problem shows up when I try to start up the namenode process.  It fails 
with the following stack trace:

java.io.IOException: saveLeases found path <path removed> but is not under construction.
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:4800)
  at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1029)
  at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1050)
  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:312)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:293)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:958)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:967)

I've looked through the source code to try and get my head around what happens 
when the name node is started and have come up with the following.
- As part of FSNamesystem.initialize the fsimage file is loaded from disk and 
used to build the in-memory copy used by the name node.
- The edits file is also loaded and all outstanding transactions are applied to 
the data structures.
- The updated data structures are then written to disk to create an up to date 
fsimage file.
- As part of writing the image to disk a check is made against outstanding 
client file leases.  It is this check that is throwing the exception: a lease 
was added during the image load, but the associated file is not marked as 
under construction.
- Investigating the structure of the fsimage file I believe it is made up of 3 
main sections: 1) header, 2) inode table (for want of a better term) and 3) 
file construction table.
- It is the file construction table that is used to persist file leases and 
therefore populate the lease manager at start up.
- A message logged before the exception shows that I have 1 file under 
construction.
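
To make the failing check concrete, here is a minimal toy model in Java.  It is 
not the Hadoop source: the class ToyINode, the method checkLeases and the 
example paths are all made up for illustration, and only the wording of the 
"not under construction" message is taken from the stack trace above.  It just 
shows the rule as I understand it: every path that holds a lease must refer to 
a file that is still flagged as under construction.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these classes are illustrative and are not Hadoop's own.
class LeaseCheckSketch {

    // Stand-in for an inode; the only state tracked is the construction flag.
    static final class ToyINode {
        final boolean underConstruction;

        ToyINode(boolean underConstruction) {
            this.underConstruction = underConstruction;
        }
    }

    // Mimics the consistency check made while the image is written back:
    // each leased path must map to a file that is still under construction.
    static void checkLeases(Map<String, ToyINode> namespace,
                            List<String> leasedPaths) throws IOException {
        for (String path : leasedPaths) {
            ToyINode node = namespace.get(path);
            if (node == null) {
                // Hypothetical message for the toy model.
                throw new IOException("lease on " + path + " has no namespace entry");
            }
            if (!node.underConstruction) {
                throw new IOException("saveLeases found path " + path
                        + " but is not under construction.");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, ToyINode> namespace = new HashMap<>();
        // Hypothetical paths: one finished file, one still being written.
        namespace.put("/data/closed-file", new ToyINode(false));
        namespace.put("/data/open-file", new ToyINode(true));

        try {
            // The lease on the finished file is the inconsistent one.
            checkLeases(namespace, Arrays.asList("/data/open-file", "/data/closed-file"));
            System.out.println("leases are consistent");
        } catch (IOException e) {
            System.out.println("check failed: " + e.getMessage());
        }
    }
}
```

In this model the save fails as soon as one leased path points at a file 
without the flag, which matches the symptom of a single file under 
construction blocking the whole startup.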

My recovery plan is basically to modify the fsimage load process to skip the 
files under construction part of the file (losing this one file is acceptable). 
Specifically, this involves modifying 
org/apache/hadoop/hdfs/server/namenode/FSImage.java to comment out line 959 
(the call to loadFilesUnderConstruction).  The modified code would then be 
built into a new deployment used only to recover the fsimage file: start up 
the namenode, let it process the file and write it back to disk, then shut 
down the process.

I have tried this on a test system without any data nodes attached, and the 
modified Hadoop build seems to have recovered the fsimage file (checked using 
various ls and du commands).  I haven't copied the recovered files back to the 
operational cluster yet.  Once I'm happy that the recovered file is the best 
solution available, I'll copy it across for use on the operational cluster 
(which will keep running the original version of Hadoop).  Does this make 
sense?  Is it sane?
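
The intended effect of the plan can be sketched with another toy model in 
Java.  Again this is not Hadoop code: ToyImage, loadAndResave and the 
skipLeases flag are hypothetical names, and the real image format is far 
richer than two lists of paths.  The sketch only shows what I expect from 
skipping the final section: the rewritten image keeps the whole namespace but 
carries no leases, so there is nothing left for the save-time check to trip 
over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: the image layout and names are illustrative, not Hadoop's.
class SkipLeasesSketch {

    // A toy "image": the namespace paths plus the paths that hold a lease.
    static final class ToyImage {
        final List<String> inodePaths;
        final List<String> leasedPaths;

        ToyImage(List<String> inodePaths, List<String> leasedPaths) {
            this.inodePaths = inodePaths;
            this.leasedPaths = leasedPaths;
        }
    }

    // Load the image and write it back. With skipLeases set, the lease
    // section is ignored, so the result has the same namespace and no leases.
    static ToyImage loadAndResave(ToyImage onDisk, boolean skipLeases) {
        List<String> inodes = new ArrayList<>(onDisk.inodePaths);
        List<String> leases = skipLeases
                ? Collections.<String>emptyList()
                : new ArrayList<>(onDisk.leasedPaths);
        return new ToyImage(inodes, leases);
    }

    public static void main(String[] args) {
        // Hypothetical paths; the second one holds the problematic lease.
        ToyImage corrupt = new ToyImage(
                Arrays.asList("/data/a", "/data/b"),
                Arrays.asList("/data/b"));

        ToyImage recovered = loadAndResave(corrupt, true);
        System.out.println("paths kept:  " + recovered.inodePaths.size());
        System.out.println("leases kept: " + recovered.leasedPaths.size());
    }
}
```

Note that in this model the affected path stays in the namespace after the 
rewrite; the sketch says nothing about whether the real file's contents would 
still be usable, which is why I am treating that one file as lost.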

At the moment my priority is recovery, but I'll be investigating the cause once 
the cluster is back up.

Any thoughts?

Thanks,
Jon


Jonathan Allen
UKGP, NS&R, Defence and Security
HP Enterprise Services
Telephone +44 1684 291206
Email jonathan.allen...@hp.com
Street address, HP Enterprise Services UK Ltd, Alexandra Way, Ashchurch 
Business Park, Tewkesbury, Gloucestershire. GL20 8NB
