Thanks guys, I did have a secondary namenode running, so I was able to recover
from the last checkpoint as Todd had suggested.
 
Cheers Arv

________________________________

From: Jakob Homan [mailto:[email protected]]
Sent: Mon 20/07/2009 4:53 PM
To: [email protected]
Subject: Re: Recovery following disk full



The oiv handles the fsimage file but not the edits log, so it wouldn't
help in this case.  There has been talk about writing a similar tool for
the edits log but nothing has been decided.
Also, while the oiv will be included in 0.21, it works on images back to
0.18 (and maybe earlier).  It's standalone, so it doesn't need a cluster
or anything, just the fsimage file.
Option c will be very tricky and earns its place as the last-ditch effort.
-jg

Tom White wrote:
> Is this an area where the Offline Image Viewer might be able to help
> in the future? It's not available for 0.18.3, but seems like it would
> be possible to extend it as a tool to help with c) in Todd's
> description.
>
> Tom
>
> On Mon, Jul 20, 2009 at 8:30 PM, Todd Lipcon<[email protected]> wrote:
>> Hi Arv,
>>
>> It sounds like your edits log in dfs.name.dir is corrupted since one of its
>> records got cut off by the disk filling up. When trying to replay the edit
>> log, it tries to read the entirety of that record and hits the end of file
>> unexpectedly - hence the EOFException.
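
A minimal illustration of how a truncated record produces that exception:
DataInputStream.readFully insists on filling its buffer and throws
EOFException when the stream ends first. This is self-contained Java for
illustration only, not actual namenode code.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedRecordDemo {
    public static void main(String[] args) throws IOException {
        // Pretend an edit-log record should be 8 bytes, but the disk filled
        // up after only 5 bytes were written.
        byte[] partialRecord = new byte[] {1, 2, 3, 4, 5};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(partialRecord));
        byte[] buf = new byte[8];
        try {
            in.readFully(buf);   // demands all 8 bytes
        } catch (EOFException e) {
            System.out.println("Unexpected end of file, as in the namenode log: " + e);
        }
    }
}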
>>
>> Your options at this point are:
>>
>> a) If you have a second copy of dfs.name.dir, it should also have a second
>> "edits" file. If it's longer, it's possible that copy is not corrupted.
>> I'd back up both copies, then duplicate the longer edit log into both name
>> dirs and try to start the namenode (see the first sketch after these options).
>>
>> b) If you were running a secondary namenode, you should have a checkpoint of
>> the fsimage from a few hours before the failure. You can recover the fsimage
>> from there. You'll lose some time period's worth of metadata edits, but you
>> should be able to get the FS running again.
>>
>> c) A last-ditch option is to truncate the edit log at the correct
>> offset such that you avoid the EOFException. To do this would probably
>> involve adding some logging statements to the FSEditLog replay so you can
>> see what the byte offset of the last record it's trying to read is, and then
>> truncating the edit log right before that offset (see the second sketch
>> below). This is somewhat complicated and I wouldn't attempt it unless you
>> (a) really need the data and (b) don't have any other option.
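
A minimal sketch of the comparison and copy in option (a), assuming two
hypothetical dfs.name.dir entries (/data/1/dfs/name and /data/2/dfs/name are
made-up paths; substitute your own, and work on backups):

import java.io.*;

public class CopyLongerEdits {
    public static void main(String[] args) throws IOException {
        // Hypothetical locations -- use the entries from your dfs.name.dir.
        // In 0.18 the edit log normally lives under <name.dir>/current/edits.
        File editsA = new File("/data/1/dfs/name/current/edits");
        File editsB = new File("/data/2/dfs/name/current/edits");

        File longer  = editsA.length() >= editsB.length() ? editsA : editsB;
        File shorter = (longer == editsA) ? editsB : editsA;
        System.out.println("Longer edits file: " + longer + " (" + longer.length() + " bytes)");

        // Back up the shorter copy, then overwrite it with the longer one.
        copy(shorter, new File(shorter.getPath() + ".bak"));
        copy(longer, shorter);
    }

    private static void copy(File src, File dst) throws IOException {
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dst);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}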
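And a sketch of the truncation step in option (c), once the byte offset of
the last complete record is known (the offset still has to come from the
extra logging Todd describes; run this only against a copy of the edits file):

import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncateEdits {
    public static void main(String[] args) throws IOException {
        String editsPath = args[0];            // path to a COPY of the corrupted edits file
        long offset = Long.parseLong(args[1]); // byte offset just before the partial record

        RandomAccessFile raf = new RandomAccessFile(editsPath, "rw");
        try {
            raf.setLength(offset); // drop everything written after the disk filled up
        } finally {
            raf.close();
        }
    }
}

Finding the right offset is the hard part; the truncation itself is trivial.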
>>
>> -Todd
>>
>> On Mon, Jul 20, 2009 at 12:27 PM, Arv Mistry <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm getting the following error in starting up the namenode.
>>>
>>> What happened was one of our disks filled up, we reclaimed the
>>> disk space and tried to restart the hadoop daemons but the name node
>>> is now not starting up.
>>>
>>> Does anybody have any clues how to recover from this? I've tried
>>> searching through the Jira reports but nothing obvious.
>>>
>>> Appreciate any input, thanks.
>>>
>>> Cheers Arv
>>>
>>> 2009-07-20 14:57:41,712 INFO org.apache.hadoop.dfs.NameNode:
>>> STARTUP_MSG:
>>> /************************************************************
>>> STARTUP_MSG: Starting NameNode
>>> STARTUP_MSG:   host = qa-cs1/192.168.0.54
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 0.18.3-dev
>>> STARTUP_MSG:   build =  -r ; compiled by 'bamboo' on Mon Nov 10 15:58:40
>>> PST 2008
>>> ************************************************************/
>>> 2009-07-20 14:57:41,801 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
>>> Initializing RPC Metrics with hostName=NameNode, port=9000
>>> 2009-07-20 14:57:41,805 INFO org.apache.hadoop.dfs.NameNode: Namenode up
>>> at: 192.168.0.54/192.168.0.54:9000
>>> 2009-07-20 14:57:41,808
>>> INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>>> Initializing JVM Metrics with processName=NameNode, sessionId=null
>>> 2009-07-20 14:57:41,816 INFO org.apache.hadoop.dfs.NameNodeMetrics:
>>> Initializing NameNodeMeterics using context
>>> object:org.apache.hadoop.metrics.spi.NullContext
>>> 2009-07-20 14:57:41,869 INFO org.apache.hadoop.fs.FSNamesystem:
>>> fsOwner=hadoopadmin,hadoopadmin
>>> 2009-07-20 14:57:41,869 INFO org.apache.hadoop.fs.FSNamesystem:
>>> supergroup=supergroup
>>> 2009-07-20 14:57:41,869 INFO org.apache.hadoop.fs.FSNamesystem:
>>> isPermissionEnabled=true
>>> 2009-07-20 14:57:41,877 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
>>> Initializing FSNamesystemMeterics using context
>>> object:org.apache.hadoop.metrics.spi.NullContext
>>> 2009-07-20 14:57:41,878 INFO org.apache.hadoop.fs.FSNamesystem:
>>> Registered FSNamesystemStatusMBean
>>> 2009-07-20 14:57:41,908 INFO org.apache.hadoop.dfs.Storage: Number of
>>> files = 1808
>>> 2009-07-20 14:57:42,153 INFO org.apache.hadoop.dfs.Storage: Number of
>>> files under construction = 1
>>> 2009-07-20 14:57:42,157 INFO org.apache.hadoop.dfs.Storage: Image file
>>> of size 256399 loaded in 0 seconds.
>>> 2009-07-20 14:57:42,167 ERROR
>>> org.apache.hadoop.dfs.LeaseManager:
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605290.data
>>> not found in lease.paths
>>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605294.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605298.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>>> 2009-07-20 14:57:42,167 ERROR
>>> org.apache.hadoop.dfs.LeaseManager:
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605294.data
>>> not found in lease.paths
>>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605298.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>>> 2009-07-20 14:57:42,169 ERROR
>>> org.apache.hadoop.dfs.LeaseManager:
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605298.data
>>> not found in lease.paths
>>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605290.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605294.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>>> 2009-07-20 14:57:42,169 ERROR
>>> org.apache.hadoop.dfs.LeaseManager:
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605303.data
>>> not found in lease.paths
>>> (=[/opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605328.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605335.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605337.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605340.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605346.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_170000_1248113605401.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605290.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605294.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605432.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605451.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605464.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605487.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605499.data,
>>> /opt/hadoop/data/disk1/cs/raw/20090720/cs_2_20090720_180000_1248113605539.data])
>>> 2009-07-20 14:57:42,171 ERROR org.apache.hadoop.fs.FSNamesystem:
>>> FSNamesystem initialization failed.
>>> java.io.EOFException
>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>        at org.apache.hadoop.dfs.FSImage.readString(FSImage.java:1368)
>>>        at
>>> org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:447)
>>>        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846)
>>>        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675)
>>>        at
>>> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289)
>>>        at
>>> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>>>        at
>>> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:296)
>>>        at
>>> org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:275)
>>>
>>>
>


