Thanks, I agree I need to upgrade :)

I was able to recover the NN following your suggestions; an additional
hack was to sync the namespaceID on the data nodes with the namenode's.
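
For anyone who hits the same thing: the namespaceID lives in the VERSION
file under each storage directory. A rough sketch of the sync, assuming a
/data/dfs layout (substitute your actual dfs.name.dir and dfs.data.dir
paths; <NN_ID> is a placeholder):

  # On the namenode: note its namespaceID
  grep namespaceID /data/dfs/name/current/VERSION

  # On each datanode: stop the datanode, back up the file, then set its
  # namespaceID to the namenode's value
  cp /data/dfs/data/current/VERSION /data/dfs/data/current/VERSION.bak
  sed -i 's/^namespaceID=.*/namespaceID=<NN_ID>/' /data/dfs/data/current/VERSION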

On May 14, 2012, at 11:48 AM, Harsh J <ha...@cloudera.com> wrote:

> True, I don't recall 0.20.2 (the original release from a few years
> back) carrying these fixes. You ought to upgrade that cluster to the
> current stable release for the many fixes you'd benefit from :)
>
> On Mon, May 14, 2012 at 11:58 PM, Prashant Kommireddi
> <prash1...@gmail.com> wrote:
>> Thanks Harsh. I am using 0.20.2; I see on the JIRA that this issue was
>> fixed in 0.23?
>>
>> I will try out your suggestions and get back.
>>
>> On May 14, 2012, at 1:22 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Your fsimage seems to have gone bad (is it 0-sized? I recall that
>>> being a known issue, long since fixed).
>>>
>>> The easiest way is to fall back to the last available good checkpoint
>>> (from the SNN). Or, if you have multiple dfs.name.dirs, see if some of
>>> the other directories have better/complete files on them, and re-spread
>>> those across the rest after testing them out (and backing up the
>>> originals).
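>>>
>>> Roughly, and only as a sketch (paths here are examples; use your actual
>>> fs.checkpoint.dir and dfs.name.dir values, and back everything up first):
>>>
>>> # keep a copy of the damaged name dir before touching anything
>>> cp -r /data/dfs/name /data/dfs/name.bak
>>>
>>> # option 1: have the NN load the SNN's last checkpoint
>>> # (IIRC this wants an empty or invalid dfs.name.dir to import into)
>>> hadoop namenode -importCheckpoint
>>>
>>> # option 2: compare the image copies across your dfs.name.dirs
>>> md5sum /data1/dfs/name/current/fsimage /data2/dfs/name/current/fsimage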
>>>
>>> Though, what version are you running? Because AFAIK most of the recent
>>> stable versions/distros include NN resource-monitoring threads, which
>>> should have placed your NN into safemode the moment all its disks ran
>>> nearly out of space.
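>>>
>>> For example, you can confirm the NN's safemode state with:
>>>
>>> hadoop dfsadmin -safemode get
>>>
>>> and, if I remember the property right, the free-space threshold that
>>> trips the monitor is dfs.namenode.resource.du.reserved (in bytes) in
>>> hdfs-site.xml.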
>>>
>>> On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
>>> <prash1...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I am seeing an issue where the NameNode does not start due to an
>>>> EOFException. The disk was full; I have cleared space up, but I am
>>>> unable to get past this exception. Any ideas on how this can be
>>>> resolved?
>>>>
>>>> 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
>>>> 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false
>>>> 2012-05-14 10:10:44,023 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.file.FileContext
>>>> 2012-05-14 10:10:44,024 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
>>>> 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 205470
>>>> 2012-05-14 10:10:44,844 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
>>>> java.io.EOFException
>>>>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>>> 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server on 54310
>>>> 2012-05-14 10:10:44,845 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>>>>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>>>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>>>    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>>>
>>>> 2012-05-14 10:10:44,846 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>>> /************************************************************
>>>> SHUTDOWN_MSG: Shutting down NameNode at gridforce-1.internal.salesforce.com/10.0.201.159
>>>> ************************************************************/
>>>
>>>
>>>
>>> --
>>> Harsh J
>
>
>
> --
> Harsh J
