Thanks, I agree I need to upgrade :) I was able to recover the NN following your suggestions; an additional hack was to sync the namespaceID on the datanodes with the namenode's.
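The namespaceID sync mentioned above is usually done by hand-editing the VERSION file in each datanode's storage directory so it matches the namenode's value; a datanode whose namespaceID disagrees with the namenode will refuse to register, reporting "Incompatible namespaceIDs". A minimal sketch, assuming /data/dfs/name and /data/dfs/data stand in for your actual dfs.name.dir and dfs.data.dir, and that the namespaceID value shown is made up:

    # On the namenode host: read the cluster's namespaceID from the
    # restored name directory.
    grep namespaceID /data/dfs/name/current/VERSION
    # -> namespaceID=123456789

    # On each datanode, with the datanode process stopped: back up the
    # VERSION file, overwrite the stored namespaceID with the namenode's
    # value, then restart the datanode.
    cp /data/dfs/data/current/VERSION /data/dfs/data/current/VERSION.bak
    sed -i 's/^namespaceID=.*/namespaceID=123456789/' /data/dfs/data/current/VERSION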
On May 14, 2012, at 11:48 AM, Harsh J <ha...@cloudera.com> wrote:

> True, I don't recall 0.20.2 (the original release that was a few years
> ago) carrying these fixes. You ought to upgrade that cluster to the
> current stable release for the many fixes you can benefit from :)
>
> On Mon, May 14, 2012 at 11:58 PM, Prashant Kommireddi
> <prash1...@gmail.com> wrote:
>> Thanks Harsh. I am using 0.20.2, I see on the Jira this issue was
>> fixed for 0.23?
>>
>> I will try out your suggestions and get back.
>>
>> On May 14, 2012, at 1:22 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Your fsimage seems to have gone bad (is it 0-sized? I recall that as a
>>> known issue long since fixed).
>>>
>>> The easiest way is to fall back to the last available good checkpoint
>>> (from the SNN). Or if you have multiple dfs.name.dirs, see if some of
>>> the other locations have better/complete files on them, and re-spread
>>> them across after testing them out (and backing up the originals).
>>>
>>> Though what version are you running? Cause AFAIK most of the recent
>>> stable versions/distros include NN resource monitoring threads which
>>> should have placed your NN into safemode the moment all its disks ran
>>> near to out of space.
>>>
>>> On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
>>> <prash1...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I am seeing an issue where the Namenode does not start due to an
>>>> EOFException. The disk was full and I cleared up space, but I am
>>>> unable to get past this exception. Any ideas on how this can be
>>>> resolved?
>>>>
>>>> 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
>>>> 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false
>>>> 2012-05-14 10:10:44,023 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.file.FileContext
>>>> 2012-05-14 10:10:44,024 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
>>>> 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 205470
>>>> 2012-05-14 10:10:44,844 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
>>>> java.io.EOFException
>>>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>         at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>>> 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server on 54310
>>>> 2012-05-14 10:10:44,845 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>>>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>         at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>>>>
>>>> 2012-05-14 10:10:44,846 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>>> /************************************************************
>>>> SHUTDOWN_MSG: Shutting down NameNode at gridforce-1.internal.salesforce.com/10.0.201.159
>>>> ************************************************************/
>>>
>>> --
>>> Harsh J
>
> --
> Harsh J
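For anyone landing on this thread with the same EOFException: the "fall back to the last good checkpoint" route Harsh describes maps onto the -importCheckpoint startup option that 0.20 already ships. A rough sketch, assuming /data/dfs/name and /data/dfs/namesecondary stand in for your actual dfs.name.dir and fs.checkpoint.dir:

    # 1. Preserve the corrupt image before touching anything.
    cp -a /data/dfs/name /data/dfs/name.bak

    # 2. Empty the name directory; -importCheckpoint will fail if
    #    dfs.name.dir still holds a valid image.
    rm -rf /data/dfs/name/*

    # 3. Start the namenode so it loads the checkpoint from
    #    fs.checkpoint.dir and saves a fresh image into dfs.name.dir.
    bin/hadoop namenode -importCheckpoint

And if dfs.name.dir lists several directories, it is worth first checking whether one of the other copies survived intact, for example by comparing sizes and checksums:

    ls -l /data/dfs/name*/current/fsimage
    md5sum /data/dfs/name*/current/fsimage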