Todd-

Thanks for your reply. I went out on a limb, started digging in the source code, and figured it was the FSImage. So I saved it, copied over the copy from my checkpoint directory, and got running again.
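For anyone hitting the same thing, the recovery boils down to swapping in the SecondaryNameNode's checkpoint copy of the fsimage. A minimal sketch, with the NameNode stopped first (stop-dfs.sh) and restarted after (start-dfs.sh); the paths here are assumptions, use your actual dfs.name.dir and fs.checkpoint.dir values:

```shell
#!/bin/bash
# Sketch of the fsimage swap described above. Directory layouts are
# assumptions based on the default dfs.name.dir / fs.checkpoint.dir
# structure ("current/fsimage" under each); adjust to your site config.
restore_fsimage() {
  local name_dir="$1"   # dfs.name.dir on the NameNode
  local ckpt_dir="$2"   # fs.checkpoint.dir from the SecondaryNameNode
  # Keep the corrupt image around in case it is needed for diagnosis.
  cp "$name_dir/current/fsimage" "$name_dir/current/fsimage.corrupt"
  # Replace it with the last good checkpoint copy.
  cp "$ckpt_dir/current/fsimage" "$name_dir/current/fsimage"
}
```

You lose any namespace changes made after the last checkpoint, so this is a last resort, not a routine fix.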
I ran a few jobs to test, and again ran into a problem getting the new node running. Once again it looks like I will have to manually force an exit from safe mode to run fsck -move.

I sent mail to Harsh earlier - I think I must migrate to CDH, as I fear my manual hacking with configs and such has caused the fragile state the cluster is in now.

Thanks,
Terry

On 05/18/2012 12:34 PM, Todd Lipcon wrote:
> Hi Terry,
>
> It seems like something got truncated in your FSImage... though it's
> unclear how that might have happened.
>
> If you're able to share your logs and your dfs.name.dir contents, feel
> free to contact me off-list and I can try to take a look to diagnose
> the issue and try to recover the system. Of course, whenever any
> corruption issue occurs we take it seriously and want to get at a root
> cause to prevent future occurrences!
>
> Thanks
> -Todd
>
> On Fri, May 18, 2012 at 6:57 AM, Terry Healy <the...@bnl.gov> wrote:
>> Sorry, forgot to attach the trace:
>> <code>
>> 2012-05-18 09:54:45,355 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 128
>> 2012-05-18 09:54:45,379 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>> 2012-05-18 09:54:45,380 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
>>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
>>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>
>> 2012-05-18 09:54:45,380 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>> /************************************************************
>> SHUTDOWN_MSG: Shutting down NameNode at abcd/1xx.1xx.2xx.3xx
>> ************************************************************/
>>
>> </code>
>>
>> On 05/18/2012 09:51 AM, Terry Healy wrote:
>>> Running Apache 1.0.2, ~12 datanodes.
>>>
>>> Ran fsck / -> OK before; everything running as expected.
>>>
>>> Started trying to use a script to assign nodes to racks, which required
>>> several stop-dfs.sh / start-dfs.sh cycles (with some stop-all.sh /
>>> start-all.sh too, if that matters).
>>>
>>> Got past errors in the script and data file, but dfsadmin -report still
>>> showed all nodes assigned to the default rack. I tried replacing one
>>> system name in the rack mapping file with its IP address. At this point
>>> the NN failed to start up.
>>>
>>> So I commented out the topology.script.file.name property statements in
>>> hdfs-site.xml.
>>>
>>> The NN still fails to start; the trace below indicates an EOFException,
>>> but I don't know what file it can't read.
>>>
>>> As always, your patience with a noob is appreciated; any suggestions to
>>> get started again? (I can forget about the rack assignment for now.)
>>>
>>> Thanks.

--
Terry Healy / the...@bnl.gov
Cyber Security Operations
Brookhaven National Laboratory
Building 515, Upton N.Y. 11973
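For the record, the manual safe-mode exit and block cleanup I keep doing is just these two Hadoop 1.x commands (run as the HDFS superuser; -move relocates files with missing blocks to /lost+found, so use it deliberately):

```shell
# Force the NameNode out of safe mode instead of waiting for the
# replication threshold to be met.
hadoop dfsadmin -safemode leave

# Check the namespace and move corrupt files to /lost+found.
hadoop fsck / -move
```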
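For reference, the rack-assignment script discussed in the thread is just an executable that topology.script.file.name points at: Hadoop invokes it with one or more hostnames or IPs and expects one rack path per line on stdout. A hypothetical sketch (the subnets, hostnames, and rack names here are invented):

```shell
#!/bin/bash
# Hypothetical topology script for topology.script.file.name.
# Hadoop passes hostnames and/or IPs as arguments; anything we don't
# recognize falls back to /default-rack, which is also what Terry saw
# in dfsadmin -report while the script wasn't being picked up.
resolve_rack() {
  case "$1" in
    10.0.1.*|node0[1-4]*) echo "/rack1" ;;
    10.0.2.*|node0[5-8]*) echo "/rack2" ;;
    *)                    echo "/default-rack" ;;
  esac
}

for host in "$@"; do
  resolve_rack "$host"
done
```

Because the NameNode may hand the script an IP address rather than the configured hostname, a mapping keyed only on hostnames silently returns /default-rack for IP lookups; matching both forms, as above, avoids that.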