Hi,
Today, on one of our production clusters (not under any unusual stress recently, and limited to four nodes), I found that the fsimage and all of the NameNode metadata are missing, which caused the scheduled MR jobs to fail. However, the NameNode log contains little useful troubleshooting information, nor any clue that could explain this disaster. Although it is easy enough to re-create the metadata or recover from a checkpoint, I am curious what could have caused such a severe failure, since we have never seen this problem on our other setup (identical hardware, but under lighter load). Our NameNode runs on RAID 1, and we see no hardware alarms on the remote console or in the system log. The only warning I found is the '/getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException' exception from two days ago; could this be related to the corruption? Could an expert kindly shed some light on this issue? Thanks.

Cheers,
Jason

- NameNode fails to start:

12/07/09 10:49:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
12/07/09 10:49:57 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext31
12/07/09 10:49:57 INFO util.GSet: VM type       = 64-bit
12/07/09 10:49:57 INFO util.GSet: 2% max memory = 2.45375 MB
12/07/09 10:49:57 INFO util.GSet: capacity      = 2^18 = 262144 entries
12/07/09 10:49:57 INFO util.GSet: recommended=262144, actual=262144
12/07/09 10:49:57 INFO namenode.FSNamesystem: fsOwner=hdfs
12/07/09 10:49:57 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/09 10:49:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/09 10:49:57 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
12/07/09 10:49:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/09 10:49:57 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext31
12/07/09 10:49:57 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)
12/07/09 10:49:57 ERROR namenode.NameNode: java.io.IOException: NameNode is not formatted.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)

- Earlier messages:

2012-07-07 00:03:35,463 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.70.18
2012-07-07 00:03:35,465 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed.
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:83)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:78)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:78)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2012-07-07 00:05:01,122 INFO org.apache.hadoop.hdfs.StateChange: *DIR* NameNode.reportBadBlocks
2012-07-07 00:05:01,122 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1866785886679796114 added as corrupt on 192.168.70.19:50010 by /192.168.70.19
2012-07-07 00:05:57,707 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.70.20:50010 to replicate blk_-5834163689094324395_5939 to datanode(s) 192.168.70.19:50010 192.168.70.21:50010
2012-07-07 00:05:57,851 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.70.19:50010 is added to blk_-5834163689094324395_5939 size 142551
2012-07-07 00:05:57,887 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.70.21:50010 is added to blk_-5834163689094324395_5939 size 142551
2012-07-07 00:06:00,707 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.70.21:50010 to replicate blk_-269424213420340184_5940 to datanode(s) 192.168.70.19:50010
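P.S. In the meantime, the safeguard I'm planning to add (not a fix for the root cause) is to list more than one directory in dfs.name.dir, including one off-host mount, so the NameNode writes its fsimage and edits to redundant locations. A sketch of the hdfs-site.xml change, assuming Hadoop 1.x; the paths below are placeholders, not our actual layout:

```xml
<!-- hdfs-site.xml (sketch; /data/1/dfs/name and /mnt/nfs/dfs/name are example paths) -->
<property>
  <name>dfs.name.dir</name>
  <!-- comma-separated list: local disk plus an NFS mount for off-host redundancy -->
  <value>/data/1/dfs/name,/mnt/nfs/dfs/name</value>
</property>
```

For the immediate recovery I intend to try `hadoop namenode -importCheckpoint` with fs.checkpoint.dir pointing at the SecondaryNameNode's last checkpoint, understanding that this only restores state as of that checkpoint.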