Hi,

today, on one of our production clusters (not under unusual stress recently, and limited to four nodes), I found that the fsimage and all of the NameNode metadata are missing, causing the scheduled MR jobs to fail. However, the NameNode log gives little useful troubleshooting information, nor any clue that could explain this disaster. Although it is easy enough to re-create the metadata or recover from a checkpoint, I am curious what could possibly lead to such a severe situation, since we have never seen this problem on our other setup (identical hardware, but with less load applied).
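For context, the checkpoint recovery I have in mind is the usual Hadoop 1.x procedure, roughly as below, assuming the SecondaryNameNode's fs.checkpoint.dir survived (paths here are illustrative, not our real ones):

```shell
# 1. Verify the checkpoint directory on the secondary still holds a usable image.
ls -l /path/to/fs.checkpoint.dir/current/   # expect VERSION, fsimage, edits

# 2. With an empty (but existing) dfs.name.dir on the NameNode host,
#    load the latest checkpoint from fs.checkpoint.dir into it.
hadoop namenode -importCheckpoint

# 3. Sanity-check the restored namespace before resuming jobs.
hadoop fsck /
```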

Our NameNode runs on RAID1, and we see no hardware alarms from the remote console nor anything in the system log. The only warning I found is the '/getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException' IOException from two days ago; could this be related to the corruption?


Could anyone kindly shed some light on this issue? Thanks.


Cheers,
Jason



- NameNode fails to start:


12/07/09 10:49:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
12/07/09 10:49:57 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext31
12/07/09 10:49:57 INFO util.GSet: VM type       = 64-bit
12/07/09 10:49:57 INFO util.GSet: 2% max memory = 2.45375 MB
12/07/09 10:49:57 INFO util.GSet: capacity      = 2^18 = 262144 entries
12/07/09 10:49:57 INFO util.GSet: recommended=262144, actual=262144
12/07/09 10:49:57 INFO namenode.FSNamesystem: fsOwner=hdfs
12/07/09 10:49:57 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/09 10:49:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/09 10:49:57 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
12/07/09 10:49:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/09 10:49:57 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext31
12/07/09 10:49:57 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)
12/07/09 10:49:57 ERROR namenode.NameNode: java.io.IOException: NameNode is not formatted.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:329)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)



- earlier log messages:


2012-07-07 00:03:35,463 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.70.18
2012-07-07 00:03:35,465 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:83)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:78)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:78)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

2012-07-07 00:05:01,122 INFO org.apache.hadoop.hdfs.StateChange: *DIR* NameNode.reportBadBlocks
2012-07-07 00:05:01,122 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1866785886679796114 added as corrupt on 192.168.70.19:50010 by /192.168.70.19
2012-07-07 00:05:57,707 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.70.20:50010 to replicate blk_-5834163689094324395_5939 to datanode(s) 192.168.70.19:50010 192.168.70.21:50010
2012-07-07 00:05:57,851 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.70.19:50010 is added to blk_-5834163689094324395_5939 size 142551
2012-07-07 00:05:57,887 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.70.21:50010 is added to blk_-5834163689094324395_5939 size 142551
2012-07-07 00:06:00,707 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.70.21:50010 to replicate blk_-269424213420340184_5940 to datanode(s) 192.168.70.19:50010
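(Why the /getimage warning worries me: that servlet is what the SecondaryNameNode uses to fetch fsimage+edits, merge them, and upload the merged image back. A toy model of that cycle — pure Python, nothing Hadoop-specific, transaction counts stand in for the real files — just to show that if the fetch step throws, no new checkpoint is produced and the last good image stays frozen:)

```python
def checkpoint(primary, secondary, fetch_ok):
    """One checkpoint round in a toy model. `primary`/`secondary` are
    dicts like {'image': n, 'edits': m}, where the numbers stand in for
    transaction counts rather than real fsimage/edits files."""
    if not fetch_ok:
        # /getimage failed (as in the NPE above): the cycle aborts here,
        # leaving both the secondary's copy and the edit log untouched.
        return
    merged = primary['image'] + primary['edits']  # merge image with edits
    secondary['image'] = merged                   # secondary keeps the new image
    primary['image'] = merged                     # merged image is uploaded back
    primary['edits'] = 0                          # edits are rolled into the image
```

So a failing /getimage on its own should only make checkpoints stale, not destroy the primary's metadata, which is why I suspect something else deleted the files.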


