Hi.

We are using a cluster of 2 computers (1 namenode and 2 secondarynodes) to store a large number of text files in HDFS. The process had been running for at least a couple of weeks when, due to a power failure, the server was reset, so HDFS did not shut down cleanly. When I tried to restart the cluster, I got a NullPointerException, with the following stack trace (from the logs).
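For context, the restart was attempted with the stock scripts shipped with Hadoop 0.20.2, roughly as below (a sketch only; the HADOOP_HOME path and log filename pattern are our local setup, not anything special):

```shell
# Stop any daemons still left running after the power failure
$HADOOP_HOME/bin/stop-all.sh

# Restart HDFS; the namenode dies while replaying the edits log
$HADOOP_HOME/bin/start-dfs.sh

# Inspect the namenode log for the failure
tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log
```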

2011-05-18 06:57:39,313 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=YYYYY
2011-05-18 06:57:39,321 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: master/172.XXX.XXX.XXX:YYYYY
2011-05-18 06:57:39,326 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-05-18 06:57:39,329 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-05-18 06:57:39,444 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=vishaal,vishaal
2011-05-18 06:57:39,444 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-05-18 06:57:39,444 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2011-05-18 06:57:39,459 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-05-18 06:57:39,461 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2011-05-18 06:57:39,521 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
2011-05-18 06:57:39,531 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2011-05-18 06:57:39,531 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 97 loaded in 0 seconds.
2011-05-18 06:57:39,532 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /home/vishaal/hadoop-0.20.2/tmp/dfs/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2011-05-18 06:57:39,535 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1320)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1309)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:997)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

2011-05-18 06:57:39,537 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 172.XXX.XXX.XXX
************************************************************/

Though this was just an experiment to test the reliability of HDFS storage, I would love to get it running again, assuming, of course, that the data can be recovered if it is corrupted. A few more questions:

   * Is this a common problem? Is there an available patch? (I
     couldn't find one after a lot of Googling.)
   * If the servers are prone to power failures, is HDFS still a good
     choice for storing the data?
   * If this occurs, does it mean that all the data is corrupt, or
     only some of it? Can the corrupted data be recovered?

Would appreciate a prompt reply, as this was an attempt to prove the concept of using a distributed file system to store large amounts of text, as opposed to a relational database. (I hope you understand that I am in the line of fire.)

Thanks in advance.
Vishaal Jatav.
(vishaal[dot]iitb04[at]gmail[dot]com)
