Hi Wei,

This is an interesting bug - I haven't seen it before in the 0.20 series.
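For context on what's failing: saveFilesUnderConstruction runs while the new
fsimage is being written, and it expects every path held by the lease manager
(i.e., every file currently open for write) to still resolve to an entry in
the namespace. In your case a lease has survived for a path the namespace no
longer contains, so the image save aborts. That's also why restarting the NN
didn't help: on startup it replays the same fsimage + edits + edits.new and
then tries to save a merged image, hitting the same stale lease. Here is a
tiny self-contained model of the check - illustrative names only, not the
actual 0.20.2 code:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy model of the invariant saveFilesUnderConstruction enforces.
    // Illustrative names only -- this is not the real FSNamesystem code.
    public class SaveLeasesCheck {

      // Stand-in for the namespace tree: path -> inode.
      static Map<String, Object> namespace = new HashMap<String, Object>();

      // Every leased (open-for-write) path must still resolve in the
      // namespace when the image is saved; otherwise the save aborts.
      static void saveLeases(List<String> leasedPaths) throws IOException {
        for (String path : leasedPaths) {
          if (!namespace.containsKey(path)) {
            throw new IOException("saveLeases found path " + path
                + " but no matching entry in namespace.");
          }
        }
      }

      public static void main(String[] args) throws IOException {
        namespace.put("/a/file1", new Object());
        // A lease that outlived its file (e.g. the file was deleted or
        // renamed by a later edit) reproduces exactly the error you saw:
        saveLeases(Arrays.asList("/a/file1", "/a/file2"));
      }
    }

In the real code the leases come from the LeaseManager and the lookup goes
through FSDirectory, but the invariant is the same: a stale lease with no
matching inode aborts the image save.

Whatever you try next, please make a copy of the whole dfs/name/current
directory first (fsimage, edits, edits.new, VERSION) while the NN is shut
down - it's only metadata, so a plain cp or tar is safe, and it guarantees a
recovery attempt can't make things worse.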
Is your namespace data confidential? If not, would it be possible to upload a
bzip2 of your NN storage directory somewhere publicly accessible? Note that it
would not include any file contents, just filenames.

Thanks
-Todd

On Thu, Jul 22, 2010 at 11:46 PM, Wei Zeng <wz...@shendusou.com> wrote:
>
> Hello,
>
> It seems I should send this mail here; I got no response on the
> common-user mailing list yesterday.
>
> I have a Hadoop (0.20.2) cluster running with 12 nodes. Today, while
> checking the log of the secondary namenode, I found some exceptions like:
>
> 2010-07-22 00:14:28,042 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
> 2010-07-22 00:14:28,051 INFO org.apache.hadoop.hdfs.server.common.Storage: Recovering storage directory /data1/hadoop-0.20.0-cozy/dfs/namesecondary from failed checkpoint.
> 2010-07-22 00:14:28,784 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file fsimage size 34054802 bytes.
> 2010-07-22 00:14:28,803 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits size 1256592 bytes.
> 2010-07-22 00:14:28,878 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=cozy,cozy,adm,dialout,cdrom,plugdev,lpadmin,sambashare,admin
> 2010-07-22 00:14:28,878 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-07-22 00:14:28,878 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
> 2010-07-22 00:14:28,879 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 258478
> 2010-07-22 00:14:35,019 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 7
> 2010-07-22 00:14:35,206 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /data1/hadoop-0.20.0-cozy/dfs/namesecondary/current/edits of size 1256592 edits # 12121 loaded in 0 seconds.
> 2010-07-22 00:14:35,895 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
> 2010-07-22 00:14:35,896 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.IOException: saveLeases found path /mantis_file/antis_20100713/web_page/wiley_intsci/0/wiley_intsci_0_0_0_0_0_3_2_0.yaml but no matching entry in namespace.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:4735)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1029)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1050)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:592)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
>         at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
>         at java.lang.Thread.run(Thread.java:619)
>
> Then I shut down the whole cluster and restarted the namenode, hoping that
> would clear the exception.
> However, the namenode failed while initializing. The log is:
>
> 10/07/23 00:49:40 INFO namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = dev1/192.168.18.3
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.2
> STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
> ************************************************************/
> 10/07/23 00:49:40 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
> 10/07/23 00:49:40 INFO namenode.NameNode: Namenode up at: dev1.shendusou.com/192.168.18.3:9000
> 10/07/23 00:49:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
> 10/07/23 00:49:40 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
> 10/07/23 00:49:40 INFO namenode.FSNamesystem: fsOwner=cozy,cozy,adm,dialout,cdrom,plugdev,lpadmin,sambashare,admin
> 10/07/23 00:49:40 INFO namenode.FSNamesystem: supergroup=supergroup
> 10/07/23 00:49:40 INFO namenode.FSNamesystem: isPermissionEnabled=true
> 10/07/23 00:49:40 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
> 10/07/23 00:49:40 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
> 10/07/23 00:49:40 INFO common.Storage: Number of files = 258478
> 10/07/23 00:49:46 INFO common.Storage: Number of files under construction = 7
> 10/07/23 00:49:46 INFO common.Storage: Image file of size 34054802 loaded in 5 seconds.
> 10/07/23 00:49:47 INFO common.Storage: Edits file /data1/hadoop-0.20.0-cozy/dfs/name/current/edits of size 1256592 edits # 12121 loaded in 0 seconds.
> 10/07/23 00:50:04 INFO namenode.FSNamesystem: Invalid opcode, reached end of edit log Number of transactions found 958197
> 10/07/23 00:50:04 INFO common.Storage: Edits file /data1/hadoop-0.20.0-cozy/dfs/name/current/edits.new of size 104297870 edits # 958197 loaded in 16 seconds.
> 10/07/23 00:50:05 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
> java.io.IOException: saveLeases found path /mantis_file/antis_20100713/web_page/wiley_intsci/0/wiley_intsci_0_0_0_0_0_3_2_0.yaml but no matching entry in namespace.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:4735)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1029)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1050)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
> 10/07/23 00:50:05 INFO ipc.Server: Stopping server on 9000
> 10/07/23 00:50:05 ERROR namenode.NameNode: java.io.IOException: saveLeases found path /mantis_file/antis_20100713/web_page/wiley_intsci/0/wiley_intsci_0_0_0_0_0_3_2_0.yaml but no matching entry in namespace.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:4735)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1029)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1050)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
>
> 10/07/23 00:50:05 INFO namenode.NameNode: SHUTDOWN_MSG:
>
> I found some JIRAs about this (HADOOP-3418, HADOOP-3724); however, those
> issues were reported against 0.18 and 0.19 and all seem to have been fixed
> before 0.20.0. Has the bug come back in 0.20.2?
>
> I tried to recover from the checkpoint on the secondary namenode, but the
> problem is that the checkpoint is empty.
>
> Can anyone tell me how to recover the namenode from the current fsimage,
> edits, and edits.new (or whatever else is on the namenode)? Losing some
> data is OK; losing everything would be terrible.
>
> Is there a better way to recover the cluster besides formatting the
> namenode?
>
> Any ideas are greatly appreciated!
>
> --
> wzeng

--
Todd Lipcon
Software Engineer, Cloudera
