[
https://issues.apache.org/jira/browse/HDFS-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388618#comment-14388618
]
Yongjun Zhang commented on HDFS-8011:
-------------------------------------
Hi [~fujie],
You said "I am sure that the file was deleted". If the file was deleted but
still has an OP_CLOSE in the edit log, then your observation that "If we
restart SNN A again, "editlog-file-2" could be loaded correctly just like
"editlog-file-1" in last restart operation" is indeed mysterious, unless
applying OP_CLOSE silently ignores deleted files.
Can we dump the edit log with the oev tool and check whether the file involved
in the OP_CLOSE operation that throws the NPE was deleted (either it OR its
parent has an OP_DELETE) before the OP_CLOSE?
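As a rough sketch of that check: dump the segment with `hdfs oev` and list the
opcode/txid/path of each record in log order, so a delete of the file (or an
ancestor directory) that precedes the OP_CLOSE is visible. The sample XML below
is synthetic, only approximating the shape of oev output, and the file paths
are placeholders:

```shell
# Real dump would be:
#   hdfs oev -i <edits segment file> -o /tmp/edits.xml
# Synthetic stand-in for the oev XML output, for illustration only:
cat > /tmp/edits.xml <<'EOF'
<EDITS>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA><TXID>100</TXID><PATH>/xxx/_temporary</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_CLOSE</OPCODE>
    <DATA><TXID>101</TXID><PATH>/xxx/_temporary/xxx/part-r-00074.bz2</PATH></DATA>
  </RECORD>
</EDITS>
EOF
# List opcodes with their txids/paths in log order; a delete of the file
# or its parent appearing before the OP_CLOSE would confirm the theory.
grep -E '<(OPCODE|TXID|PATH)>' /tmp/edits.xml
```

If the parent directory's OP_DELETE shows up at a lower txid than the failing
OP_CLOSE, that would explain the NPE on replay.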
What do you mean by "20,000 operations failed in 500,000 operations"? What are
the error symptoms? As Vinayakumar requested, can we analyze the stack traces
of all failures to see whether they share the same exception stack?
Since you mentioned a problem with OP_ADD_BLOCK, it seems we are adding a
block to a deleted file. If so, this is very likely related to delayed block
removal, which ties in with your note that "at the same time datanode will
report heartbeat to both active and standby".
Thanks.
> standby NN can't be started
> ----------------------------
>
> Key: HDFS-8011
> URL: https://issues.apache.org/jira/browse/HDFS-8011
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.3.0
> Environment: CentOS 6.2 64-bit
> Reporter: fujie
>
> We have seen a crash when starting the standby namenode, with fatal errors.
> Any solutions, workarounds, or ideas would be helpful for us.
> 1. Here is the context:
> At the beginning we had 2 namenodes, with A as active and B as standby. For
> some reason, namenode A died, so namenode B took over as active.
> When we tried to restart A after a minute, it couldn't start. During this
> time a lot of files were put to HDFS, and a lot of files were renamed.
> Namenode A crashed while "awaiting reported blocks in safemode" each
> time.
>
> 2. We can see the error logs below:
> 1)2015-03-30 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception
> on operation CloseOp [length=0, inodeId=0,
> path=/xxx/_temporary/xxx/part-r-00074.bz2, replication=3,
> mtime=1427699913947, atime=1427699081161, blockSize=268435456,
> blocks=[blk_2103131025_1100889495739], permissions=dm:dm:rw-r--r--,
> clientName=, clientMachine=, opCode=OP_CLOSE, txid=7632753612]
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.setGenerationStampAndVerifyReplicas(BlockInfoUnderConstruction.java:247)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.commitBlock(BlockInfoUnderConstruction.java:267)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:639)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:813)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:383)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:209)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:122)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:737)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$0(EditLogTailer.java:302)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> 2)2015-03-30 FATAL
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error
> encountered while tailing edits. Shutting down standby NN.
> java.io.IOException: Failed to apply edit log operation AddBlockOp
> [path=/xxx/_temporary/xxx/part-m-00121,
> penultimateBlock=blk_2102331803_1100888911441,
> lastBlock=blk_2102661068_1100889009168, RpcClientId=, RpcCallId=-2]: error
> null
> at
> org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:122)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:737)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$0(EditLogTailer.java:302)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)