[
https://issues.apache.org/jira/browse/HDFS-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388332#comment-14388332
]
fujie commented on HDFS-8011:
-----------------------------
HDFS-6825 lists the affected version as 2.5.0, but our Hadoop version is 2.3.0. So are
you sure it is the same issue?
1. I am sure that the file was deleted, and I have some new findings.
Suppose we have "image-file-1", "editlog-file-1" and "editlog-file-inprogress"
when starting the standby namenode A.
I observed the following behavior for these files:
step-1) SNN loads "image-file-1" and "editlog-file-1" and generates a new
image file; call it "image-file-2".
step-2) SNN copies "image-file-2" to the active namenode.
step-3) "editlog-file-inprogress" is renamed to "editlog-file-2" and a new
"editlog-file-inprogress" is opened.
step-4) SNN loads "editlog-file-2"; at the same time, datanodes report
heartbeats to both the active and the standby.
The crash happens at step-4. We printed all the failed files, and all of them are
in "editlog-file-2".
We also gathered some statistics: 20,000 out of 500,000 operations failed. We then
parsed "editlog-file-2" and found that the failed records all look similar: in every
one of them, RPC_CLIENTID is empty and RPC_CALLID is -2 (a small scanning sketch
follows the sample record below). For example:
<RECORD>
<OPCODE>OP_ADD_BLOCK</OPCODE>
<DATA>
<TXID>7660428426</TXID>
<PATH>/workspace/dm/recommend/VideoQuality/VRII/AppList/data/interactivedata_month/_temporary/1/_temporary/attempt_1427018831005_178665_r_000002_0/part-r-00002</PATH>
<BLOCK>
<BLOCK_ID>2107099231</BLOCK_ID>
<NUM_BYTES>0</NUM_BYTES>
<GENSTAMP>1100893452304</GENSTAMP>
</BLOCK>
<RPC_CLIENTID></RPC_CLIENTID>
<RPC_CALLID>-2</RPC_CALLID>
</DATA>
</RECORD>
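As a quick sanity check that all the failed operations really follow this pattern,
here is a minimal sketch that scans the XML produced by the offline edits viewer and
counts the records whose RPC_CLIENTID is empty and whose RPC_CALLID is -2. It assumes
the edit log has already been converted to XML with "hdfs oev"; the file name
"editlog-file-2.xml" and the class name are placeholders, not anything from our setup.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: count edit-log records (from "hdfs oev" XML output) that carry
// no RPC client id and a call id of -2.
public class ScanEditLogXml {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(args.length > 0 ? args[0] : "editlog-file-2.xml"));
        NodeList records = doc.getElementsByTagName("RECORD");
        int matched = 0;
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            String clientId = text(record, "RPC_CLIENTID");
            String callId = text(record, "RPC_CALLID");
            // A call id of -2 appears to correspond to Hadoop's
            // RpcConstants.INVALID_CALL_ID, i.e. the op was logged without client
            // retry-cache ids (assumption, worth verifying against the 2.3.0 source).
            if (clientId.isEmpty() && "-2".equals(callId)) {
                matched++;
                System.out.println(text(record, "OPCODE") + " txid=" + text(record, "TXID"));
            }
        }
        System.out.println(matched + " of " + records.getLength()
                + " records have empty RPC_CLIENTID and RPC_CALLID=-2");
    }

    private static String text(Element record, String tag) {
        NodeList nodes = record.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}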
2. If we restart SNN A again, "editlog-file-2" is loaded correctly, just like
"editlog-file-1" was during the previous restart. That is strange.
Do the reported heartbeats affect this behavior? The "load" process and the "report"
process should be asynchronous, shouldn't they?
We are looking forward to your reply.
> standby NN can't be started
> ------------------------
>
> Key: HDFS-8011
> URL: https://issues.apache.org/jira/browse/HDFS-8011
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.3.0
> Environment: centeros 6.2 64bit
> Reporter: fujie
>
> We have seen crashes when starting the standby namenode, with fatal errors. Any
> solutions, workarounds, or ideas would be helpful for us.
> 1. Here is the context:
> In the beginning we had 2 namenodes: A as active and B as standby. For
> some reason, namenode A died, so namenode B took over as active.
> When we tried to restart A after a minute, it would not come up. During this
> time a lot of files were put into HDFS, and a lot of files were renamed.
> Namenode A crashed while "awaiting reported blocks in safemode" each
> time.
>
> 2. We can see error log below:
> 1)2015-03-30 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception
> on operation CloseOp [length=0, inodeId=0,
> path=/xxx/_temporary/xxx/part-r-00074.bz2, replication=3,
> mtime=1427699913947, atime=1427699081161, blockSize=268435456,
> blocks=[blk_2103131025_1100889495739], permissions=dm:dm:rw-r--r--,
> clientName=, clientMachine=, opCode=OP_CLOSE, txid=7632753612]
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.setGenerationStampAndVerifyReplicas(BlockInfoUnderConstruction.java:247)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.commitBlock(BlockInfoUnderConstruction.java:267)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:639)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:813)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:383)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:209)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:122)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:737)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$0(EditLogTailer.java:302)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> 2)2015-03-30 FATAL
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error
> encountered while tailing edits. Shutting down standby NN.
> java.io.IOException: Failed to apply edit log operation AddBlockOp
> [path=/xxx/_temporary/xxx/part-m-00121,
> penultimateBlock=blk_2102331803_1100888911441,
> lastBlock=blk_2102661068_1100889009168, RpcClientId=, RpcCallId=-2]: error
> null
> at
> org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:122)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:737)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$0(EditLogTailer.java:302)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>