[
https://issues.apache.org/jira/browse/HDFS-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388618#comment-14388618
]
Yongjun Zhang commented on HDFS-8011:
-------------------------------------
Hi [~fujie],
You said "I am sure that the file was deleted". If the file was deleted but
still has an OP_CLOSE in the edit log, then your observation that "If we
restart SNN A again, "editlog-file-2" could be loaded correctly just like
"editlog-file-1" in last restart operation" is indeed mysterious, unless
applying OP_CLOSE silently ignores deleted files.
Can we dump the edit log with the oev tool and check whether the file involved
in the OP_CLOSE operation that throws the NPE was deleted (either it OR its
parent has an OP_DELETE) before the OP_CLOSE?
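As a rough sketch of that check: dump the segment with `hdfs oev` and list the
opcode/txid/path of each record in log order, so a delete of the file (or an
ancestor directory) that precedes the OP_CLOSE is visible. The sample XML below
is synthetic, only approximating the shape of oev output, and the file paths
are placeholders:

```shell
# Real dump would be:
#   hdfs oev -i <edits segment file> -o /tmp/edits.xml
# Synthetic stand-in for the oev XML output, for illustration only:
cat > /tmp/edits.xml <<'EOF'
<EDITS>
  <RECORD>
    <OPCODE>OP_DELETE</OPCODE>
    <DATA><TXID>100</TXID><PATH>/xxx/_temporary</PATH></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_CLOSE</OPCODE>
    <DATA><TXID>101</TXID><PATH>/xxx/_temporary/xxx/part-r-00074.bz2</PATH></DATA>
  </RECORD>
</EDITS>
EOF
# List opcodes with their txids/paths in log order; a delete of the file
# or its parent appearing before the OP_CLOSE would confirm the theory.
grep -E '<(OPCODE|TXID|PATH)>' /tmp/edits.xml
```

If the parent directory's OP_DELETE shows up at a lower txid than the failing
OP_CLOSE, that would explain the NPE on replay.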
What do you mean by "20,000 operations failed in 500,000 operations"? What are
the error symptoms? As Vinayakumar requested, can we analyze the stack traces
of all failures to see whether they share the same exception stack?
Since you mentioned a problem with OP_ADD_BLOCK, it seems we are adding a
block to a deleted file. If so, this is very likely related to delayed block
removal, which ties in with your note that "at the same time datanode will
report heartbeat to both active and standby".
Thanks.
> standby NN can't be started
> ----------------------------
>
> Key: HDFS-8011
> URL: https://issues.apache.org/jira/browse/HDFS-8011
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.3.0
> Environment: CentOS 6.2 64-bit
> Reporter: fujie
>
> We have seen a crash when starting the standby namenode, with fatal errors.
> Any solutions, workarounds, or ideas would be helpful for us.
> 1. Here is the context:
> At the beginning we had 2 namenodes, with A as active and B as standby. For
> some reason, namenode A died, so namenode B took over as active.
> When we tried to restart A after a minute, it couldn't start. During this
> time a lot of files were put to HDFS, and a lot of files were renamed.
> Namenode A crashed while "awaiting reported blocks in safemode" each
> time.
>
> 2. We can see the error logs below:
> 1)2015-03-30 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception
> on operation CloseOp [length=0, inodeId=0,
> path=/xxx/_temporary/xxx/part-r-00074.bz2, replication=3,
> mtime=1427699913947, atime=1427699081161, blockSize=268435456,
> blocks=[blk_2103131025_1100889495739], permissions=dm:dm:rw-r--r--,
> clientName=, clientMachine=, opCode=OP_CLOSE, txid=7632753612]
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.setGenerationStampAndVerifyReplicas(BlockInfoUnderConstruction.java:247)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoUnderConstruction.commitBlock(BlockInfoUnderConstruction.java:267)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:639)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:813)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:383)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:209)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:122)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:737)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$0(EditLogTailer.java:302)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> 2)2015-03-30 FATAL
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error
> encountered while tailing edits. Shutting down standby NN.
> java.io.IOException: Failed to apply edit log operation AddBlockOp
> [path=/xxx/_temporary/xxx/part-m-00121,
> penultimateBlock=blk_2102331803_1100888911441,
> lastBlock=blk_2102661068_1100889009168, RpcClientId=, RpcCallId=-2]: error
> null
> at
> org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:122)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:737)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$0(EditLogTailer.java:302)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:413)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)