[ 
https://issues.apache.org/jira/browse/HDFS-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254245#comment-17254245
 ] 

Konstantin Shvachko commented on HDFS-15746:
--------------------------------------------

[~hexiaoqiao] thanks for reporting this. Which version of Hadoop do you see 
this with?
I agree with [~elgoiri] it would be really good to understand the root cause, 
so that we could add a unit test. 

> Standby NameNode crash when replay editlog
> ------------------------------------------
>
>                 Key: HDFS-15746
>                 URL: https://issues.apache.org/jira/browse/HDFS-15746
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15746.001.patch
>
>
> Standby NameNode meet NPE and crash when replay editlog, After dig log and 
> source code, Not found the root cause. But some information may be useful for 
> this case.
> a. before SBN crash, ANN do one lease recovery.
> {code:java}
> 2020-12-23 12:37:45,946 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: $PATH has not been closed. Lease recovery is 
> in progress. RecoveryId = 21696709510 for block blk_*_21658833701
> {code}
> b. then one Datanode Volumn failed which manage one replica of 
> blk_*_21658833701 after lease recovery.
> c. after half one hour, SBN crash because NPE as following.
> {code:java}
> 2020-12-23 13:13:36,703 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
> on operation CloseOp [length=0, inodeId=0, path=$PATH, replication=3, 
> mtime=1608698268201, atime=1608343529481, blockSize=268435456, 
> blocks=[blk_$i_$j], permissions=user:group:rw-r--r--, aclEntries=null, 
> clientName=, clientMachine=, overwrite=false, storagePolicyId=0, 
> opCode=OP_CLOSE, txid=$txid]
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:360)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
> 2020-12-23 13:13:36,703 ERROR org.apache.hadoop.ipc.Server: Error in Reader
> java.nio.channels.ClosedChannelException
>         at 
> java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
>         at 
> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1053)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1034)
> 2020-12-23 13:13:36,703 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.16.39.26:50010 is added to blk_22374572883_21672067156 
> size 58762255
> 2020-12-23 13:13:36,704 FATAL 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error 
> encountered while tailing edits. Shutting down standby NN.
> java.io.IOException: java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:254)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:360)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
>         ... 12 more
> {code}
> Not very clear about the relation between [lease recovery/volumn failed/sbn 
> crash], but I think we should catch null when remove stale Replicas to avoid 
> this fatal. 
> Our production version is 2.*, and IMO this issue also exist at trunk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to