[
https://issues.apache.org/jira/browse/HDFS-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254245#comment-17254245
]
Konstantin Shvachko commented on HDFS-15746:
--------------------------------------------
[~hexiaoqiao] thanks for reporting this. Which version of Hadoop do you see
this with?
I agree with [~elgoiri]: it would be really good to understand the root cause,
so that we can add a unit test.
> Standby NameNode crash when replay editlog
> ------------------------------------------
>
> Key: HDFS-15746
> URL: https://issues.apache.org/jira/browse/HDFS-15746
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Attachments: HDFS-15746.001.patch
>
>
> The Standby NameNode (SBN) hit an NPE and crashed while replaying the edit
> log. After digging through the logs and source code, I have not found the
> root cause, but the following information may be useful for this case.
> a. Before the SBN crashed, the Active NameNode (ANN) performed one lease recovery.
> {code:java}
> 2020-12-23 12:37:45,946 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.internalReleaseLease: $PATH has not been closed. Lease recovery is
> in progress. RecoveryId = 21696709510 for block blk_*_21658833701
> {code}
> b. After the lease recovery, a volume failed on one DataNode that held a
> replica of blk_*_21658833701.
> c. About half an hour later, the SBN crashed with the following NPE.
> {code:java}
> 2020-12-23 13:13:36,703 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception
> on operation CloseOp [length=0, inodeId=0, path=$PATH, replication=3,
> mtime=1608698268201, atime=1608343529481, blockSize=268435456,
> blocks=[blk_$i_$j], permissions=user:group:rw-r--r--, aclEntries=null,
> clientName=, clientMachine=, overwrite=false, storagePolicyId=0,
> opCode=OP_CLOSE, txid=$txid]
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
> 2020-12-23 13:13:36,703 ERROR org.apache.hadoop.ipc.Server: Error in Reader
> java.nio.channels.ClosedChannelException
> at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
> at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1053)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1034)
> 2020-12-23 13:13:36,703 INFO BlockStateChange: BLOCK* addStoredBlock:
> blockMap updated: 10.16.39.26:50010 is added to blk_22374572883_21672067156
> size 58762255
> 2020-12-23 13:13:36,704 FATAL
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error
> encountered while tailing edits. Shutting down standby NN.
> java.io.IOException: java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:254)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
> Caused by: java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
> ... 12 more
> {code}
> The relationship between the lease recovery, the volume failure, and the SBN
> crash is not clear to me, but I think we should guard against null when
> removing stale replicas to avoid this fatal error.
> Our production version is 2.*, and IMO this issue also exists on trunk.
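
For illustration only, the kind of null guard proposed in the description might look like the sketch below. It does not use the real HDFS classes: `StaleReplicaSketch`, `Replica`, and `findStaleReplicas` are hypothetical names, and the idea that a failed volume leaves a null entry behind is an assumption, since the root cause has not been identified. A guard like this would only keep the edit log replay from aborting; it would not explain why the entry is null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the real HDFS BlockInfo/replica classes.
class StaleReplicaSketch {

    /** Simplified replica record: the storage that holds it and its generation stamp. */
    static final class Replica {
        final String storageId;
        final long genStamp;

        Replica(String storageId, long genStamp) {
            this.storageId = storageId;
            this.genStamp = genStamp;
        }
    }

    /**
     * Returns the replicas whose generation stamp differs from newGenStamp.
     * A null list or a null entry is skipped instead of being dereferenced.
     */
    static List<Replica> findStaleReplicas(List<Replica> replicas, long newGenStamp) {
        List<Replica> stale = new ArrayList<>();
        if (replicas == null) {
            return stale; // nothing to verify
        }
        for (Replica r : replicas) {
            if (r == null) {
                continue; // assumed failure mode: entry lost after a volume failure
            }
            if (r.genStamp != newGenStamp) {
                stale.add(r);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<Replica> replicas = new ArrayList<>();
        replicas.add(new Replica("DS-1", 100L));
        replicas.add(null); // simulated missing entry
        replicas.add(new Replica("DS-3", 99L));
        List<Replica> stale = findStaleReplicas(replicas, 100L);
        System.out.println(stale.size() + " stale replica(s): " + stale.get(0).storageId);
    }
}
```

A unit test for the real fix would still need to reproduce how the null arises, which is why understanding the root cause matters.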