Xiaoqiao He created HDFS-15746:
----------------------------------
Summary: Standby NameNode crash when replay editlog
Key: HDFS-15746
URL: https://issues.apache.org/jira/browse/HDFS-15746
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Reporter: Xiaoqiao He
Assignee: Xiaoqiao He
The Standby NameNode (SBN) hits an NPE and crashes while replaying the edit
log. After digging through the logs and source code, I have not found the root
cause, but the following information may be useful for this case.
a. Before the SBN crashed, the Active NameNode (ANN) performed one lease recovery.
{code:java}
2020-12-23 12:37:45,946 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.internalReleaseLease: $PATH has not been closed. Lease recovery is
in progress. RecoveryId = 21696709510 for block blk_*_21658833701
{code}
b. After the lease recovery, a volume failed on a DataNode that held one
replica of blk_*_21658833701.
c. About half an hour later, the SBN crashed with the following NPE.
{code:java}
2020-12-23 13:13:36,703 ERROR
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception
on operation CloseOp [length=0, inodeId=0, path=$PATH, replication=3,
mtime=1608698268201, atime=1608343529481, blockSize=268435456,
blocks=[blk_$i_$j], permissions=user:group:rw-r--r--, aclEntries=null,
clientName=, clientMachine=, overwrite=false, storagePolicyId=0,
opCode=OP_CLOSE, txid=$txid]
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2020-12-23 13:13:36,703 ERROR org.apache.hadoop.ipc.Server: Error in Reader
java.nio.channels.ClosedChannelException
    at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
    at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1053)
    at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1034)
2020-12-23 13:13:36,703 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap
updated: 10.16.39.26:50010 is added to blk_22374572883_21672067156 size 58762255
2020-12-23 13:13:36,704 FATAL
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error
encountered while tailing edits. Shutting down standby NN.
java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:254)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
    ... 12 more
{code}
The relationship between the lease recovery, the volume failure, and the SBN
crash is not clear to me, but I think we should guard against null when
removing stale replicas to avoid this fatal error.
Our production version is 2.*, and IMO this issue also exists in trunk.
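To make the proposed guard concrete, here is a minimal, self-contained sketch.
It is not the Hadoop code: StaleReplicaSketch, Storage, Replica and
removeStaleReplicas are hypothetical stand-ins, and it assumes (unconfirmed,
since the root cause is unknown) that the null reference is the storage of a
stale replica, for example after a volume failure.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model; none of these types are Hadoop classes. */
public class StaleReplicaSketch {

    /** Stand-in for a DataNode storage that tracks the block IDs it holds. */
    static final class Storage {
        final Set<Long> blockIds = new HashSet<>();
    }

    /** Stand-in for one replica of an under-construction block. */
    static final class Replica {
        final long generationStamp;
        final Storage storage; // ASSUMPTION: may be null, e.g. after a volume failure

        Replica(long generationStamp, Storage storage) {
            this.generationStamp = generationStamp;
            this.storage = storage;
        }
    }

    /**
     * Detaches the block from every storage whose replica carries a generation
     * stamp other than genStamp. Stale replicas with a null storage are skipped
     * instead of being dereferenced.
     *
     * @return the number of stale replicas skipped because their storage was null
     */
    static int removeStaleReplicas(long blockId, long genStamp, List<Replica> replicas) {
        int skipped = 0;
        if (replicas == null) {
            return skipped; // nothing to verify
        }
        for (Replica r : replicas) {
            if (r == null || r.generationStamp == genStamp) {
                continue; // missing entry or up-to-date replica: leave untouched
            }
            if (r.storage == null) {
                skipped++; // the null guard: would otherwise throw an NPE here
                continue;
            }
            r.storage.blockIds.remove(blockId);
        }
        return skipped;
    }

    public static void main(String[] args) {
        Storage healthy = new Storage();
        healthy.blockIds.add(1L);
        List<Replica> replicas = List.of(
                new Replica(5L, healthy), // stale, storage present
                new Replica(5L, null),    // stale, storage missing
                new Replica(7L, null));   // current, never dereferenced
        int skipped = removeStaleReplicas(1L, 7L, replicas);
        System.out.println("stale replicas skipped due to null storage: " + skipped);
    }
}
```

Returning the skip count rather than silently ignoring nulls lets the caller
log a warning, so the inconsistent state stays visible while the edit log
tailer keeps running.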