Wei-Chiu Chuang created HDFS-10729:
--------------------------------------
Summary: NameNode crashes when loading edits because max directory items is exceeded
Key: HDFS-10729
URL: https://issues.apache.org/jira/browse/HDFS-10729
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.7.2
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Priority: Critical
We encountered a bug where the Standby NameNode crashes due to an NPE while loading edits:
{noformat}
2016-08-05 15:06:00,983 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddOp [length=0, inodeId=789272719, path=[path], replication=3, mtime=1470379597935, atime=1470379597935, blockSize=134217728, blocks=[], permissions=bdi:supergroup:rw-r--r--, aclEntries=null, clientName=DFSClient_NONMAPREDUCE_1495395702_1, clientMachine=10.210.119.136, overwrite=true, RpcClientId=a1512eeb-65e4-43dc-8aa8-d7a1af37ed30, RpcCallId=417, storagePolicyId=0, opCode=OP_ADD, txid=4212503758]
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFileEncryptionInfo(FSDirectory.java:2914)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.createFileStatus(FSDirectory.java:2469)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:375)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
{noformat}
The NameNode crashes and cannot be restarted. After some research, we turned on debug logging for {{org.apache.hadoop.hdfs.StateChange}} and restarted the NN.
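For reference, a minimal sketch of enabling this logger in {{log4j.properties}} (assuming the stock log4j setup; the exact file location varies by deployment):
{noformat}
# Enable DEBUG for the NameNode state-change logger
log4j.logger.org.apache.hadoop.hdfs.StateChange=DEBUG
{noformat}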
With the debug log enabled, we saw the following exception, which induced the NPE:
{noformat}
16/08/07 18:51:15 DEBUG hdfs.StateChange: DIR* FSDirectory.unprotectedAddFile: exception when add [path] to the file system
org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException: The directory item limit of [path] is exceeded: limit=1048576 items=1049332
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2060)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2112)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:2081)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1900)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:368)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:365)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:188)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:182)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:445)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:426)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.catchupDuringFailover(EditLogTailer.java:182)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1205)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1762)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1635)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1351)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
{noformat}
The exception is thrown, caught, and logged at debug level inside {{FSDirectory#unprotectedAddFile}}, which then returns null. The caller uses that null without checking it, which causes the NPE. HDFS-7567 reported a similar bug, but it was deemed that the NPE could only be thrown if the edit log is corrupt. Here, however, we see the NPE thrown without any corrupt edits. A simplified sketch of the failure pattern is shown below.
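To illustrate (this is a hypothetical, self-contained sketch with made-up names, not the actual branch-2.7 source), the failure boils down to an exception that is swallowed, logged only at DEBUG, and converted into a null that the caller then dereferences:
{code:java}
// Hypothetical sketch of the failure pattern; names are illustrative.
public class SwallowedLimitExceptionDemo {

  static class MaxDirectoryItemsExceededException extends Exception {
    MaxDirectoryItemsExceededException(String msg) { super(msg); }
  }

  /**
   * Mimics FSDirectory#unprotectedAddFile: the limit violation is caught,
   * logged at DEBUG only, and turned into a null return value.
   */
  static String addFile(String path, int items, int limit) {
    try {
      if (items >= limit) {
        throw new MaxDirectoryItemsExceededException(
            "The directory item limit of " + path
            + " is exceeded: limit=" + limit + " items=" + items);
      }
      return path; // stands in for the newly created INode
    } catch (MaxDirectoryItemsExceededException e) {
      // Invisible unless DEBUG logging is enabled -- the root cause is hidden.
      System.err.println("DEBUG: exception when add " + path
          + " to the file system: " + e.getMessage());
      return null;
    }
  }

  public static void main(String[] args) {
    String inode = addFile("/some/dir/file", 1049332, 1048576);
    // The caller assumes success and dereferences the result, producing an
    // NPE far from the real cause (cf. getFileEncryptionInfo in the first
    // stack trace above).
    System.out.println(inode.length());
  }
}
{code}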
The root cause is that the maximum number of items per directory ({{dfs.namenode.fs-limits.max-directory-items}}, 1048576 here) is exceeded. This is similar to HDFS-6102 and HDFS-7482, except that here the limit is hit while loading edits rather than while loading the fsimage. The size of the offending directory can be confirmed as shown below.
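As a sanity check (assuming the offending directory is known from the log), the number of children can be counted with the standard HDFS CLI:
{noformat}
# Recursive counts under the directory (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME)
hdfs dfs -count [path]
# Or count direct children (subtract 1 for the "Found N items" header line)
hdfs dfs -ls [path] | wc -l
{noformat}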
A possible workaround is to increase the value of {{dfs.namenode.fs-limits.max-directory-items}} to 6400000, the maximum value the NameNode accepts. But I am not sure whether it would cause any side effects.
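For illustration, a minimal sketch of the setting as it would appear in {{hdfs-site.xml}} on the NameNodes (a NameNode restart is needed for it to take effect):
{code:xml}
<!-- Raise the per-directory child limit (workaround sketch, not a vetted fix) -->
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>6400000</value>
</property>
{code}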