[
https://issues.apache.org/jira/browse/HDFS-11225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969629#comment-15969629
]
Manoj Govindassamy edited comment on HDFS-11225 at 4/14/17 10:35 PM:
---------------------------------------------------------------------
*Problem:*
* Unlike {{INodeDirectory}}, {{DirectoryWithSnapshotFeature}} doesn't maintain
its children in a plain list. Instead, there is a {{DirectoryDiffList}}, which
is a list of {{DirectoryDiff}} records.
* On every new snapshot, {{DirectoryWithSnapshotFeature}} appends a new entry
to its diff list, and all subsequent file creations/deletions are recorded
against the most recently taken snapshot.
* So, the snapshot diff list which {{DirectoryWithSnapshotFeature}} maintains
is really a delta of file creations and deletions since the last snapshot. This
is a deliberate design choice that keeps snapshot creation a constant-time
operation.
* The snapshot deletion operation needs to visit all child files of the
snapshot to reclaim their blocks, and
{{DirectoryWithSnapshotFeature#DirectoryDiff#getChildrenList()}} is invoked to
get that list.
* To get the children list for any snapshot {{Sx}} under a directory, all the
snapshot diff records taken after {{Sx}} are combined and applied in reverse to
the directory's current children list (see the sketch after this list).
* So, listing children under a snapshot {{Sx}} directory is on the order of
(#snapshots after {{Sx}} * #file diffs in each of those snapshots). With
thousands of snapshots and hundreds of thousands of files, this listing
operation alone can easily consume tens of seconds.
* Above all, all of these operations are done by a single thread, one
directory at a time, in a recursive fashion. In my testing, I have seen
snapshot deletion take 45+ seconds on a fairly unloaded NN.
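
To make the reconstruction concrete, here is a minimal sketch of the scheme
described above. The class and method names are simplified stand-ins for
illustration, not the actual HDFS types; in this model, each diff records the
changes made to the directory since its snapshot was taken:

{code:java}
import java.util.ArrayList;
import java.util.List;

class DirectoryDiffSketch {
  // Simplified stand-in: what was created and deleted since this
  // diff's snapshot was taken.
  static class DirectoryDiff {
    final List<String> created = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();
  }

  // Children as of snapshot index sx: start from the directory's *current*
  // children and undo every diff taken at or after sx, newest first.
  static List<String> childrenAt(List<String> currentChildren,
                                 List<DirectoryDiff> diffList, int sx) {
    List<String> children = new ArrayList<>(currentChildren);
    for (int i = diffList.size() - 1; i >= sx; i--) {
      DirectoryDiff diff = diffList.get(i);
      children.removeAll(diff.created); // undo creations made after sx
      children.addAll(diff.deleted);    // restore deletions made after sx
    }
    // Cost: O(#snapshots after sx * #entries per diff). With thousands of
    // snapshots over 100k+ files, a single listing can take tens of seconds.
    return children;
  }
}
{code}

Every directory visited by the recursive deletion pays this reconstruction
cost again for its own diff list, which is why deep trees with many snapshots
take so long.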
> NameNode crashed because deleteSnapshot held FSNamesystem lock too long
> -----------------------------------------------------------------------
>
> Key: HDFS-11225
> URL: https://issues.apache.org/jira/browse/HDFS-11225
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.4.0
> Environment: CDH5.8.2, HA
> Reporter: Wei-Chiu Chuang
> Assignee: Manoj Govindassamy
> Priority: Critical
> Labels: high-availability
>
> The deleteSnapshot operation is synchronous. In certain situations this
> operation may hold the FSNamesystem lock for too long, bringing almost every
> NameNode operation to a halt.
> We have observed one incident where it took so long that ZKFC believed the
> NameNode was down. All other IPC threads were waiting to acquire the
> FSNamesystem lock. This specific deleteSnapshot took ~70 seconds. ZKFC has a
> connection timeout of 45 seconds by default; if all IPC threads are waiting
> for the FSNamesystem lock and cannot accept new incoming connections, ZKFC
> times out, advances the epoch, and the NameNode therefore loses its active
> role and then fails.
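> The contention pattern is a single global lock. As a rough illustration
> (simplified names; this is not the actual FSNamesystem code), the whole
> namespace is guarded by one read/write lock, so one long deleteSnapshot
> under the write lock stalls every other RPC handler:
> {code:java}
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> class NamespaceLockSketch {
>   // Stand-in for the single namespace-wide read/write lock.
>   private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
>
>   void deleteSnapshot() {
>     fsLock.writeLock().lock(); // exclusive: every other handler now queues
>     try {
>       // ~70 seconds of recursive subtree cleanup happened here in this
>       // incident -- 70 seconds during which no other RPC made progress.
>     } finally {
>       fsLock.writeLock().unlock();
>     }
>   }
>
>   void anyOtherRpc() {
>     fsLock.readLock().lock(); // blocks behind the writer above
>     try { /* reads the namespace */ } finally { fsLock.readLock().unlock(); }
>   }
> }
> {code}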
> Relevant log:
> {noformat}
> Thread 154 (IPC Server handler 86 on 8020):
> State: RUNNABLE
> Blocked count: 2753455
> Waited count: 89201773
> Stack:
> org.apache.hadoop.hdfs.server.namenode.INode$BlocksMapUpdateInfo.addDeleteBlock(INode.java:879)
> org.apache.hadoop.hdfs.server.namenode.INodeFile.destroyAndCollectBlocks(INodeFile.java:508)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeReference.destroyAndCollectBlocks(INodeReference.java:339)
> org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.destroyAndCollectBlocks(INodeReference.java:606)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyDeletedList(DirectoryWithSnapshotFeature.java:119)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$400(DirectoryWithSnapshotFeature.java:61)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:319)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:167)
> org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:83)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:745)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:747)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:789)
> {noformat}
> After ZKFC determined the NameNode was down and advanced the epoch, the NN
> finished deleting the snapshot and sent the edit to the JournalNodes, but the
> edit was rejected because the epoch had been updated. See the following
> stack trace:
> {noformat}
> 10.0.16.21:8485: IPC's epoch 17 is less than the last promised epoch 18
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
> at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
> at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
> at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
> at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:8507)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1469)
> at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.deleteSnapshot(AuthorizationProviderProxyClientProtocol.java:717)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1061)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {noformat}
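> The fencing rule visible at the top of that trace ({{Journal.checkRequest}})
> can be sketched as follows. This is a simplified illustration, not the actual
> JournalNode code: each writer carries an epoch, and a JournalNode rejects any
> request whose epoch is below the highest epoch it has promised, so once the
> epoch advanced to 18 the old active NN's epoch-17 edits were doomed
> regardless of when they arrived:
> {code:java}
> class JournalFencingSketch {
>   private long lastPromisedEpoch = 18; // advanced when ZKFC failed over
>
>   // Refuse writes from any writer whose epoch is stale.
>   void checkRequest(long ipcEpoch) {
>     if (ipcEpoch < lastPromisedEpoch) {
>       throw new IllegalStateException("IPC's epoch " + ipcEpoch
>           + " is less than the last promised epoch " + lastPromisedEpoch);
>     }
>   }
> }
> {code}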
> Finally, the NameNode shut itself down because it had too many quorum errors.
> Setting priority to Critical because this resulted in a NameNode crash.
> We think deleteSnapshot should be made asynchronous: delete the root of the
> snapshot directory under the lock, and then hand the rest of the work off to
> an asynchronous thread. Credit: [~yzhangal] for proposing this idea.
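> A minimal sketch of that proposed split follows. All names here are
> hypothetical (this is not an existing HDFS API), and real code would also
> have to keep the detached subtree consistent with edit-log semantics:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> class AsyncSnapshotDeleteSketch {
>   private final ExecutorService cleaner = Executors.newSingleThreadExecutor();
>   private final Object fsLock = new Object(); // stand-in for the FSNamesystem lock
>
>   void deleteSnapshot(Object snapshotRoot) {
>     synchronized (fsLock) {
>       // Cheap step under the lock: unlink the snapshot root so the
>       // namespace no longer reaches the deleted snapshot.
>       detach(snapshotRoot);
>     }
>     // Expensive step off the lock: walk the detached subtree and reclaim
>     // blocks on a background thread, re-taking the lock briefly per batch.
>     cleaner.submit(() -> reclaimBlocks(snapshotRoot));
>   }
>
>   private void detach(Object root) { /* unlink from parent */ }
>   private void reclaimBlocks(Object root) { /* batched block reclamation */ }
> }
> {code}
> The point of the batching is that each lock acquisition stays short, so ZKFC
> health checks and other RPCs can interleave with the cleanup instead of
> queuing behind one multi-second write-lock hold.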