[
https://issues.apache.org/jira/browse/HDFS-11225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969629#comment-15969629
]
Manoj Govindassamy edited comment on HDFS-11225 at 4/14/17 10:35 PM:
---------------------------------------------------------------------
*Problem:*
* Unlike {{INodeDirectory}}, {{DirectoryWithSnapshotFeature}} doesn't maintain
its children in a plain list. Instead, there is a {{DirectoryDiffList}}, which
is a list of {{DirectoryDiff}} records.
* On every new snapshot, {{DirectoryWithSnapshotFeature}} appends a new entry
to its diff list, and all subsequent file creations/deletions are recorded
against the most recently taken snapshot.
* So, the snapshot diff list which {{DirectoryWithSnapshotFeature}} maintains
is really a delta of file creations and deletions since the last snapshot. This
is a deliberate design choice that keeps snapshot creation a constant-time
operation.
* The snapshot deletion operation needs to visit all child files of the
snapshot to reclaim their blocks, and
{{DirectoryWithSnapshotFeature#DirectoryDiff#getChildrenList()}} is invoked to
get that list.
* To get the children list for any snapshot {{Sx}} under a directory, all the
snapshot diff records taken after {{Sx}} are combined and applied in reverse to
the directory's current children list (see the sketch after this list).
* So, listing children under a snapshot {{Sx}} directory is on the order of
(#snapshots after {{Sx}} * #file diffs in each of those snapshots). With
thousands of snapshots and hundreds of thousands of files, this listing
operation alone can easily consume tens of seconds.
* Above all, all of these operations are done by a single thread, one
directory at a time, in a recursive fashion. In my testing, I have seen
snapshot deletion take 45+ seconds on a fairly unloaded NN.
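
To make the reconstruction concrete, here is a minimal sketch of the scheme
described above. The class and method names are simplified stand-ins for
illustration, not the actual HDFS types; in this model, each diff records the
changes made to the directory since its snapshot was taken:

{code:java}
import java.util.ArrayList;
import java.util.List;

class DirectoryDiffSketch {
  // Simplified stand-in: what was created and deleted since this
  // diff's snapshot was taken.
  static class DirectoryDiff {
    final List<String> created = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();
  }

  // Children as of snapshot index sx: start from the directory's *current*
  // children and undo every diff taken at or after sx, newest first.
  static List<String> childrenAt(List<String> currentChildren,
                                 List<DirectoryDiff> diffList, int sx) {
    List<String> children = new ArrayList<>(currentChildren);
    for (int i = diffList.size() - 1; i >= sx; i--) {
      DirectoryDiff diff = diffList.get(i);
      children.removeAll(diff.created); // undo creations made after sx
      children.addAll(diff.deleted);    // restore deletions made after sx
    }
    // Cost: O(#snapshots after sx * #entries per diff). With thousands of
    // snapshots over 100k+ files, a single listing can take tens of seconds.
    return children;
  }
}
{code}

Every directory visited by the recursive deletion pays this reconstruction
cost again for its own diff list, which is why deep trees with many snapshots
take so long.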
> NameNode crashed because deleteSnapshot held FSNamesystem lock too long
> -----------------------------------------------------------------------
>
> Key: HDFS-11225
> URL: https://issues.apache.org/jira/browse/HDFS-11225
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.4.0
> Environment: CDH5.8.2, HA
> Reporter: Wei-Chiu Chuang
> Assignee: Manoj Govindassamy
> Priority: Critical
> Labels: high-availability
>
> The deleteSnapshot operation is synchronous. In certain situations this
> operation may hold the FSNamesystem lock for too long, bringing almost every
> NameNode operation to a halt.
> We have observed one incident where it took so long that ZKFC believed the
> NameNode was down. All other IPC threads were waiting to acquire the
> FSNamesystem lock. This specific deleteSnapshot took ~70 seconds. ZKFC has a
> connection timeout of 45 seconds by default; if all IPC threads are waiting
> for the FSNamesystem lock and cannot accept new incoming connections, ZKFC
> times out, advances the epoch, and the NameNode therefore loses its active
> role and then fails.
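> The contention pattern is a single global lock. As a rough illustration
> (simplified names; this is not the actual FSNamesystem code), the whole
> namespace is guarded by one read/write lock, so one long deleteSnapshot
> under the write lock stalls every other RPC handler:
> {code:java}
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> class NamespaceLockSketch {
>   // Stand-in for the single namespace-wide read/write lock.
>   private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
>
>   void deleteSnapshot() {
>     fsLock.writeLock().lock(); // exclusive: every other handler now queues
>     try {
>       // ~70 seconds of recursive subtree cleanup happened here in this
>       // incident -- 70 seconds during which no other RPC made progress.
>     } finally {
>       fsLock.writeLock().unlock();
>     }
>   }
>
>   void anyOtherRpc() {
>     fsLock.readLock().lock(); // blocks behind the writer above
>     try { /* reads the namespace */ } finally { fsLock.readLock().unlock(); }
>   }
> }
> {code}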
> Relevant log:
> {noformat}
> Thread 154 (IPC Server handler 86 on 8020):
> State: RUNNABLE
> Blocked count: 2753455
> Waited count: 89201773
> Stack:
> org.apache.hadoop.hdfs.server.namenode.INode$BlocksMapUpdateInfo.addDeleteBlock(INode.java:879)
> org.apache.hadoop.hdfs.server.namenode.INodeFile.destroyAndCollectBlocks(INodeFile.java:508)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
> org.apache.hadoop.hdfs.server.namenode.INodeReference.destroyAndCollectBlocks(INodeReference.java:339)
> org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.destroyAndCollectBlocks(INodeReference.java:606)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyDeletedList(DirectoryWithSnapshotFeature.java:119)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$400(DirectoryWithSnapshotFeature.java:61)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:319)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:167)
> org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:83)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:745)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
> org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:747)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:789)
> {noformat}
> After ZKFC determined the NameNode was down and advanced the epoch, the NN
> finished deleting the snapshot and sent the edit to the JournalNodes, but the
> edit was rejected because the epoch had been updated. See the following
> stack trace:
> {noformat}
> 10.0.16.21:8485: IPC's epoch 17 is less than the last promised epoch 18
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
> at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
> at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
> at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
> at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
> at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:8507)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1469)
> at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.deleteSnapshot(AuthorizationProviderProxyClientProtocol.java:717)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1061)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
> {noformat}
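> The fencing rule visible at the top of that trace ({{Journal.checkRequest}})
> can be sketched as follows. This is a simplified illustration, not the actual
> JournalNode code: each writer carries an epoch, and a JournalNode rejects any
> request whose epoch is below the highest epoch it has promised, so once the
> epoch advanced to 18 the old active NN's epoch-17 edits were doomed
> regardless of when they arrived:
> {code:java}
> class JournalFencingSketch {
>   private long lastPromisedEpoch = 18; // advanced when ZKFC failed over
>
>   // Refuse writes from any writer whose epoch is stale.
>   void checkRequest(long ipcEpoch) {
>     if (ipcEpoch < lastPromisedEpoch) {
>       throw new IllegalStateException("IPC's epoch " + ipcEpoch
>           + " is less than the last promised epoch " + lastPromisedEpoch);
>     }
>   }
> }
> {code}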
> Finally, the NameNode shut itself down because it had too many quorum errors.
> Setting priority to Critical because this resulted in a NameNode crash.
> We think deleteSnapshot should be made asynchronous: delete the root of the
> snapshot directory under the lock, and then hand the rest of the work off to
> an asynchronous thread. Credit: [~yzhangal] for proposing this idea.
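> A minimal sketch of that proposed split follows. All names here are
> hypothetical (this is not an existing HDFS API), and real code would also
> have to keep the detached subtree consistent with edit-log semantics:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> class AsyncSnapshotDeleteSketch {
>   private final ExecutorService cleaner = Executors.newSingleThreadExecutor();
>   private final Object fsLock = new Object(); // stand-in for the FSNamesystem lock
>
>   void deleteSnapshot(Object snapshotRoot) {
>     synchronized (fsLock) {
>       // Cheap step under the lock: unlink the snapshot root so the
>       // namespace no longer reaches the deleted snapshot.
>       detach(snapshotRoot);
>     }
>     // Expensive step off the lock: walk the detached subtree and reclaim
>     // blocks on a background thread, re-taking the lock briefly per batch.
>     cleaner.submit(() -> reclaimBlocks(snapshotRoot));
>   }
>
>   private void detach(Object root) { /* unlink from parent */ }
>   private void reclaimBlocks(Object root) { /* batched block reclamation */ }
> }
> {code}
> The point of the batching is that each lock acquisition stays short, so ZKFC
> health checks and other RPCs can interleave with the cleanup instead of
> queuing behind one multi-second write-lock hold.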