[
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405373#comment-16405373
]
Arpit Agarwal edited comment on HDFS-13314 at 3/19/18 8:30 PM:
---------------------------------------------------------------
We've seen two FsImage corruption symptoms correlated with heavy usage of HDFS
snapshots.
# Dangling INodeReferences (likely the same as HDFS-13101)
# Duplicate entries in snapshot diff list (this may have been caused by
attempting to work around #1)
This usually occurs when someone has dozens of snapshots on a large directory,
e.g. {{/}} or {{/apps/hive/warehouse}}. We have not yet been able to reproduce
the problem with load testing.
The corruption is detected the next time a NameNode is restarted, which may be
weeks after it occurred. Since both problems can be trivially detected while
writing the FsImage, this patch proposes that the NameNode self-terminate after
writing a bad image.
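
As a rough illustration of why both symptoms are cheap to detect at image-writing time, here is a minimal Java sketch. It is not the code in the attached patch: the class name, the method names, and the use of plain inode ids in place of the real INode and diff-list structures are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of HDFS. Inode ids stand in for the real
// INode, INodeReference, and snapshot diff structures.
class FsImageSanityChecks {

    // Symptom 1: a reference whose target inode id was never written
    // to the image is dangling.
    static List<Long> findDanglingReferences(Set<Long> writtenInodeIds,
                                             List<Long> referencedInodeIds) {
        List<Long> dangling = new ArrayList<>();
        for (Long id : referencedInodeIds) {
            if (!writtenInodeIds.contains(id)) {
                dangling.add(id);
            }
        }
        return dangling;
    }

    // Symptom 2: an inode id that appears more than once in a single
    // deleted-diff list is a duplicate entry.
    static Set<Long> findDuplicateDiffEntries(List<Long> deletedDiffList) {
        Set<Long> seen = new HashSet<>();
        Set<Long> duplicates = new LinkedHashSet<>();
        for (Long id : deletedDiffList) {
            // Set.add returns false when the id was already present.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }
}
```

Both checks are a single pass over data the image writer already iterates, so running them on every FsImage write should add little overhead under these assumptions.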
> NameNode should optionally exit if it detects FsImage corruption
> ----------------------------------------------------------------
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Priority: Major
> Attachments: HDFS-13314.01.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects
> either of the following kinds of corruption:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and
> disabled by default.
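
The opt-in behavior described in the issue summary could be sketched as follows. The issue does not name the configuration setting, so the key below is invented for this example. `java.util.Properties` stands in for Hadoop's configuration class to keep the sketch self-contained.

```java
import java.util.Properties;

// Hypothetical sketch of a default-off "exit on corrupt image" switch.
class CorruptionExitPolicy {

    // Invented key for illustration only; the real setting is undocumented
    // and its name is not given in the issue.
    static final String EXIT_ON_CORRUPTION_KEY =
        "example.namenode.exit-on-corrupt-fsimage";

    // Returns true only when the switch is explicitly enabled and at least
    // one corruption was found while writing the image.
    static boolean shouldExit(Properties conf, int corruptionCount) {
        boolean enabled = Boolean.parseBoolean(
            conf.getProperty(EXIT_ON_CORRUPTION_KEY, "false"));
        return enabled && corruptionCount > 0;
    }
}
```

With the default left in place, a NameNode following this sketch would keep running even after writing a bad image, which matches the "disabled by default" statement above.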