[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405373#comment-16405373
 ] 

Arpit Agarwal edited comment on HDFS-13314 at 3/19/18 8:30 PM:
---------------------------------------------------------------

We've seen two FsImage corruption symptoms correlated with heavy usage of HDFS 
snapshots. 

# Dangling INodeReferences (likely the same as HDFS-13101)
# Duplicate entries in snapshot diff list (this may have been caused by 
attempting to work around #1)

This usually occurs when someone has dozens of snapshots on a large directory, 
e.g. {{/}} or {{/apps/hive/warehouse}}. We have not yet been able to reproduce 
the problem with load testing.

The corruption is detected the next time a NameNode is restarted, which may be 
weeks after it occurred. Since both problems can be trivially detected while 
writing the FsImage, this patch proposes that the NameNode self-terminate after 
writing a bad image.
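
To illustrate why both checks are cheap, here is a minimal sketch on a simplified model in which inodes, references and diff-list entries are reduced to numeric ids. It is not the code in the attached patch. The class {{FsImageSanityCheck}}, its method names and the {{exitOnCorruption}} flag are hypothetical and do not correspond to actual NameNode classes or configuration keys.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of the two checks; not the HDFS-13314 patch. */
final class FsImageSanityCheck {

    /** Counts references whose target id is missing from the set of known inode ids. */
    static int countDanglingReferences(Set<Long> inodeIds, List<Long> referredIds) {
        int dangling = 0;
        for (long target : referredIds) {
            if (!inodeIds.contains(target)) {
                dangling++;
            }
        }
        return dangling;
    }

    /** Counts entries that repeat an id already seen in one diff list. */
    static int countDuplicateDiffEntries(List<Long> diffList) {
        Set<Long> seen = new HashSet<>();
        int duplicates = 0;
        for (long id : diffList) {
            if (!seen.add(id)) { // add() returns false if the id was already present
                duplicates++;
            }
        }
        return duplicates;
    }

    /**
     * Decides whether to terminate after the image has been written.
     * The flag stands in for the opt-in setting and would default to false.
     */
    static boolean shouldTerminate(boolean exitOnCorruption, int dangling, int duplicates) {
        return exitOnCorruption && (dangling > 0 || duplicates > 0);
    }

    public static void main(String[] args) {
        Set<Long> inodeIds = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        int dangling = countDanglingReferences(inodeIds, Arrays.asList(2L, 7L));
        int duplicates = countDuplicateDiffEntries(Arrays.asList(4L, 5L, 4L));
        System.out.println("dangling=" + dangling + " duplicates=" + duplicates
                + " terminate=" + shouldTerminate(true, dangling, duplicates));
    }
}
```

Each check is a single pass over data the image writer already visits, so under these assumptions the added cost would be linear in the number of references and diff entries.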



> NameNode should optionally exit if it detects FsImage corruption
> ----------------------------------------------------------------
>
>                 Key: HDFS-13314
>                 URL: https://issues.apache.org/jira/browse/HDFS-13314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>            Priority: Major
>         Attachments: HDFS-13314.01.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> either of the following kinds of corruption:
> # An INodeReference pointing to a non-existent INode.
> # Duplicate entries in a snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.


