[jira] [Commented] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

Rushabh S Shah (JIRA) Mon, 19 Mar 2018 14:53:02 -0700

    [ https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405506#comment-16405506 ]


Rushabh S Shah commented on HDFS-13314:
---------------------------------------

bq. In the cases we ran into, the corrupted image was loadable after bypassing 
some checks during NameNode startup. 
Did you have to change the NameNode code and rebuild it to bypass those 
checks, or are you referring to a configuration setting?

{quote}
The corruption was detected the next time a NameNode is restarted which may be 
weeks after it occurred.
 The default value of dfs.namenode.num.checkpoints.retained is 2, so the older 
image is not lost.
The purge step is skipped if a bad image was written.
{quote}
As you mentioned, it can take a few weeks to detect that a bad image was 
written.
We have a very high churn of write ops, so we checkpoint at least every 12 
hours, and each image is about 25 GB.
On top of that, if old images are not purged once a corrupt image has been 
written, we will run out of disk space in 3-4 weeks.
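
A rough sketch of the arithmetic behind that estimate, assuming exactly one 
25 GB image every 12 hours and no purging at all. The function name is 
hypothetical, edit logs and other files on the volume are ignored, and the 
actual capacity of the disk is not stated here.

```python
# Back-of-the-envelope estimate of fsimage disk growth when purging is skipped.
# Figures taken from the comment: ~25 GB per image, one checkpoint every 12 hours.
# Assumption: every checkpoint is retained; nothing else on the volume is counted.

IMAGE_SIZE_GB = 25
CHECKPOINT_INTERVAL_HOURS = 12


def retained_image_gb(days: int) -> int:
    """Total GB of fsimage files accumulated over `days` with no purging."""
    checkpoints_per_day = 24 // CHECKPOINT_INTERVAL_HOURS
    return days * checkpoints_per_day * IMAGE_SIZE_GB


for weeks in (3, 4):
    print(f"{weeks} weeks: {retained_image_gb(weeks * 7)} GB")
```

Under these assumptions the retained images alone grow by 50 GB per day, 
which is roughly 1 to 1.4 TB over 3-4 weeks. Since we checkpoint at least 
every 12 hours, this is a lower bound.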

IMO, instead of putting a hack in the NameNode, we should actively chase these 
bugs and find their root cause.
Also, I still think the default behavior should be *to exit*.

> NameNode should optionally exit if it detects FsImage corruption
> ----------------------------------------------------------------
>
>                 Key: HDFS-13314
>                 URL: https://issues.apache.org/jira/browse/HDFS-13314
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>            Priority: Major
>         Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.
> This behavior is controlled via an undocumented configuration setting, and 
> disabled by default.



