[ 
https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6353:
----------------------------
    Attachment: HDFS-6353.000.patch

Had an offline discussion with [~sureshms]. One idea is to add a 
"saveNamespace" call in the hdfs script before stopping the NameNode. The 
NameNode then does an extra checkpoint if and only if no checkpoint has been 
done during the past several checkpoint periods.

Uploaded an initial patch to demo the idea. The patch adds an extra argument 
to the saveNamespace RPC call that specifies the time window for the extra 
checkpoint. An extra option {{-beforeShutdown}} is added to 
{{dfsadmin -saveNamespace}} to trigger this functionality.
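
As a rough illustration of the check described above (this is not code from 
the patch; the class, method, and parameter names are hypothetical), the 
"checkpoint only if none was done within the time window" decision could look 
something like the following, assuming the window is expressed as a number of 
checkpoint periods:

```java
/**
 * Hypothetical sketch of the "extra checkpoint before shutdown" decision.
 * None of these names are taken from HDFS-6353.000.patch.
 */
final class ShutdownCheckpointPolicy {
    private final long checkpointPeriodMs; // length of one checkpoint period
    private final int periodsInWindow;     // how many periods make up the window

    ShutdownCheckpointPolicy(long checkpointPeriodMs, int periodsInWindow) {
        if (checkpointPeriodMs <= 0 || periodsInWindow <= 0) {
            throw new IllegalArgumentException("period and window must be positive");
        }
        this.checkpointPeriodMs = checkpointPeriodMs;
        this.periodsInWindow = periodsInWindow;
    }

    /** Time window, in milliseconds, that would be passed along with the request. */
    long timeWindowMs() {
        return Math.multiplyExact(checkpointPeriodMs, (long) periodsInWindow);
    }

    /**
     * Returns true only when no checkpoint has completed within the window,
     * i.e. when the extra checkpoint before shutdown is actually needed.
     */
    boolean needsExtraCheckpoint(long lastCheckpointTimeMs, long nowMs) {
        return nowMs - lastCheckpointTimeMs >= timeWindowMs();
    }
}
```

With a one-hour period and a window of three periods (illustrative values 
only), a NameNode whose last checkpoint finished two hours ago would skip the 
extra checkpoint, while one whose last checkpoint finished four hours ago 
would save the namespace before stopping.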

> Handle checkpoint failure more gracefully
> -----------------------------------------
>
>                 Key: HDFS-6353
>                 URL: https://issues.apache.org/jira/browse/HDFS-6353
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Suresh Srinivas
>            Assignee: Jing Zhao
>         Attachments: HDFS-6353.000.patch
>
>
> One of the failure patterns I have seen is that, in some rare circumstances, 
> the secondary or standby fails to consume the editlog due to some 
> inconsistency. The only solution when this happens is to save the namespace 
> at the current active namenode. But sometimes when this happens, an 
> unsuspecting admin might end up restarting the namenode, which requires a 
> more complicated solution to the problem (such as ignoring the editlog 
> records that cannot be consumed).
> How about adding the following functionality:
> When the checkpointer (standby or secondary) fails to consume the editlog, 
> it lets the active namenode know about this failure, based on a configurable 
> flag (on/off). The active namenode can then enter safemode and save the 
> namespace. While in this type of safemode, the namenode UI also shows 
> information about the checkpoint failure and that it is saving the 
> namespace. Once the namespace is saved, the namenode can come out of 
> safemode.
> This means service unavailability (even in an HA cluster). But it might be 
> worth it to avoid long startup times or the need for other manual fixes. 
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
