[ 
https://issues.apache.org/jira/browse/HDFS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729570#comment-13729570
 ] 

Kihwal Lee commented on HDFS-5060:
----------------------------------

This will be a fine feature, as long as it can be turned off for the people who 
have their own monitoring set up. Users would probably prefer receiving alerts, 
so they can fix the issue without any service disruption. Short of a standard 
monitoring and alert system for Hadoop, we could at least display a warning on 
the web UI if no checkpoint was completed during the last checkpoint period, 
regardless of the automatic saveNamespace setting.
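
As a rough illustration of the warning condition described above, here is a minimal sketch in Java. The class and method names are hypothetical and are not taken from the NameNode code; the check only compares the age of the last checkpoint against the configured checkpoint period.

```java
// Hypothetical helper, not part of HDFS: decides whether the web UI
// should show a "no recent checkpoint" warning.
final class CheckpointWarning {
    private CheckpointWarning() {}

    /**
     * @param lastCheckpointMillis wall-clock time of the last completed checkpoint
     * @param nowMillis            current wall-clock time
     * @param periodSeconds        configured checkpoint period, in seconds
     * @return true if more than one full period has passed without a checkpoint
     */
    static boolean isCheckpointOverdue(long lastCheckpointMillis,
                                       long nowMillis,
                                       long periodSeconds) {
        long elapsedMillis = nowMillis - lastCheckpointMillis;
        return elapsedMillis > periodSeconds * 1000L;
    }
}
```

The check is independent of any automatic saveNamespace switch, so the warning would still appear for users who turn that feature off.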
                
> NN should proactively perform a saveNamespace if it has a huge number of 
> outstanding uncheckpointed transactions
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5060
>                 URL: https://issues.apache.org/jira/browse/HDFS-5060
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.1.0-beta
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>
> In a properly-functioning HDFS system, checkpoints will be triggered either 
> by the secondary NN or standby NN regularly, by default every hour or 1MM 
> outstanding edits transactions, whichever comes first. However, in cases where 
> this second node is down for an extended period of time, the number of 
> outstanding transactions can grow so large as to cause a restart to take an 
> inordinately long time.
> This JIRA proposes to make the active NN monitor its number of outstanding 
> transactions and perform a proactive local saveNamespace if it grows beyond a 
> configurable threshold. I'm envisioning something like 10x the configured 
> number of transactions which in a properly-functioning cluster would result 
> in a checkpoint from the second NN. Though this would be disruptive to 
> clients while it's taking place, likely for a few minutes, this seems better 
> than the alternative of a subsequent multi-hour restart and should never 
> actually occur in a properly-functioning cluster.
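
The threshold check proposed in the description could look roughly like the following Java sketch. All names here are hypothetical and this is not the NameNode implementation; it assumes the threshold is the configured checkpoint transaction count times a multiplier (10 in the description), and that a multiplier of 0 disables the feature for people who have their own monitoring.

```java
// Hypothetical policy object, not part of HDFS.
final class ProactiveSaveNamespacePolicy {
    private final long thresholdTxns; // outstanding-transaction limit
    private final boolean enabled;

    /**
     * @param checkpointTxns transaction count that normally triggers a
     *                       checkpoint (1MM by default per the description)
     * @param multiplier     factor applied to checkpointTxns; 0 or less
     *                       turns the proactive save off (an assumption)
     */
    ProactiveSaveNamespacePolicy(long checkpointTxns, int multiplier) {
        this.enabled = multiplier > 0;
        // multiplyExact throws on overflow instead of wrapping silently
        this.thresholdTxns = enabled
                ? Math.multiplyExact(checkpointTxns, (long) multiplier)
                : Long.MAX_VALUE;
    }

    /** True if the uncheckpointed transaction count has grown beyond the threshold. */
    boolean shouldSaveNamespace(long lastWrittenTxId, long lastCheckpointTxId) {
        if (!enabled) {
            return false;
        }
        long outstanding = lastWrittenTxId - lastCheckpointTxId;
        return outstanding > thresholdTxns;
    }
}
```

With the values from the description (1MM transactions, 10x), the check would first return true at 10,000,001 outstanding transactions.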

