[ 
https://issues.apache.org/jira/browse/HDFS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729660#comment-13729660
 ] 

Suresh Srinivas edited comment on HDFS-5060 at 8/5/13 5:16 PM:
---------------------------------------------------------------

Adding to what Kihwal said, this should be turned off by default. I think 
disrupting a running service is a big problem with the proposed approach. How 
often have you seen this issue, such that it warrants a change like this? Why 
can't bringing up a secondary/standby be the solution?

The issue that I have seen (quite infrequently, though) is the secondary not 
being able to checkpoint due to editlog corruption. I created HDFS-4923 for 
this: if an operator forgets to manually save the namespace, the system could 
save the namespace automatically at shutdown time. This solves several of the 
issues mentioned in this jira.

                
      was (Author: sureshms):
    Adding to what Kihwal said, this should be turned off by default. I think 
disrupting a running service is a big problem with the proposed approach. How 
often have you seen this issue that warrants a change like this? Why cannot 
bringing up a secondary/standby a solution?

The issue that I have seen (quite infrequently though) is, secondary not being 
able to checkpoint due to editlog corruption. I created HDFS-4923 where, if an 
operator forgets to save the namespace, during shutdown time the system could 
save the namespace automatically. This solves several issues mentioned in the 
jria.

                  
> NN should proactively perform a saveNamespace if it has a huge number of 
> outstanding uncheckpointed transactions
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5060
>                 URL: https://issues.apache.org/jira/browse/HDFS-5060
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.1.0-beta
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>
> In a properly-functioning HDFS system, checkpoints will be triggered either 
> by the secondary NN or standby NN regularly, by default every hour or every 
> 1 million outstanding edit transactions, whichever comes first. However, in cases where 
> this second node is down for an extended period of time, the number of 
> outstanding transactions can grow so large as to cause a restart to take an 
> inordinately long time.
> This JIRA proposes to make the active NN monitor its number of outstanding 
> transactions and perform a proactive local saveNamespace if it grows beyond a 
> configurable threshold. I'm envisioning something like 10x the configured 
> number of transactions which in a properly-functioning cluster would result 
> in a checkpoint from the second NN. Though this would be disruptive to 
> clients while it's taking place, likely for a few minutes, this seems better 
> than the alternative of a subsequent multi-hour restart and should never 
> actually occur in a properly-functioning cluster.
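
The threshold check proposed in the description can be sketched roughly as 
follows. This is only an illustration under stated assumptions, not the patch 
attached to this issue: the class ProactiveCheckpointSketch, the 
NamespaceSaver interface, and the multiplier parameter are all hypothetical 
names. A multiplier of 0 disables the check, which is one way to keep the 
feature off by default as requested in the comment above.

```java
/**
 * Hypothetical sketch of the proactive saveNamespace check described in
 * HDFS-5060. None of these names come from the Hadoop code base.
 */
final class ProactiveCheckpointSketch {

    /** Hypothetical stand-in for the operation that writes a new fsimage. */
    interface NamespaceSaver {
        void saveNamespace();
    }

    private final long checkpointTxns;   // txns that normally trigger a checkpoint
    private final long multiplier;       // 0 disables the proactive save
    private final NamespaceSaver saver;
    private long lastCheckpointTxId;     // txid covered by the latest checkpoint

    ProactiveCheckpointSketch(long checkpointTxns, long multiplier, NamespaceSaver saver) {
        if (checkpointTxns <= 0 || multiplier < 0) {
            throw new IllegalArgumentException("invalid threshold settings");
        }
        this.checkpointTxns = checkpointTxns;
        this.multiplier = multiplier;
        this.saver = saver;
    }

    /** Number of outstanding transactions above which a local save is forced. */
    long threshold() {
        return checkpointTxns * multiplier;
    }

    /** Called when the secondary/standby NN completes a normal checkpoint. */
    synchronized void recordCheckpoint(long txId) {
        lastCheckpointTxId = txId;
    }

    /**
     * Compares the outstanding transaction count against the threshold and
     * saves the namespace locally if it is exceeded.
     *
     * @return true if a save was performed
     */
    synchronized boolean check(long currentTxId) {
        long outstanding = currentTxId - lastCheckpointTxId;
        if (multiplier == 0 || outstanding <= threshold()) {
            return false;
        }
        saver.saveNamespace();           // disruptive: clients wait while this runs
        lastCheckpointTxId = currentTxId;
        return true;
    }
}
```

With checkpointTxns set to 1,000,000 and a multiplier of 10, check() does 
nothing until more than 10,000,000 transactions are outstanding, so it would 
not fire in a cluster where the second NN is checkpointing normally.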

