[
https://issues.apache.org/jira/browse/HDFS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729660#comment-13729660
]
Suresh Srinivas edited comment on HDFS-5060 at 8/5/13 5:16 PM:
---------------------------------------------------------------
Adding to what Kihwal said, this should be turned off by default. I think
disrupting a running service is a big problem with the proposed approach. How
often have you seen this issue, such that it warrants a change like this? Why
can't bringing up a secondary/standby be the solution?
The issue I have seen (though quite infrequently) is the secondary being unable
to checkpoint due to editlog corruption. I created HDFS-4923 for this: if an
operator forgets to manually save the namespace, the system could save it
automatically at shutdown. This solves several of the issues mentioned in the
jira.
> NN should proactively perform a saveNamespace if it has a huge number of
> outstanding uncheckpointed transactions
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-5060
> URL: https://issues.apache.org/jira/browse/HDFS-5060
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.1.0-beta
> Reporter: Aaron T. Myers
> Assignee: Aaron T. Myers
>
> In a properly-functioning HDFS system, checkpoints will be triggered either
> by the secondary NN or standby NN regularly, by default every hour or 1MM
> outstanding edit transactions, whichever comes first. However, in cases where
> this second node is down for an extended period of time, the number of
> outstanding transactions can grow so large as to cause a restart to take an
> inordinately long time.
> This JIRA proposes to make the active NN monitor its number of outstanding
> transactions and perform a proactive local saveNamespace if it grows beyond a
> configurable threshold. I'm envisioning something like 10x the configured
> number of transactions which in a properly-functioning cluster would result
> in a checkpoint from the second NN. Though this would be disruptive to
> clients while it's taking place, likely for a few minutes, this seems better
> than the alternative of a subsequent multi-hour restart and should never
> actually occur in a properly-functioning cluster.
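
The trigger condition described above can be sketched roughly as follows. This
is only an illustration of the proposal, not NameNode code: the class
ProactiveCheckpointPolicy, its fields, and its methods are all hypothetical
names. It assumes the threshold is the configured checkpoint transaction count
(1MM by default, per the description) times a multiplier of about 10, and that
the feature is off unless explicitly enabled, as requested in the comment.

```java
// Hypothetical sketch of the proposed check; none of these names exist in HDFS.
final class ProactiveCheckpointPolicy {
    // Default checkpoint trigger from the description: 1MM outstanding transactions.
    static final long DEFAULT_CHECKPOINT_TXNS = 1_000_000L;
    // "Something like 10x" per the proposal; assumed to be configurable.
    static final long DEFAULT_MULTIPLIER = 10L;

    private final boolean enabled;      // assumed to default to false (off by default)
    private final long checkpointTxns;  // txn count at which the second NN would checkpoint
    private final long multiplier;      // how far past that count the active NN waits

    ProactiveCheckpointPolicy(boolean enabled, long checkpointTxns, long multiplier) {
        if (checkpointTxns <= 0 || multiplier <= 0) {
            throw new IllegalArgumentException("checkpointTxns and multiplier must be positive");
        }
        this.enabled = enabled;
        this.checkpointTxns = checkpointTxns;
        this.multiplier = multiplier;
    }

    // Number of outstanding transactions beyond which a local save is triggered.
    long threshold() {
        return Math.multiplyExact(checkpointTxns, multiplier);
    }

    // True only if the feature is enabled and the number of uncheckpointed
    // transactions has grown strictly beyond the threshold.
    boolean shouldSaveNamespace(long lastWrittenTxId, long lastCheckpointTxId) {
        if (!enabled) {
            return false;
        }
        long outstanding = lastWrittenTxId - lastCheckpointTxId;
        return outstanding > threshold();
    }
}
```

With the defaults, a healthy cluster never reaches the 10MM threshold because
the second NN checkpoints at 1MM; the check would only fire after that node has
been unavailable for a long time.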