[ 
https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914698#action_12914698
 ] 

Robert Chansler commented on HDFS-779:
--------------------------------------

Dhruba suggests creating a new operation mode in addition to normal and safe 
modes. His choice to call it "catastrophic mode" answers a lot of questions. I 
don't want to ever have to explain to Allen that the system is operating in 
"catastrophic mode."

Sure, some job(s) may be able to continue while the replication queue is filled 
with tens of millions of requests.  But other jobs will find the thousands of 
missing blocks, and each user who finds a missing block files a ticket, and 
each ticket about a missing block ends up in my in-box. No intuition suggests 
that there is much need for an operational state where user jobs are a good 
idea, but where the system is so degraded that replication is a bad idea.

So I'm very minimalist-minded. There should be a single parameter _N_. If the 
replication queue is longer than _N_ the system retreats to safe mode. All 
safe-mode rules apply, including the rules for leaving safe mode. The 
administrator can reset the value of _N_ at any time. Make the default value of 
_N_ be MAX_LONG for the folks who don't think this is a problem they have.
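
As a rough illustration of the single-parameter rule, here is a minimal Java 
sketch. The class and method names are hypothetical and do not come from the 
HDFS code base; Long.MAX_VALUE stands in for MAX_LONG, and the actual 
transition into safe mode is left to the caller.

```java
// Hypothetical sketch; none of these names exist in HDFS.
final class ReplicationQueueGuard {
    // N: the longest replication queue tolerated before retreating to
    // safe mode. Long.MAX_VALUE effectively disables the check, which
    // matches the proposed default.
    private volatile long maxQueueLength = Long.MAX_VALUE;

    // The administrator can reset N at any time; volatile makes the new
    // value visible to the thread that checks the queue.
    void setMaxQueueLength(long n) {
        if (n < 0) {
            throw new IllegalArgumentException("N must be non-negative");
        }
        maxQueueLength = n;
    }

    long getMaxQueueLength() {
        return maxQueueLength;
    }

    // Returns true when the queue is longer than N, meaning the system
    // should retreat to safe mode. Leaving safe mode is governed by the
    // existing safe-mode rules and is not modelled here.
    boolean shouldEnterSafeMode(long replicationQueueLength) {
        return replicationQueueLength > maxQueueLength;
    }
}
```

With the default, even a queue of tens of millions of requests does not 
trigger safe mode; once an administrator sets _N_ to, say, 1000000, a queue 
of 1000001 requests does.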

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both 
> the primary (and optional secondary) KDC go offline causing a replication 
> storm as the DataNodes' service tickets time out and they lose the ability to 
> connect to the NameNode. However, this is a specific case of a more general 
> problem of losing too many nodes too quickly. I think we should have an 
> option to go into safe mode if the cluster size goes down more than N% in 
> terms of DataNodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
