[ https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080148#comment-14080148 ]

Robert Chansler commented on HDFS-779:
--------------------------------------

It's a lot of fun revisiting the past while Allen is on a campaign to clean up 
Jira!

That this issue has been ignored for four years seems to be testimony that 
simultaneously losing multiple racks--without taking out the NameNode--is not 
a serious problem in practice. Still, my intuition says that if you ever get to 
a situation where the replication queue is multiple racks long, you are better 
off in safe mode than in a desperate scramble to do replication that may be 
doomed.

So, does anyone have a report of losing multiple racks and the system 
recovering--or not? 

> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both 
> the primary and the (optional) secondary KDC go offline, causing a replication 
> storm as the DataNodes' service tickets time out and they lose the ability to 
> connect to the NameNode. However, this is a specific case of a more general 
> problem of losing too many nodes too quickly. I think we should have an 
> option to go into safe mode if the cluster size goes down by more than N% in 
> terms of DataNodes.
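
For illustration only, the "more than N%" check proposed in the description 
could look roughly like the sketch below. Nothing here is Hadoop code: the 
class name ClusterShrinkMonitor, the method shouldEnterSafeMode, and the choice 
of the highest live DataNode count seen so far as the baseline are all 
assumptions. A real implementation would probably measure the drop over a time 
window to capture the "too quickly" part.

```java
// Hypothetical sketch; not part of the Hadoop code base.
final class ClusterShrinkMonitor {
    // The "N%" from the issue description, as a value between 0 and 100.
    private final double maxDropPercent;
    // Assumed baseline: the highest live DataNode count observed so far.
    private int peakLiveDataNodes;

    ClusterShrinkMonitor(double maxDropPercent) {
        if (maxDropPercent < 0.0 || maxDropPercent > 100.0) {
            throw new IllegalArgumentException("maxDropPercent must be in [0, 100]");
        }
        this.maxDropPercent = maxDropPercent;
    }

    // Records the current live DataNode count and reports whether the drop
    // from the baseline is strictly greater than the configured percentage.
    boolean shouldEnterSafeMode(int liveDataNodes) {
        if (liveDataNodes < 0) {
            throw new IllegalArgumentException("liveDataNodes must be >= 0");
        }
        if (liveDataNodes > peakLiveDataNodes) {
            peakLiveDataNodes = liveDataNodes;
        }
        if (peakLiveDataNodes == 0) {
            return false; // no DataNode has ever been seen, nothing to compare
        }
        double dropPercent =
            100.0 * (peakLiveDataNodes - liveDataNodes) / peakLiveDataNodes;
        return dropPercent > maxDropPercent;
    }
}
```

With N = 20 and a baseline of 100 DataNodes, this sketch stays out of safe 
mode at 80 live nodes and asks for safe mode at 79.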



--
This message was sent by Atlassian JIRA
(v6.2#6252)
