[ 
https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908629#action_12908629
 ] 

dhruba borthakur edited comment on HDFS-779 at 9/13/10 3:47 AM:
----------------------------------------------------------------

I would like to work on this one. A similar issue hit our cluster when a bad 
job started creating millions of files within a few minutes, causing the NN to 
become very sluggish processing create transactions and unable to process 
heartbeats from datanodes. Out of the 3000 datanodes in our cluster, about 1000 
lost heartbeats with the namenode, thus causing a replication storm.

1. The key is to detect the difference between a network partition event and 
datanode processes actually dying due to normal wear-and-tear. A few datanodes 
die every day, so there is a pretty well known rate of failure of datanodes. 
For these cases, it is best if the NN starts replicating blocks as soon as it 
detects that a datanode is dead. On the other hand, once in a while a 
catastrophic event occurs (e.g. a network partition, or a bad job causing 
denial-of-service for datanode heartbeats) that makes a bunch of datanodes 
disappear all at once. For these catastrophic scenarios, it is better if the NN 
does not start replication right away, otherwise it causes a replication storm; 
instead it should wait for a slightly extended time E to see if the 
catastrophic situation rectifies itself, otherwise the replication storm 
actually increases the duration of the downtime. It should not fall into 
safe-mode either, otherwise all existing jobs will fail; instead it should 
delay replication for the time period determined by E (a rough sketch of this 
delayed-replication behavior follows point 3 below).

2. The failure mode here is a *datanode* (not a block or a set of blocks), so 
it makes sense to base this heuristic on the number of datanodes, rather than 
the number of blocks. The number of missing blocks is not a proxy for the 
number of failed datanodes, because the datanodes in the cluster need not be in 
a balanced state when the catastrophic event occurred. Some datanodes could be 
very full while some other newly provisioned datanodes could be relatively 
empty.

3. The distinguishing factor between a catastrophic event and a regular one is 
tightly correlated to the rate of failure of datanodes. In a typical cluster 
of 3000 nodes, we see maybe 2 datanodes dying every day. On the other hand, a 
catastrophic event triggers the loss of many hundreds of datanodes within a 
span of a few minutes.
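
To make the delayed-replication behavior from point 1 concrete, here is a 
minimal sketch. This is not existing HDFS code; the class and method names 
(ReplicationDelayPolicy, onCatastrophicEvent, shouldDeferReplication) are 
hypothetical, and the only point is that the NN stays out of safe-mode but 
defers re-replication work for the window E.

{code}
// Minimal sketch (hypothetical names, not actual HDFS classes): when a
// catastrophic event is flagged, the NN keeps serving clients (no safe-mode)
// but defers re-replication of blocks from the lost datanodes for a
// configurable extended window E.
public class ReplicationDelayPolicy {

  private final long extendedDelayMs;            // the window E, configurable
  private volatile long catastropheDetectedAtMs = -1;

  public ReplicationDelayPolicy(long extendedDelayMs) {
    this.extendedDelayMs = extendedDelayMs;
  }

  /** Called when the dead-datanode heuristic declares a catastrophic event. */
  public void onCatastrophicEvent(long nowMs) {
    catastropheDetectedAtMs = nowMs;
  }

  /** The replication monitor checks this before scheduling re-replication. */
  public boolean shouldDeferReplication(long nowMs) {
    return catastropheDetectedAtMs >= 0
        && (nowMs - catastropheDetectedAtMs) < extendedDelayMs;
  }
}
{code}

If the partition heals within E, the re-appearing datanodes report their blocks 
and no replication storm happens; if it does not heal, replication proceeds 
after E just as it otherwise would have.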

My proposal is as follows:

Let the namenode remember the count of "lost heartbeats from datanodes" for the 
previous day as well as the current day:
  N1 = number of datanodes that died yesterday
  N2 = number of datanodes that have died so far today
  If (N2/N1 > m) then we declare it a catastrophic event.

When a catastrophic event is detected, the namenode delays replication for a 
slightly extended (configurable) period of time.
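
A rough sketch of the counting logic follows (again, hypothetical names and a 
standalone class rather than actual NN code; the day rollover and the 
divide-by-zero guard are my own assumptions):

{code}
// Sketch of the proposed heuristic: remember how many datanodes lost their
// heartbeat yesterday (N1) and so far today (N2); if N2/N1 exceeds the
// configurable multiple m, declare a catastrophic event.
public class CatastrophicEventDetector {

  private final double m;        // allowed multiple of yesterday's death count
  private long dayStartMs;       // start of the current one-day window
  private int deadYesterday;     // N1
  private int deadToday;         // N2

  public CatastrophicEventDetector(double m, long nowMs) {
    this.m = m;
    this.dayStartMs = nowMs;
  }

  /** Called by the heartbeat monitor each time a datanode is declared dead. */
  public synchronized boolean onDatanodeDead(long nowMs) {
    rollDayIfNeeded(nowMs);
    deadToday++;
    // Guard against a perfectly quiet yesterday (N1 == 0) dividing by zero.
    return (double) deadToday / Math.max(deadYesterday, 1) > m;
  }

  private void rollDayIfNeeded(long nowMs) {
    final long oneDayMs = 24L * 60 * 60 * 1000;
    if (nowMs - dayStartMs >= oneDayMs) {
      deadYesterday = deadToday;   // today's count becomes N1 for tomorrow
      deadToday = 0;
      dayStartMs = nowMs;
    }
  }
}
{code}

With the failure rates from point 3 (roughly 2 datanode deaths on a normal 
day), even a generous m such as 10 separates the two cases cleanly: a normal 
day never comes close to 20 deaths, while a catastrophic event that takes out 
hundreds of datanodes within minutes crosses the threshold almost immediately, 
at which point the NN switches to the delayed-replication behavior sketched 
above.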


> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
>                 Key: HDFS-779
>                 URL: https://issues.apache.org/jira/browse/HDFS-779
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Owen O'Malley
>            Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both 
> the primary (and optional secondary) KDC go offline causing a replication 
> storm as the DataNodes' service tickets time out and they lose the ability to 
> connect to the NameNode. However, this is a specific case of a more general 
> problem of losing too many nodes too quickly. I think we should have an 
> option to go into safe mode if the cluster size goes down more than N% in 
> terms of DataNodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
