[
https://issues.apache.org/jira/browse/HDFS-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908629#action_12908629
]
dhruba borthakur edited comment on HDFS-779 at 9/13/10 3:47 AM:
----------------------------------------------------------------
I would like to work on this one. A similar issue hit our cluster when a bad
job started creating millions of files within a few minutes, causing the NN to
become very sluggish processing create transactions and unable to keep up with
heartbeats from datanodes. Out of the 3000 datanodes in our cluster, about 1000
lost heartbeats with the namenode, thus causing a replication storm.
1. The key is to detect the difference between a network partition event versus
datanode processes actually dying due to normal wear-and-tear. A few datanodes
die every day, so there is a pretty well known rate of failure of datanodes.
For these cases, it is best if the NN starts replicating blocks as soon as it
detects that a datanode is dead. On the other hand, once in a while a
catastrophic event occurs (e.g. a network partition, or a bad job causing
denial-of-service for datanode heartbeats) that causes a bunch of datanodes to
disappear at the same time. For these catastrophic scenarios, it is better if
the NN does not start replication right away, otherwise it causes a replication
storm; instead it should wait for a slightly extended time E to see if the
catastrophic situation rectifies itself, otherwise the replication storm
actually increases the duration of the downtime. It should not fall into
safe-mode either, otherwise all existing jobs will fail; instead it should
delay replication for the time period determined by E (a rough sketch of this
delayed-replication behaviour follows after point 3).
2. The failure mode here is a *datanode* (not a block or a set of blocks), so
it makes sense to base this heuristic on the number of datanodes, rather than
the number of blocks. The number of missing blocks is not a proxy for the
number of failed datanodes, because the datanodes in the cluster need not be in
a balanced state when the catastrophic event occurred. Some datanodes could be
very full while other newly provisioned datanodes could be relatively empty.
3. The distinguishing factor between a catastrophic event and a regular one is
tightly correlated with the rate of failure of datanodes. In a typical cluster
of 3000 nodes, we see maybe 2 datanodes dying every day. On the other hand, a
catastrophic event triggers the loss of many hundreds of datanodes within a
span of a few minutes.
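Here is a rough sketch, in Java, of the delayed-replication behaviour described
in point 1. This is not actual NameNode code; the class and method names
(DelayedReplicationSketch, onDatanodeDead, scheduleReplication) and the
configuration value for the window E are purely illustrative assumptions.
{code:java}
import java.util.LinkedHashSet;
import java.util.Set;

public class DelayedReplicationSketch {

  private final long delayMsE;            // the slightly extended wait E, configurable
  private long eventStartMs = -1;         // when the suspected catastrophic event began
  private final Set<String> pendingDeadNodes = new LinkedHashSet<String>();

  public DelayedReplicationSketch(long delayMsE) {
    this.delayMsE = delayMsE;
  }

  // Called when the NN loses heartbeats from a datanode.
  public void onDatanodeDead(String nodeId, boolean catastrophicSuspected, long nowMs) {
    if (catastrophicSuspected) {
      if (eventStartMs < 0) {
        eventStartMs = nowMs;
      }
      pendingDeadNodes.add(nodeId);       // remember the node, but do not replicate yet
    } else {
      scheduleReplication(nodeId);        // normal wear-and-tear: replicate immediately
    }
  }

  // Called when a datanode re-registers; the situation may be rectifying itself.
  public void onDatanodeRejoined(String nodeId) {
    pendingDeadNodes.remove(nodeId);
  }

  // Called periodically; once E has elapsed, replicate whatever is still missing.
  public void tick(long nowMs) {
    if (eventStartMs >= 0 && nowMs - eventStartMs >= delayMsE) {
      for (String nodeId : pendingDeadNodes) {
        scheduleReplication(nodeId);
      }
      pendingDeadNodes.clear();
      eventStartMs = -1;
    }
  }

  private void scheduleReplication(String nodeId) {
    System.out.println("re-replicating blocks that lived on " + nodeId);
  }
}
{code}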
My proposal is as follows:
Let the namenode remember the count of "lost heartbeat from datanodes" in the
last day as well as the current day.
N1 = number of datanodes that died yesterday
N2 = number of datanodes that have died till now today
If (N2/N1 > m), where m is a configurable threshold, then we declare it a
catastrophic event.
When a catastrophic event is detected, the namenode delays replication for a
slightly extended (configurable) period of time.
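A minimal sketch, in Java, of this detection heuristic. Again, the class name
CatastrophicEventDetector and its methods are illustrative assumptions rather
than existing NameNode code; it just keeps the two daily counters N1 and N2 and
compares their ratio against m.
{code:java}
public class CatastrophicEventDetector {

  private int deathsYesterday = 0;   // N1: datanodes that died yesterday
  private int deathsToday = 0;       // N2: datanodes that have died so far today
  private final double m;            // configurable ratio threshold

  public CatastrophicEventDetector(double m) {
    this.m = m;
  }

  // Called whenever the NN declares a datanode dead (lost heartbeats).
  public void recordDatanodeDeath() {
    deathsToday++;
  }

  // Called once a day (e.g. at midnight) to roll the counters over.
  public void rollOverDay() {
    deathsYesterday = deathsToday;
    deathsToday = 0;
  }

  // Declare a catastrophic event when N2/N1 exceeds m.
  public boolean isCatastrophicEvent() {
    // Guard against divide-by-zero on a day with no failures yesterday;
    // treating N1 = 0 as N1 = 1 keeps the ratio meaningful.
    int n1 = Math.max(deathsYesterday, 1);
    return ((double) deathsToday / n1) > m;
  }
}
{code}
When isCatastrophicEvent() returns true, the namenode would hold off
replication for the configurable window E (as in the earlier sketch) instead of
entering safe-mode.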
> Automatic move to safe-mode when cluster size drops
> ---------------------------------------------------
>
> Key: HDFS-779
> URL: https://issues.apache.org/jira/browse/HDFS-779
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Reporter: Owen O'Malley
> Assignee: dhruba borthakur
>
> As part of looking at using Kerberos, we want to avoid the case where both
> the primary (and optional secondary) KDC go offline causing a replication
> storm as the DataNodes' service tickets time out and they lose the ability to
> connect to the NameNode. However, this is a specific case of a more general
> problem of losing too many nodes too quickly. I think we should have an
> option to go into safe mode if the cluster size goes down more than N% in
> terms of DataNodes.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.