This is a long-known issue: deleting files takes a lot of time, and the
datanode does not heartbeat during that time. Please file a JIRA so
that the issue percolates up :)
There are more cases like this that result in a datanode being marked dead.
As a workaround, you can double or triple heartbeat.recheck.interval
(default 5000 milliseconds) in the config.
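For example, a minimal sketch of what that might look like in hadoop-site.xml
on the namenode (the value here is only illustrative; use double or triple
whatever your cluster currently has, and note the namenode typically needs a
restart to pick up the change):

  <property>
    <name>heartbeat.recheck.interval</name>
    <!-- illustrative value: three times the 5000 ms default mentioned above -->
    <value>15000</value>
  </property>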
Raghu.
Jonathan Gray wrote:
In many of my jobs I create intermediate data in HDFS. I keep this around for
a number of days for inspection until I delete it in large batches.
If I attempt to delete all of it at once, the flood of delete messages to the
datanodes seems to cause starvation, as they stop sending heartbeats to the
namenode. There is less than 5% CPU utilization, and the logs just show the
deletion of blocks.
In the worst case (when I'm deleting a few terabytes, about 50% of total capacity
across 10 nodes), this causes the master to expire the datanode leases. Once
a datanode finishes its deletions, it reports back to the master and is added back to
the cluster. At that point, its blocks have already started to be re-replicated
elsewhere, so it then starts as an empty node. In one run, this happened to 8 out of 10
nodes before the cluster got back to a steady state. There were a couple of moments
during that run when a number of the blocks had replication 1.
Obviously I can handle this by deleting less at any one time, but it seems like
there might be something wrong. With almost no CPU utilization, why does the
datanode not respond to the namenode?
Thanks.
Jonathan Gray