This is a long-known issue: deleting files takes a lot of time, and the datanode
does not heartbeat during that time. Please file a jira so that the issue
percolates up :)

There are more cases like this that result in a datanode being marked dead.

As a workaround, you can double or triple heartbeat.recheck.interval (default
5000 milliseconds) in the config.
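
For illustration, in an old-style hadoop-site.xml that override would look
something like the sketch below. The property name and 5000 ms default are
taken from the message above; the 15000 value is simply triple the default.

    <!-- hadoop-site.xml override: triple the heartbeat recheck interval.
         15000 ms is assumed here as 3x the stated 5000 ms default. -->
    <property>
      <name>heartbeat.recheck.interval</name>
      <value>15000</value>
    </property>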

Raghu.

Jonathan Gray wrote:
In many of my jobs I create intermediate data in HDFS.  I keep this around for 
a number of days for inspection until I delete it in large batches.

If I attempt to delete all of it at once, the flood of delete messages to the 
datanodes seems to cause starvation, as they do not seem to respond to the 
namenode heartbeat.  CPU utilization is under 5%, and the logs just show the 
deletion of blocks.

In the worst case (when I'm deleting a few terabytes, about 50% of total capacity 
across 10 nodes), this causes the master to expire the datanode leases.  Once a 
datanode finishes its deletions, it reports back to the master and is added back 
to the cluster.  At that point, its blocks have already started to be reassigned, 
so it rejoins as an empty node.  In one run, this happened to 8 out of 10 nodes 
before getting back to a steady state.  There were a few moments during that run 
when a number of blocks had replication 1.

Obviously I can handle this by deleting less at any one time, but it seems like 
there might be something wrong.  With almost no CPU utilization, why does the 
datanode not respond to the namenode?
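
As a sketch of the delete-less-at-a-time workaround, a shell loop can remove
one subdirectory per pass with a pause in between, giving the datanodes room
to keep heartbeating.  The per-day directory layout below is hypothetical:

    # Hypothetical layout: intermediate data kept under per-day directories.
    # Remove one day at a time and sleep so datanodes can keep heartbeating.
    for day in 2008-11-01 2008-11-02 2008-11-03; do
      hadoop fs -rmr /user/jgray/intermediate/$day
      sleep 60
    done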

Thanks.

Jonathan Gray

