improve handling of datanode timeouts
-------------------------------------
Key: HDFS-2420
URL: https://issues.apache.org/jira/browse/HDFS-2420
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Ron Bodkin
If a datanode ever times out on a heartbeat, it gets marked dead permanently.
I am finding that on AWS this happens periodically, i.e., datanodes time
out although the datanode process is still alive. The current workaround
is to kill and restart each such process individually.
It would be good if there were more retry logic (e.g., blacklisting the nodes
but continuing to try heartbeats for a longer period before declaring them
dead). It would also be good if refreshNodes would check and attempt to recover
timed-out datanodes.
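
To make the request concrete, here is a rough sketch of the kind of retry
logic meant above. It is not taken from the HDFS code base: the class name
HeartbeatTracker, the SUSPECT state, and both thresholds are hypothetical and
chosen only for illustration. The idea is that a node's state is always
derived from its most recent heartbeat, so a node that goes quiet is first
blacklisted (SUSPECT), only later treated as DEAD, and returns to LIVE as soon
as a heartbeat arrives again.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these names exist in HDFS.
class HeartbeatTracker {
    enum State { LIVE, SUSPECT, DEAD }

    private final long suspectAfterMs; // silence before a node is blacklisted
    private final long deadAfterMs;    // silence before a node is considered dead
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    HeartbeatTracker(long suspectAfterMs, long deadAfterMs) {
        if (suspectAfterMs <= 0 || deadAfterMs <= suspectAfterMs) {
            throw new IllegalArgumentException(
                "need 0 < suspectAfterMs < deadAfterMs");
        }
        this.suspectAfterMs = suspectAfterMs;
        this.deadAfterMs = deadAfterMs;
    }

    // Record a heartbeat. Nothing here is sticky, so a late heartbeat
    // revives a node that was previously SUSPECT or DEAD.
    void heartbeat(String nodeId, long nowMs) {
        lastHeartbeatMs.put(nodeId, nowMs);
    }

    // Derive the state from how long the node has been silent.
    State stateOf(String nodeId, long nowMs) {
        Long last = lastHeartbeatMs.get(nodeId);
        if (last == null) {
            return State.DEAD; // never heard from this node
        }
        long silenceMs = nowMs - last;
        if (silenceMs >= deadAfterMs) {
            return State.DEAD;
        }
        if (silenceMs >= suspectAfterMs) {
            return State.SUSPECT; // blacklisted, but still given a chance
        }
        return State.LIVE;
    }
}
```

With thresholds of, say, 1 second and 5 seconds, a node that last reported at
t=0 would be SUSPECT at t=2s, DEAD at t=6s, and LIVE again right after a
heartbeat at t=7s. A refreshNodes-style pass could presumably re-evaluate
stateOf for every known node in the same way, though how that would fit into
the real namenode is an open question.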