[ http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427954 ]

Doug Cutting commented on HADOOP-181:
-------------------------------------

> If a switch goes down for 15 minutes [ ... ]

We'll currently have a lot of other problems if a switch goes down for 15 minutes. All of the other tasks will probably fail because DFS will no longer have complete copies of files.

Is a switch going down for 15 minutes really a case we need to optimize? Is it acceptable to lose a few hours' work on its hosts when a switch dies? When a switch fails, how long does it take to replace?

We can answer some of this fairly precisely. What is the MTBF for switches? How many switches would we have in a 10k-node system?

> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>
>                 Key: HADOOP-181
>                 URL: http://issues.apache.org/jira/browse/HADOOP-181
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.6.0
>
>         Attachments: lost-heartbeat.patch
>
>
> TaskTrackers should not close and restart themselves for having a late
> heartbeat. The JobTracker should just accept their current status.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
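
The questions in the comment about switch MTBF and switch count lend themselves to a back-of-envelope estimate. The following is a minimal sketch, assuming switch failures are independent and occur at a constant rate of 1/MTBF. The 40 hosts per switch, the 50,000-hour MTBF, and the 3 hours of lost work per host are placeholder values, not measurements, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def switch_count(nodes: int, hosts_per_switch: int) -> int:
    """Number of edge switches needed to connect `nodes` hosts."""
    return math.ceil(nodes / hosts_per_switch)


def expected_failures_per_year(switches: int, mtbf_hours: float) -> float:
    """Expected switch failures per year across the whole cluster.

    Assumes independent failures at a constant rate of 1/MTBF per switch.
    """
    return switches * HOURS_PER_YEAR / mtbf_hours


def expected_host_hours_lost(failures_per_year: float,
                             hosts_per_switch: int,
                             hours_lost_per_host: float) -> float:
    """Expected host-hours of work lost per year if every host behind a
    failed switch loses `hours_lost_per_host` hours of work."""
    return failures_per_year * hosts_per_switch * hours_lost_per_host


if __name__ == "__main__":
    nodes = 10_000           # from the comment: a 10k-node system
    hosts_per_switch = 40    # ASSUMPTION: placeholder rack size
    mtbf_hours = 50_000.0    # ASSUMPTION: placeholder switch MTBF
    hours_lost = 3.0         # ASSUMPTION: "a few hours" of work per host

    switches = switch_count(nodes, hosts_per_switch)
    failures = expected_failures_per_year(switches, mtbf_hours)
    lost = expected_host_hours_lost(failures, hosts_per_switch, hours_lost)
    fraction = lost / (nodes * HOURS_PER_YEAR)

    print(f"switches: {switches}")
    print(f"expected switch failures per year: {failures:.1f}")
    print(f"expected host-hours lost per year: {lost:.0f}")
    print(f"fraction of cluster capacity lost: {fraction:.2e}")
```

Plugging in measured values for MTBF and replacement time would show whether the lost work is a large enough share of cluster capacity to be worth optimizing for.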