[jira] Commented: (HADOOP-181) task trackers should not restart for having a late heartbeat

Doug Cutting (JIRA) Mon, 14 Aug 2006 12:56:17 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427954 ]
Doug Cutting commented on HADOOP-181:
-------------------------------------


> If a switch goes down for 15 minutes [ ... ]

We'll currently have a lot of other problems if a switch goes down for 15 
minutes.  All of the other tasks will probably fail because DFS will no longer 
have complete copies of files.

Is a switch going down for 15 minutes really a case we need to optimize for?  
Is it acceptable to lose a few hours' work on its hosts when a switch dies?  
When a switch fails, how long does it take to replace?

We can answer some of this fairly precisely.  What is the MTBF for switches?  
How many switches would we have in a 10k-node system?
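
As a rough sketch of that arithmetic (not measured data): the rack size and 
per-switch MTBF below are placeholder assumptions, and the function names are 
made up for illustration.  It assumes one switch per rack and independent 
failures at a constant rate of 1/MTBF.

```python
import math

HOURS_PER_YEAR = 24 * 365


def switch_count(nodes: int, nodes_per_switch: int) -> int:
    """Switches needed if each switch serves one full or partial rack."""
    return math.ceil(nodes / nodes_per_switch)


def expected_failures_per_year(switches: int, mtbf_hours: float) -> float:
    """Expected fleet-wide failures per year, assuming independent
    failures at a constant rate of 1/MTBF per switch."""
    return switches * HOURS_PER_YEAR / mtbf_hours


# Hypothetical inputs, chosen only to illustrate the calculation.
NODES = 10_000
NODES_PER_SWITCH = 40      # assumed rack size
MTBF_HOURS = 200_000.0     # assumed per-switch MTBF

switches = switch_count(NODES, NODES_PER_SWITCH)
failures = expected_failures_per_year(switches, MTBF_HOURS)
print(f"{switches} switches, roughly {failures:.2f} switch failures per year")
```

Plugging in real MTBF figures and the actual rack layout would show how often 
a switch outage should be expected, and so whether the case is worth handling 
specially.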

> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>
>                 Key: HADOOP-181
>                 URL: http://issues.apache.org/jira/browse/HADOOP-181
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.6.0
>
>         Attachments: lost-heartbeat.patch
>
>
> TaskTrackers should not close and restart themselves for having a late 
> heartbeat. The JobTracker should just accept their current status.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
