[ 
https://issues.apache.org/jira/browse/HADOOP-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Paranjpye updated HADOOP-1018:
-------------------------------------

    Component/s: mapred

> Single lost heartbeat leads to a "Lost task tracker"
> ----------------------------------------------------
>
>                 Key: HADOOP-1018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1018
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.10.0, 0.11.2, 0.12.0
>         Environment: Nutch trunk/ (Hadoop 0.10.0), Linux, JDK 1.5, a cluster 
> of 9 machines.
>            Reporter: Andrzej Bialecki 
>
> Under heavy load, a task tracker may lose the heartbeat response from the 
> JobTracker. The task tracker then resends its last heartbeat message, which 
> the job tracker treats as a "duplicate" and ignores. Because the task tracker 
> keeps resending the same heartbeat message, with the same id, over and over, 
> no "valid" message reaches the job tracker, so after a while it 
> considers the task tracker lost. The task tracker cannot recover from this 
> state and needs to be restarted.
> Looking at Hadoop trunk/, I believe this problem may still occur: in 
> JobTracker.java.heartbeat():992 the JobTracker should not ignore duplicate 
> messages but acknowledge them without processing. This would let the task 
> tracker sync its last heartbeat id back up with the last heartbeat id 
> remembered in the job tracker.
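
The suggested fix can be illustrated with a minimal Java sketch. This is not
the actual JobTracker code: the class names (HeartbeatSketch, Master,
Response), the integer id scheme, and the "next id" field in the reply are all
assumptions made for illustration. It only shows the idea of acknowledging a
duplicate heartbeat without processing it again, so that the sender can move
on to a new id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the heartbeat exchange; not the actual Hadoop classes.
class HeartbeatSketch {

    /** Reply to a tracker: the id to use next, and whether the report was processed. */
    static final class Response {
        final int nextId;
        final boolean processed;

        Response(int nextId, boolean processed) {
            this.nextId = nextId;
            this.processed = processed;
        }
    }

    /** Stand-in for the job tracker side of the exchange. */
    static final class Master {
        // Last heartbeat id seen from each tracker (assumed bookkeeping).
        private final Map<String, Integer> lastId = new HashMap<>();
        private int processedCount = 0;

        Response heartbeat(String tracker, int id) {
            Integer last = lastId.get(tracker);
            if (last != null && last == id) {
                // Duplicate: the earlier reply was probably lost. Acknowledge it
                // again without reprocessing, so the tracker can advance its id.
                return new Response(id + 1, false);
            }
            lastId.put(tracker, id);
            processedCount++; // stands in for the real status processing
            return new Response(id + 1, true);
        }

        int processedCount() {
            return processedCount;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        int id = 0;

        master.heartbeat("tracker-1", id);                  // reply lost in transit
        Response retry = master.heartbeat("tracker-1", id); // tracker resends same id
        id = retry.nextId;                                  // tracker re-syncs its id
        Response next = master.heartbeat("tracker-1", id);  // fresh heartbeat

        System.out.println("retry processed: " + retry.processed);
        System.out.println("next processed: " + next.processed);
        System.out.println("reports processed: " + master.processedCount());
    }
}
```

In this sketch the resent heartbeat is answered but not counted, and the
heartbeat that follows it is processed normally, so the tracker never gets
stuck repeating an id that the master silently drops.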

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
