[ https://issues.apache.org/jira/browse/HADOOP-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sameer Paranjpye updated HADOOP-1018:
-------------------------------------

    Component/s: mapred

> Single lost heartbeat leads to a "Lost task tracker"
> ----------------------------------------------------
>
>                 Key: HADOOP-1018
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1018
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.10.0, 0.11.2, 0.12.0
>         Environment: Nutch trunk/ (Hadoop 0.10.0), Linux, JDK 1.5, a cluster
> of 9 machines.
>            Reporter: Andrzej Bialecki
>
> Under heavy load, a task tracker may lose the heartbeat response from the
> JobTracker. The task tracker then tries to resend the last heartbeat message,
> which the job tracker treats as a "duplicate" and ignores. Since the task
> tracker keeps resending the same heartbeat message, with the same id, over and
> over again, no "valid" messages reach the job tracker, so after a while it
> considers the task tracker to be lost. The task tracker cannot recover from
> this state and needs to be restarted.
>
> Looking at Hadoop trunk/ I believe this problem may still occur - in
> JobTracker.java.heartbeat():992 the JobTracker should not ignore duplicate
> messages but acknowledge them without processing. This would cause the task
> tracker to sync its last heartbeat id back with the last heartbeat id
> remembered in the job tracker.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
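
The fix proposed in the description (acknowledge a duplicate heartbeat without processing it again) can be sketched roughly as follows. This is only a minimal illustration, not the code in JobTracker.java: the class HeartbeatMaster, its Response type, and all method and field names are hypothetical. It also assumes that a duplicate counts as contact from the tracker for expiry purposes, which the report does not state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; these names are illustrative and are not Hadoop's classes. */
class HeartbeatMaster {

    /** Reply to a tracker: the id it should send with its next heartbeat. */
    static final class Response {
        final int nextId;
        final boolean replayed; // true if this reply acknowledges a duplicate

        Response(int nextId, boolean replayed) {
            this.nextId = nextId;
            this.replayed = replayed;
        }
    }

    private final Map<String, Integer> lastProcessedId = new HashMap<String, Integer>();
    private final Map<String, Long> lastContactMillis = new HashMap<String, Long>();
    private int processedCount = 0;

    synchronized Response heartbeat(String tracker, int id, long nowMillis) {
        // Assumption: any heartbeat, duplicate or not, shows the tracker is alive.
        lastContactMillis.put(tracker, nowMillis);

        Integer last = lastProcessedId.get(tracker);
        if (last != null && last.intValue() == id) {
            // Duplicate: the tracker never received our previous reply.
            // Acknowledge it again without reprocessing, so the tracker can
            // move on to the next id instead of resending this one forever.
            return new Response(id + 1, true);
        }

        // New heartbeat: process it (only counted here) and remember its id.
        lastProcessedId.put(tracker, id);
        processedCount++;
        return new Response(id + 1, false);
    }

    /** A tracker is considered lost if it has been silent longer than expiryMillis. */
    synchronized boolean isLost(String tracker, long nowMillis, long expiryMillis) {
        Long seen = lastContactMillis.get(tracker);
        return seen == null || nowMillis - seen.longValue() > expiryMillis;
    }

    synchronized int processedCount() {
        return processedCount;
    }
}
```

With this shape, a tracker that resends heartbeat 0 after losing the reply gets the same "next id is 1" answer a second time, the heartbeat is processed only once, and the tracker is not expired while it keeps retrying.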