[ https://issues.apache.org/jira/browse/HADOOP-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1930:
----------------------------------

    Attachment: HADOOP-1930_1_20070922.patch

Ok, I take my previous comment back. It is a tad more involved, since we are calling {{JobInProgress.failedTask}} from {{JobInProgress.fetchFailureNotification}} with the wrong {{trackerName}}. However, at worst it leads to a couple of trackers being wrongly blacklisted, i.e. penalized for failed tasks.

The attached patch should fix it; I need to test this extensively...

> Too many fetch-failures issue
> -----------------------------
>
>                 Key: HADOOP-1930
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1930
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>         Attachments: HADOOP-1930_1_20070922.patch
>
>
> A job with 4000 maps on a 1400 node cluster (3 tasks per node allowed) had a
> lot (150) of 'Too many fetch-failures' map failures.
> From the jobtracker log it looks as if it got confused which tasktracker
> actually ran the task:
> (In the following log output, I replaced the corresponding tasktracker nodes
> with ***node_assigned*** and ***node_fetch_attempt*** and they are different)
> grep task_200709170247_0018_m_000009_0
> hadoop-xxx-jobtracker-node.log.2007-09-19:
> 2007-09-19 15:52:26,907 INFO org.apache.hadoop.mapred.JobTracker: Adding task
> 'task_200709170247_0018_m_000009_0' to tip tip_200709170247_0018_m_000009,
> for tracker 'tracker_***node_assigned_***:/127.0.0.1:54523'
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.TaskRunner: Saved
> output of task 'task_200709170247_0018_m_000009_0' to hdfs://location
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'task_200709170247_0018_m_000009_0' has completed
> tip_200709170247_0018_m_000009 successfully.
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.TaskInProgress: Task
> 'task_200709170247_0018_m_000009_0' has completed succesfully
> 2007-09-19 16:21:07,825 INFO org.apache.hadoop.mapred.JobInProgress: Failed
> fetch notification #1 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:23:23,483 INFO org.apache.hadoop.mapred.JobInProgress: Failed
> fetch notification #2 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.JobInProgress: Failed
> fetch notification #3 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.JobInProgress: Too many
> fetch-failures for output of task: task_200709170247_0018_m_000009_0 ...
> killing it
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> from task_200709170247_0018_m_000009_0: Too many fetch-failures
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.TaskInProgress: Task
> 'task_200709170247_0018_m_000009_0' has been lost.
> 2007-09-19 16:25:07,184 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'task_200709170247_0018_m_000009_0' from
> 'tracker_***node_fetch_attempt***:/127.0.0.1:48818'
> 2007-09-19 21:40:00,235 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'task_200709170247_0018_m_000009_0' from
> 'tracker_***node_fetch_attempt***:/127.0.0.1:48818'
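To make the {{trackerName}} mix-up described in the comment easier to picture, here is a minimal standalone sketch. It is not the Hadoop code and not the attached patch: the class {{FetchFailureSketch}}, its fields, and the simplified method signatures are all hypothetical, and it assumes that the wrong name being passed is that of the tracker reporting the failed fetch, which is what the log excerpt seems to suggest. The threshold of 3 is taken from the log, where the map is failed on notification #3.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily simplified model of the bookkeeping; not Hadoop's JobInProgress.
public class FetchFailureSketch {
    // Assumed threshold: the log shows the map being failed on notification #3.
    static final int MAX_FETCH_FAILURES = 3;

    // Map attempt id -> tracker that actually ran the attempt.
    final Map<String, String> trackerForTask = new HashMap<>();
    // Map attempt id -> number of failed-fetch notifications seen so far.
    final Map<String, Integer> fetchFailures = new HashMap<>();
    // Tracker name -> number of failed tasks charged to it.
    final Map<String, Integer> penalties = new HashMap<>();

    void taskAssigned(String taskId, String trackerName) {
        trackerForTask.put(taskId, trackerName);
    }

    // Charges one failed task to the named tracker.
    void failedTask(String taskId, String trackerName) {
        penalties.merge(trackerName, 1, Integer::sum);
    }

    // Called when a reducer on reportingTracker could not fetch the map output.
    void fetchFailureNotification(String mapTaskId, String reportingTracker) {
        int count = fetchFailures.merge(mapTaskId, 1, Integer::sum);
        if (count < MAX_FETCH_FAILURES) {
            return;
        }
        // The mistaken variant would be: failedTask(mapTaskId, reportingTracker);
        // Instead, look up the tracker that ran the map and charge that one.
        String owner = trackerForTask.get(mapTaskId);
        if (owner != null) {
            failedTask(mapTaskId, owner);
        }
        fetchFailures.remove(mapTaskId);
    }

    public static void main(String[] args) {
        FetchFailureSketch job = new FetchFailureSketch();
        job.taskAssigned("m_000009_0", "tracker_node_assigned");
        for (int i = 0; i < MAX_FETCH_FAILURES; i++) {
            job.fetchFailureNotification("m_000009_0", "tracker_node_fetch_attempt");
        }
        System.out.println(job.penalties);
    }
}
```

In this sketch only the tracker recorded at assignment time is penalized; the tracker that merely reported the failed fetch is left alone. Whether the real patch takes this approach is not stated in the comment.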