[ https://issues.apache.org/jira/browse/HADOOP-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-979:
---------------------------------

Comment: was deleted

> speculative task failure can kill jobs
> --------------------------------------
>
>                 Key: HADOOP-979
>                 URL: https://issues.apache.org/jira/browse/HADOOP-979
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.11.0
>            Reporter: Owen O'Malley
>             Fix For: 0.12.0
>
>
> We had a case where the random writer example was killed by speculative execution. It happened like:
> task_0001_m_000123_0 -> starts
> task_0001_m_000123_1 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_2 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_3 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_4 -> starts and fails because attempt 0 is creating the file
> job_0001 is killed because map_000123 failed 4 times. From this experience, I think we should change the scheduling so that:
> 1. Tasks are only allowed 1 speculative attempt.
> 2. TIPs don't kill jobs until they have 4 failures AND the last task under that TIP fails.
> Thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
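
[Editor's note] For illustration, a minimal Java sketch of the two proposed scheduling rules. This is not the actual JobTracker/TaskInProgress code; the class, method, and counter names (failedAttempts, runningAttempts, speculativeAttemptsLaunched) are hypothetical stand-ins for state a TIP would track.

```java
/**
 * Sketch of the policy proposed in the issue description (assumed names, not Hadoop source):
 *   1. a task may have at most one speculative attempt;
 *   2. a TIP only fails the job after 4 failed attempts AND once no attempt is still running.
 */
public class SpeculativeFailurePolicy {

    /** Proposal 1: at most one speculative attempt per task. */
    private static final int MAX_SPECULATIVE_ATTEMPTS = 1;

    /** A TIP must accumulate this many failed attempts before it can fail the job. */
    private static final int MAX_TASK_FAILURES = 4;

    /** Launch a speculative attempt only if none has been launched yet. */
    public boolean canLaunchSpeculative(int speculativeAttemptsLaunched) {
        return speculativeAttemptsLaunched < MAX_SPECULATIVE_ATTEMPTS;
    }

    /**
     * Proposal 2: fail the job only when the TIP has at least MAX_TASK_FAILURES
     * failures AND no attempt is still running, so a healthy attempt 0 can
     * finish even while its speculative clones fail.
     */
    public boolean shouldFailJob(int failedAttempts, int runningAttempts) {
        return failedAttempts >= MAX_TASK_FAILURES && runningAttempts == 0;
    }
}
```

Under this sketch, the random-writer scenario above would not kill job_0001: attempt 0 is still running when the clones fail, so shouldFailJob(4, 1) returns false, and rule 1 would have prevented attempts 2 through 4 from being launched in the first place.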