[jira] [Commented] (TEZ-3102) Fetch failure of a speculated task causes job hang

Jason Lowe (JIRA) Fri, 12 Feb 2016 12:21:17 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145224#comment-15145224
 ]


Jason Lowe commented on TEZ-3102:
---------------------------------

Note that this isn't an attempt that finished and was then retroactively killed. 
Rather, it's an active attempt that is being killed, and the task receives the 
kill event while already in the SUCCEEDED state.  The logic for retroactively 
killing a successful attempt that already completed is correct -- it will clear 
the task status and reschedule a new attempt, so I don't think there's a bug in 
that case.

bq. What if we try to execute the entire body of the AttemptKilledTransition 
from RetroActiveKilledTransition for both cases?

The problem is that we would then schedule a new attempt after the task has 
already succeeded, at the point when the speculative attempt is killed. Here's 
the scenario:
# Attempt 1 succeeds, the task goes to the SUCCEEDED state, and we send a kill 
to attempt 2
# Attempt 2 is killed and sends an attempt-killed event back to the task, which 
is in the SUCCEEDED state
# The reused AttemptKilled.transition logic will schedule a new attempt since 
task.shouldScheduleNewAttempt() will return true (as there will no longer be 
any active attempt).
# Now we have an unnecessary task attempt running for a successful task
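
To make the scenario concrete, here is a minimal toy sketch in Java. It is not the Tez code: the class, enums, and method names (SpeculationSketch, Task, attemptKilledReusingFullLogic, and so on) are hypothetical, and shouldScheduleNewAttempt() here only models the assumption stated above, namely that it returns true once no attempt is active.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these names are hypothetical and do not mirror Tez's TaskImpl.
final class SpeculationSketch {
    enum TaskState { RUNNING, SUCCEEDED }
    enum AttemptState { RUNNING, SUCCEEDED, KILLED }

    static final class Task {
        TaskState state = TaskState.RUNNING;
        final List<AttemptState> attempts = new ArrayList<>();

        // Starts a new attempt and returns its index.
        int launchAttempt() {
            attempts.add(AttemptState.RUNNING);
            return attempts.size() - 1;
        }

        // Counts attempts that are still running.
        int activeAttempts() {
            int active = 0;
            for (AttemptState s : attempts) {
                if (s == AttemptState.RUNNING) {
                    active++;
                }
            }
            return active;
        }

        // Assumed rule from the scenario: true once no attempt is active.
        boolean shouldScheduleNewAttempt() {
            return activeAttempts() == 0;
        }

        // A successful attempt moves the task to SUCCEEDED.
        void attemptSucceeded(int id) {
            attempts.set(id, AttemptState.SUCCEEDED);
            state = TaskState.SUCCEEDED;
        }

        // Full kill handling applied without looking at the task state.
        void attemptKilledReusingFullLogic(int id) {
            attempts.set(id, AttemptState.KILLED);
            if (shouldScheduleNewAttempt()) {
                launchAttempt();
            }
        }
    }

    public static void main(String[] args) {
        Task task = new Task();
        int attempt1 = task.launchAttempt();
        int attempt2 = task.launchAttempt();           // speculative attempt
        task.attemptSucceeded(attempt1);               // step 1
        task.attemptKilledReusingFullLogic(attempt2);  // steps 2 and 3
        System.out.println("task state: " + task.state);
        System.out.println("active attempts: " + task.activeAttempts());
    }
}
```

Running main leaves the toy task in SUCCEEDED with one attempt still active, which is the unnecessary attempt from the last step of the scenario.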

> Fetch failure of a speculated task causes job hang
> --------------------------------------------------
>
>                 Key: TEZ-3102
>                 URL: https://issues.apache.org/jira/browse/TEZ-3102
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3102.001.patch
>
>
> If a task speculates then succeeds, one task will be marked successful and 
> the other killed. Then if the task retroactively fails due to fetch failures 
> the Tez AM will fail to reschedule another task. This results in a hung job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
