[
https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145224#comment-15145224
]
Jason Lowe commented on TEZ-3102:
---------------------------------
Note that this isn't an attempt that finished and was then retroactively killed.
Rather, it's an active attempt that is being killed, and the task receives the
kill event while already in the SUCCEEDED state. The logic for retroactively
killing a successful attempt that already completed is correct: it will clear
the task status and reschedule a new attempt, so I don't think there's a bug in
that case.
bq. What if we try to execute the entire body of the AttemptKilledTransition
from RetroActiveKilledTransition for both cases?
The problem is that we would then schedule a new attempt after the task has
already succeeded, at the point the speculative attempt is killed. Here's the scenario:
# Attempt 1 succeeds, the task goes to the SUCCEEDED state, and we send a kill to
attempt 2
# Attempt 2 is killed and sends a killed-attempt event back to the task, which is
in the SUCCEEDED state
# The reused AttemptKilled.transition logic will schedule a new attempt, since
task.shouldScheduleNewAttempt() will return true (as there will no longer be
any active attempt).
# Now we have an unnecessary task attempt running for a successful task
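To make the scenario above concrete, here is a minimal toy model of it. This is a sketch and not the Tez code: the Task class, its fields, and the two attemptKilled* methods are hypothetical, and shouldScheduleNewAttempt() is only assumed to mean "no attempt is currently active", as described in step 3. The state-aware variant is included purely for contrast within the toy model; it is not necessarily what the attached patch does.

```java
import java.util.HashSet;
import java.util.Set;

public class SpeculativeKillSketch {
    enum TaskState { RUNNING, SUCCEEDED }

    // Hypothetical, heavily simplified stand-in for a task with attempts.
    static class Task {
        TaskState state = TaskState.RUNNING;
        final Set<Integer> activeAttempts = new HashSet<>();
        int nextAttemptId = 1;
        int scheduledCount = 0;

        int scheduleAttempt() {
            int id = nextAttemptId++;
            activeAttempts.add(id);
            scheduledCount++;
            return id;
        }

        // Assumption: a new attempt is wanted whenever none is active.
        boolean shouldScheduleNewAttempt() {
            return activeAttempts.isEmpty();
        }

        void attemptSucceeded(int id) {
            activeAttempts.remove(id);
            state = TaskState.SUCCEEDED;
        }

        // Models reusing the "attempt killed" logic without regard
        // to the task state.
        void attemptKilledReusingLogic(int id) {
            activeAttempts.remove(id);
            if (shouldScheduleNewAttempt()) {
                scheduleAttempt();
            }
        }

        // Contrast case: skip rescheduling when the task already succeeded.
        void attemptKilledStateAware(int id) {
            activeAttempts.remove(id);
            if (state != TaskState.SUCCEEDED && shouldScheduleNewAttempt()) {
                scheduleAttempt();
            }
        }
    }

    static Task runScenario(boolean reuseKilledLogic) {
        Task task = new Task();
        int first = task.scheduleAttempt();        // attempt 1
        int speculative = task.scheduleAttempt();  // attempt 2 (speculative)
        task.attemptSucceeded(first);              // step 1: task is SUCCEEDED
        if (reuseKilledLogic) {                    // step 2: kill event arrives
            task.attemptKilledReusingLogic(speculative);
        } else {
            task.attemptKilledStateAware(speculative);
        }
        return task;
    }

    public static void main(String[] args) {
        System.out.println("reused logic, attempts scheduled: "
                + runScenario(true).scheduledCount);
        System.out.println("state-aware, attempts scheduled: "
                + runScenario(false).scheduledCount);
    }
}
```

In this model the reused logic schedules a third attempt even though the task is already SUCCEEDED, which corresponds to step 4 above, while the state-aware variant stops at two attempts.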
> Fetch failure of a speculated task causes job hang
> --------------------------------------------------
>
> Key: TEZ-3102
> URL: https://issues.apache.org/jira/browse/TEZ-3102
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3102.001.patch
>
>
> If a task speculates then succeeds, one task will be marked successful and
> the other killed. Then if the task retroactively fails due to fetch failures
> the Tez AM will fail to reschedule another task. This results in a hung job.