[
https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145437#comment-15145437
]
Bikas Saha commented on TEZ-3102:
---------------------------------
>> Note that this isn't an attempt that finished then was retroactively killed,
>> rather it's an active attempt that is being killed and the task receives the
>> kill event while already in the SUCCEEDED state.
Yes. I understand that.
>> The logic for retroactively killing a successful attempt that already
>> completed is correct – it will clear the task status and reschedule a new
>> attempt, so I don't think there's a bug in that case
Yes, that holds as of today, but it's susceptible to going out of sync. The
patch syncs up 2 cases but not this third case (because it already duplicated
the code).
>> Reused AttemptKilled.transition logic will schedule a new attempt since
>> task.shouldScheduleNewAttempt() will return true (as there will no longer be
>> any active attempt).
It will not, since shouldScheduleNewAttempt() also checks whether a successful
attempt exists.
{code}
private boolean shouldScheduleNewAttempt() {
  return (getUncompletedAttemptsCount() == 0
      && successfulAttempt == null);
}
{code}
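To make the point concrete, here is a minimal standalone sketch of that check.
It is not the actual Tez task code: the TaskSketch class, its counter, and the
launchAttempt/attemptSucceeded/clearSuccessfulAttempt helpers are hypothetical
names made up for illustration. Only the shape of shouldScheduleNewAttempt()
follows the snippet above.

```java
// Hypothetical, simplified model of a task. NOT the real Tez implementation;
// everything except the shape of shouldScheduleNewAttempt() is an assumption.
class TaskSketch {
    private int uncompletedAttempts = 0;       // stands in for getUncompletedAttemptsCount()
    private Integer successfulAttempt = null;  // id of the successful attempt, or null

    void launchAttempt() {
        uncompletedAttempts++;
    }

    void attemptSucceeded(int attemptId) {
        uncompletedAttempts--;
        successfulAttempt = attemptId;
    }

    // Hypothetical helper: forget the recorded successful attempt.
    void clearSuccessfulAttempt() {
        successfulAttempt = null;
    }

    boolean shouldScheduleNewAttempt() {
        return uncompletedAttempts == 0 && successfulAttempt == null;
    }

    public static void main(String[] args) {
        TaskSketch task = new TaskSketch();
        task.launchAttempt();
        task.attemptSucceeded(0);

        // No active attempts remain, but a successful attempt is still recorded.
        System.out.println(task.shouldScheduleNewAttempt()); // prints false

        // Only once the successful attempt is cleared does the check pass.
        task.clearSuccessfulAttempt();
        System.out.println(task.shouldScheduleNewAttempt()); // prints true
    }
}
```

In this sketch, having zero active attempts is not enough on its own: as long
as a successful attempt is still recorded, the check returns false and nothing
new gets scheduled.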
Does that clarify?
> Fetch failure of a speculated task causes job hang
> --------------------------------------------------
>
> Key: TEZ-3102
> URL: https://issues.apache.org/jira/browse/TEZ-3102
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3102.001.patch
>
>
> If a task speculates then succeeds, one task will be marked successful and
> the other killed. Then if the task retroactively fails due to fetch failures
> the Tez AM will fail to reschedule another task. This results in a hung job.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)