[jira] [Commented] (TEZ-3102) Fetch failure of a speculated task causes job hang

Bikas Saha (JIRA) Fri, 12 Feb 2016 14:29:46 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145437#comment-15145437
 ]


Bikas Saha commented on TEZ-3102:
---------------------------------

>> Note that this isn't an attempt that finished then was retroactively killed, 
>> rather it's an active attempt that is being killed and the task receives the 
>> kill event while already in the SUCCEEDED state.
Yes. I understand that.

>> The logic for retroactively killing a successful attempt that already 
>> completed is correct – it will clear the task status and reschedule a new 
>> attempt, so I don't think there's a bug in that case
Yes, as of today, but it's susceptible to going out of sync. The patch syncs 
up two cases but not this third case (because it already duplicated the code).

>> Reused AttemptKilled.transition logic will schedule a new attempt since 
>> task.shouldScheduleNewAttempt() will return true (as there will no longer be 
>> any active attempt).
It will not, since shouldScheduleNewAttempt() also checks whether a successful 
attempt exists.
{code}
private boolean shouldScheduleNewAttempt() {
  return (getUncompletedAttemptsCount() == 0
          && successfulAttempt == null);
}
{code}
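
To make the condition concrete, here is a minimal sketch of the bookkeeping. It is a simplified, hypothetical model and not the actual Tez task implementation: the class TaskSketch and the helpers startAttempt(), attemptSucceeded(), attemptKilled() and clearSuccessfulAttempt() are made up for illustration. Only the condition inside shouldScheduleNewAttempt() is taken from the method quoted above.

```java
// Hypothetical, simplified model of per-task attempt bookkeeping.
// This is NOT the Tez implementation; names other than
// shouldScheduleNewAttempt() are assumptions for illustration.
final class TaskSketch {
    private int uncompletedAttempts;    // attempts that are still active
    private Integer successfulAttempt;  // id of the succeeded attempt, or null

    void startAttempt() {
        uncompletedAttempts++;
    }

    void attemptSucceeded(int attemptId) {
        uncompletedAttempts--;
        successfulAttempt = attemptId;
    }

    void attemptKilled() {
        uncompletedAttempts--;
    }

    // Hypothetical stand-in for "clear the task status".
    void clearSuccessfulAttempt() {
        successfulAttempt = null;
    }

    // Same condition as the quoted method.
    boolean shouldScheduleNewAttempt() {
        return uncompletedAttempts == 0 && successfulAttempt == null;
    }

    public static void main(String[] args) {
        TaskSketch task = new TaskSketch();
        task.startAttempt();        // original attempt
        task.startAttempt();        // speculative attempt
        task.attemptSucceeded(0);   // one attempt succeeds
        task.attemptKilled();       // the other attempt is killed

        // No active attempts remain, but a successful attempt is recorded.
        System.out.println(task.shouldScheduleNewAttempt());

        // Only once the successful attempt is cleared does the check pass.
        task.clearSuccessfulAttempt();
        System.out.println(task.shouldScheduleNewAttempt());
    }
}
```

In this model, having no active attempts is not enough on its own: the check stays false until the recorded successful attempt is cleared.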

Does that clarify?

> Fetch failure of a speculated task causes job hang
> --------------------------------------------------
>
>                 Key: TEZ-3102
>                 URL: https://issues.apache.org/jira/browse/TEZ-3102
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3102.001.patch
>
>
> If a task speculates then succeeds, one task will be marked successful and 
> the other killed. Then if the task retroactively fails due to fetch failures 
> the Tez AM will fail to reschedule another task. This results in a hung job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
