[ 
https://issues.apache.org/jira/browse/TEZ-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145094#comment-15145094
 ] 

Bikas Saha commented on TEZ-3102:
---------------------------------

Good catch! The bug was that the standard attempt-killed processing was not 
being applied to retroactively killed attempts.

Reading the code, it looks like some similar inconsistency may still be 
possible if the successful attempt gets retroactively killed (the patch is 
fixing the case where the non-successful attempt is retroactively killed). 

What if we execute the entire body of AttemptKilledTransition from 
RetroActiveKilledTransition for both cases? E.g. 
    if (matches success) { unsucceed(); notifyVertex(); } 
    AttemptKilled.transitionLogic(); // common path, runs for both cases
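
To make the suggestion above concrete, here is a minimal, self-contained sketch. 
It is not the actual TaskImpl code: TaskSketch, attemptKilledLogic, 
retroactiveKill and the string log are hypothetical stand-ins, and the rule 
"reschedule when no successful attempt remains" is an assumption made only for 
illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a task; NOT the real Tez TaskImpl.
class TaskSketch {
    final List<String> log = new ArrayList<>();
    Integer successfulAttempt; // id of the attempt marked successful, or null
    int killedAttempts;

    // Stand-in for the shared body of the normal attempt-killed transition.
    void attemptKilledLogic(int attemptId) {
        killedAttempts++;
        log.add("killed:" + attemptId);
        // Assumption for this sketch: reschedule only if nothing has succeeded.
        if (successfulAttempt == null) {
            log.add("reschedule");
        }
    }

    // Stand-in for the retroactive-kill transition.
    void retroactiveKill(int attemptId) {
        if (successfulAttempt != null && successfulAttempt == attemptId) {
            successfulAttempt = null;   // "unsucceed()"
            log.add("notifyVertex");    // "notifyVertex"
        }
        // Both cases then go through the same killed processing.
        attemptKilledLogic(attemptId);
    }
}
```

With this shape, a retroactive kill of the successful attempt and of the 
non-successful attempt differ only in the extra un-succeed step, so the two 
paths cannot drift apart in their killed handling.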

Thoughts?


> Fetch failure of a speculated task causes job hang
> --------------------------------------------------
>
>                 Key: TEZ-3102
>                 URL: https://issues.apache.org/jira/browse/TEZ-3102
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3102.001.patch
>
>
> If a task speculates then succeeds, one task will be marked successful and 
> the other killed. Then if the task retroactively fails due to fetch failures 
> the Tez AM will fail to reschedule another task. This results in a hung job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
