[ 
https://issues.apache.org/jira/browse/TEZ-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-2958:
----------------------------
    Attachment: TEZ-2958.001.patch

We recently ran into this in a recovery scenario, and it was very confusing to 
users.  Attaching a patch that should emit a new task attempt finished event 
when we can't recover the task output, so that a killed event overrides the 
previously recorded successful event.

I believe this will also fix another bug.  I noticed that, before this patch, 
a task attempt that has been recovered will never emit another task finished 
event because recoverData != null.  Therefore I believe the following scenario 
can happen:
- Task attempt succeeds
- AM crashes
- Task attempt is recovered in the SUCCEEDED state with recoverData != null
- Task attempt is retroactively failed due to fetch failures
- Task attempt will _not_ emit an unsuccessful completion event because 
recoverData != null, so it will remain in the UI as succeeded.  The task will 
end up with multiple successful attempts.
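
To make the scenario concrete, here is a rough standalone model of it.  This 
is not the Tez code or the patch itself: the class, the HistoryLog type, and 
the emitWhenRecovered flag are all made up for illustration, and the 
"recovered" field only stands in for the recoverData != null check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the scenario above; none of these names are Tez APIs.
public class RecoveredAttemptSketch {
    enum State { SUCCEEDED, FAILED, KILLED }

    // Stand-in for the history that the UI reads task attempt finished events from.
    static final class HistoryLog {
        final List<State> finishedEvents = new ArrayList<>();

        // The UI is assumed to show the most recent finished event for the attempt.
        State lastReported() {
            return finishedEvents.isEmpty()
                ? null
                : finishedEvents.get(finishedEvents.size() - 1);
        }
    }

    static final class Attempt {
        final HistoryLog log;
        final boolean recovered;          // models "recoverData != null"
        final boolean emitWhenRecovered;  // false models the behavior described before the patch
        State state;

        Attempt(HistoryLog log, boolean recovered, boolean emitWhenRecovered) {
            this.log = log;
            this.recovered = recovered;
            this.emitWhenRecovered = emitWhenRecovered;
        }

        // Move to a terminal state and, unless suppressed, record a finished event.
        void finish(State newState) {
            state = newState;
            if (!recovered || emitWhenRecovered) {
                log.finishedEvents.add(newState);
            }
        }
    }

    // Replays: attempt succeeds, AM crashes, attempt is recovered as SUCCEEDED,
    // then the attempt is retroactively failed due to fetch failures.
    static State replayScenario(boolean emitWhenRecovered) {
        HistoryLog log = new HistoryLog();

        Attempt original = new Attempt(log, false, emitWhenRecovered);
        original.finish(State.SUCCEEDED);

        // After the AM restart the attempt is rebuilt from recovery data.
        Attempt recoveredAttempt = new Attempt(log, true, emitWhenRecovered);
        recoveredAttempt.state = State.SUCCEEDED;
        recoveredAttempt.finish(State.FAILED);

        return log.lastReported();
    }

    public static void main(String[] args) {
        System.out.println("events suppressed for recovered attempts: " + replayScenario(false));
        System.out.println("events emitted for recovered attempts:    " + replayScenario(true));
    }
}
```

In this model the first call reports SUCCEEDED even though the attempt ended 
up FAILED, which is the stale UI state described above; the second call 
reports FAILED.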

Pinging [~zjffdu] to see if the patch makes sense.

> Recovered TA, whose commit cannot be recovered, should move to killed state
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-2958
>                 URL: https://issues.apache.org/jira/browse/TEZ-2958
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.7.0
>            Reporter: Bikas Saha
>         Attachments: TEZ-2958.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
