[
https://issues.apache.org/jira/browse/TEZ-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated TEZ-2958:
----------------------------
Attachment: TEZ-2958.001.patch
We recently ran into this in a recovery scenario, and it was very confusing to
users. Attaching a patch that should emit a new task attempt finished event to
override the previous successful event with a killed event when we can't
recover the task output.
I believe this will also fix another bug. Before this patch, task attempts
that have been recovered never emit another task finished event because
recoverData != null. Therefore I believe the following scenario can happen:
- Task attempt succeeds
- AM crashes
- Task attempt is recovered in the SUCCEEDED state with recoverData != null
- Task attempt is retroactively failed due to fetch failures
- Task attempt will _not_ emit an unsuccessful completion event because
recoverData != null, so it will remain in the UI as succeeded. The task will
end up with multiple successful attempts.
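To make the scenario above concrete, here is a minimal sketch of the guard as
I understand it. This is not the Tez code and not the patch: the class, the
emitWhenRecovered flag, and the history list are hypothetical names used only
for illustration. Only the recoverData != null check mirrors the behavior
described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a task attempt; not Tez's real classes. */
final class RecoveredAttemptSketch {
    enum State { SUCCEEDED, FAILED, KILLED }

    static final class Attempt {
        // Non-null when the attempt was restored after an AM crash (assumption).
        final Object recoverData;
        // false models the pre-patch guard, true models the patched behavior.
        final boolean emitWhenRecovered;
        // Finished events as the UI would see them; starts with the original success.
        final List<State> history = new ArrayList<>();

        Attempt(Object recoverData, boolean emitWhenRecovered) {
            this.recoverData = recoverData;
            this.emitWhenRecovered = emitWhenRecovered;
            history.add(State.SUCCEEDED);
        }

        /** Moves the attempt to a terminal state and emits an event if allowed. */
        void finish(State newState) {
            if (recoverData == null || emitWhenRecovered) {
                history.add(newState);
            }
            // Otherwise the event is silently dropped and the UI keeps the old state.
        }

        /** The last finished event is what the UI ends up displaying. */
        State shownInUi() {
            return history.get(history.size() - 1);
        }
    }

    public static void main(String[] args) {
        Attempt prePatch = new Attempt(new Object(), false);
        prePatch.finish(State.FAILED);   // retroactive failure from fetch failures
        System.out.println("pre-patch UI state: " + prePatch.shownInUi());

        Attempt patched = new Attempt(new Object(), true);
        patched.finish(State.KILLED);    // output could not be recovered
        System.out.println("patched UI state:   " + patched.shownInUi());
    }
}
```

In this model the pre-patch attempt keeps showing as succeeded after the
retroactive failure, while the patched attempt's later killed event overrides
the earlier success.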
Pinging [~zjffdu] to see if the patch makes sense.
> Recovered TA, whose commit cannot be recovered, should move to killed state
> ---------------------------------------------------------------------------
>
> Key: TEZ-2958
> URL: https://issues.apache.org/jira/browse/TEZ-2958
> Project: Apache Tez
> Issue Type: Sub-task
> Affects Versions: 0.7.0
> Reporter: Bikas Saha
> Attachments: TEZ-2958.001.patch
>
>