[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023204#comment-15023204
]
Bikas Saha commented on TEZ-2581:
---------------------------------
bq. It is not possible to move from RUNNING to FAILED, but still possible to
move from RUNNING to RUNNING
How do we get from RUNNING to RUNNING? If I understand correctly, we go into the
running state from scheduled when 1) normal execution schedules the first
attempt, or 2) recovery of the commit fails and so we schedule a new attempt. So
we will not be in a situation where the commit failed to be recovered in the
running state, since the attempt_succeeded event will only come from a real
running task. Right? If yes, then this is probably the last thing to fix in the
patch :)
bq. Not sure what the failed state means. There's one field to track the
finished state of the Task.
Let's say in AM1 the first task failed. It wrote the failure to the recovery
log. Then, because of this, the DAG failed, and before it moves to the failed
state (and writes the summary log) it has to kill the running tasks. While
killing the running tasks, AM1 dies. Now there is no DAG-failed summary event,
but the non-summary log has the task state saved as failed. This is the
scenario.
So what the patch is saying is that we don't have a case where the task fails
before writing the started event. So if we have not seen the started event,
then the task can only be killed. Otherwise, if it sees the started event on
recovery, then it will go to scheduled and handle the attempt_failed event from
the recovered TA. That seems fine. I was over-thinking in my last comment :P
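To make sure I am reading the rule correctly, here is a rough sketch of it. All
names (RecoveredTaskState, TaskRecoverySketch, the "TASK_STARTED" string) are
made up for illustration and are not the classes or events in the patch. It
only covers a task whose finish was recorded in the recovery log.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names; not the actual Tez classes or recovery events.
enum RecoveredTaskState { KILLED, SCHEDULED }

final class TaskRecoverySketch {
    // Pick the recovered state from the event names found in the recovery log.
    static RecoveredTaskState recover(List<String> loggedEvents) {
        boolean sawStarted = loggedEvents.contains("TASK_STARTED");
        if (!sawStarted) {
            // A task never fails before writing its started event,
            // so without that event it can only have been killed.
            return RecoveredTaskState.KILLED;
        }
        // Started event seen: go back to SCHEDULED and let the recovered
        // attempt's event (for example attempt_failed) drive the next step.
        return RecoveredTaskState.SCHEDULED;
    }
}
```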
bq. I mean TA_DONE event for speculative TA may be sent before the KILL EVENT.
Then the kill will transition the TA from succeeded to killed. If the kill
reaches first, then the done will be ignored in the killed state.
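A toy model of the two orderings, assuming only the transitions described
above. The names (AttemptState, AttemptEvent, SpeculativeAttemptSketch) are
hypothetical and this is not the real task attempt state machine. Either way
the speculative TA ends up killed.

```java
// Hypothetical names; a toy model, not the Tez task attempt state machine.
enum AttemptState { RUNNING, SUCCEEDED, KILLED }
enum AttemptEvent { TA_DONE, TA_KILL }

final class SpeculativeAttemptSketch {
    static AttemptState next(AttemptState current, AttemptEvent event) {
        switch (current) {
            case RUNNING:
                // Whichever event arrives first decides the intermediate state.
                return event == AttemptEvent.TA_DONE
                        ? AttemptState.SUCCEEDED
                        : AttemptState.KILLED;
            case SUCCEEDED:
                // A kill arriving after the done moves succeeded to killed.
                return event == AttemptEvent.TA_KILL
                        ? AttemptState.KILLED
                        : AttemptState.SUCCEEDED;
            default:
                // KILLED: a late done (or a repeated kill) is ignored.
                return AttemptState.KILLED;
        }
    }
}
```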
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch,
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-12.patch, TEZ-2581-WIP-13.patch,
> TEZ-2581-WIP-14.patch, TEZ-2581-WIP-15.patch, TEZ-2581-WIP-2.patch,
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch,
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch,
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf,
> TezRecoveryRedesignV1.1.pdf
>
>