[ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023204#comment-15023204
 ] 

Bikas Saha commented on TEZ-2581:
---------------------------------

bq. It is not possible to move from RUNNING to FAILED, but still possible to 
move from RUNNING to RUNNING
How do we get from RUNNING to RUNNING? If I understand correctly, we go into the 
running state from scheduled in two cases: 1) normal execution, to schedule the 
first attempt, and 2) recovery of the commit fails, so we schedule a new 
attempt. So we will not be in a situation where the commit failed to be 
recovered while in the running state, since the attempt_succeeded event will 
only come from a real running task. Right? If yes, then this is probably the 
last thing to fix in the patch :)

bq. Not sure what the failed state means. There's one field to track the 
finished state of the Task.
Let's say in AM1, first a task failed. It wrote the failure to the recovery log. 
Then, because of this, the DAG failed, and before it moves to the failed state 
(and writes the summary log) it has to kill running tasks. While killing running 
tasks, AM1 dies. Now there is no DAG-failed summary event, but the non-summary 
log has the task state saved as failed. This is the scenario.
So what the patch is saying is that we don't have a case where the task fails 
before writing the started event. So if we have not seen the started event, 
then the task can only be killed. Otherwise, if it sees the started event on 
recovery, then it will go to scheduled and handle the attempt_failed event from 
the recovered TA. That seems fine. I was over-thinking in my last comment :P
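
To make the rule above concrete, here is a minimal sketch of the recovery 
decision as I read it. All names (RecoveredTaskState, LoggedEvent, 
TaskRecoverySketch) are hypothetical and are not the classes in the patch; the 
only assumption carried over is that a task never fails before its started 
event is logged.

```java
import java.util.List;

// Hypothetical names for illustration only; not the actual Tez recovery classes.
enum RecoveredTaskState { KILLED, SCHEDULED }

enum LoggedEvent { TASK_STARTED, ATTEMPT_FAILED, ATTEMPT_SUCCEEDED }

final class TaskRecoverySketch {
    /**
     * Decide where a task lands on recovery, given the events found for it
     * in the recovery log.
     */
    static RecoveredTaskState recover(List<LoggedEvent> events) {
        // Assumption from the discussion: a task cannot fail before its
        // started event is written, so without that event the only
        // terminal outcome it can have had is being killed.
        if (!events.contains(LoggedEvent.TASK_STARTED)) {
            return RecoveredTaskState.KILLED;
        }
        // With a started event, the task goes back to scheduled and then
        // handles whatever the recovered attempt reports (for example
        // attempt_failed).
        return RecoveredTaskState.SCHEDULED;
    }
}
```

In the AM1 scenario, the log would contain the started event plus the attempt 
failure, so the task would come back as scheduled and then process the failure.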

bq. I mean TA_DONE event for speculative TA may be sent before the KILL EVENT.
Then the kill will transition the TA from succeeded to killed. If the kill 
reaches it first, then the TA_DONE event will be ignored in the killed state.
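
A rough sketch of why the ordering does not matter for a speculative attempt, 
assuming a simplified three-state attempt state machine. The enum and class 
names are made up for illustration, and the real state machine has more states 
and events than this.

```java
// Hypothetical, simplified attempt state machine; not the actual Tez one.
enum AttemptState { RUNNING, SUCCEEDED, KILLED }

enum AttemptEvent { TA_DONE, TA_KILL }

final class AttemptStateSketch {
    /** Return the state an attempt moves to when it receives an event. */
    static AttemptState next(AttemptState state, AttemptEvent event) {
        switch (state) {
            case RUNNING:
                return event == AttemptEvent.TA_DONE
                        ? AttemptState.SUCCEEDED
                        : AttemptState.KILLED;
            case SUCCEEDED:
                // A kill that arrives after TA_DONE still moves the
                // attempt from succeeded to killed.
                return event == AttemptEvent.TA_KILL
                        ? AttemptState.KILLED
                        : AttemptState.SUCCEEDED;
            default:
                // KILLED: a late TA_DONE (or a repeated kill) is ignored.
                return AttemptState.KILLED;
        }
    }
}
```

Either order of TA_DONE and TA_KILL ends in KILLED, which is the behaviour 
described above.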

> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch, 
> TEZ-2581-WIP-11.patch, TEZ-2581-WIP-12.patch, TEZ-2581-WIP-13.patch, 
> TEZ-2581-WIP-14.patch, TEZ-2581-WIP-15.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, 
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
