[ 
https://issues.apache.org/jira/browse/TEZ-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200411#comment-14200411
 ] 

Hitesh Shah commented on TEZ-1734:
----------------------------------

bq. Because I think it would be possible for a vertex to go to FAILED from NEW with 
recovered events not empty (get RootInputFormation from InputInitializer, and 
then fail before being inited); otherwise 
TestVertexRecovery.testRecovery_RecoveringFromNew2Failed will fail.

Does this mean that in certain scenarios, such as FAILED and KILLED, we should 
ignore recovered events? What about other states such as NEW? 

bq. The reason is that we cannot move a vertex to running before its parents 
move to running. So in the patch I check whether recoveryStartEventSeen is 
true; if it is, that means the vertex was started and its parents must also 
have started, so in this case we can move the vertex to running and recover its tasks.

The vertex should only move to RUNNING if recoveryStartEventSeen is set to true 
(and the parent vertices have recovered). I think this case may already be 
handled in the recovery transition for non-root vertices, where the parent 
vertex states are checked. The invalid state transition is interesting: did 
that happen only in a unit test? For this to be reproducible in a real-world 
scenario, the parent vertex and child vertex would both have to be in a running 
state when the first AM got killed. In the second attempt, assuming the child 
vertex recovered to running, the above would only occur if the parent vertex 
somehow recovered to new/inited instead of running (and later moved to running), 
or if it sent a source-vertex-started event to the child vertex a second time. 
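
To make the condition above concrete, here is a minimal, self-contained sketch of the guard being discussed. It is not the Tez implementation or the attached patch: the class VertexRecoverySketch, the State enum, and both method names are hypothetical, and "parent has recovered" is simplified to "parent is in RUNNING". Only recoveryStartEventSeen and taskNum are names taken from this thread.

```java
import java.util.List;

// Illustrative only: a simplified model of the recovery guard discussed in
// this comment. None of these types exist in Tez under these names.
final class VertexRecoverySketch {

    // Only the states mentioned in this thread are modelled (assumption).
    enum State { NEW, INITED, RUNNING, FAILED, KILLED }

    // A vertex may resume RUNNING only if its start event was seen in the
    // recovery log and every parent vertex is already RUNNING. A root vertex
    // has no parents, so the parent check passes trivially for it.
    static boolean canMoveToRunning(boolean recoveryStartEventSeen,
                                    List<State> parentStates) {
        if (!recoveryStartEventSeen) {
            return false;
        }
        for (State parent : parentStates) {
            if (parent != State.RUNNING) {
                return false;
            }
        }
        return true;
    }

    // Per the issue description, taskNum may be -1 when a vertex goes from
    // NEW to FAILED/KILLED; in that case there are no tasks to recover.
    static boolean shouldRecoverTasks(int taskNum) {
        return taskNum >= 0;
    }
}
```

Under these assumptions, a child whose start event was seen but whose parent recovered only to INITED would not be moved to RUNNING, which is the situation the question about the invalid transition is probing.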






> Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED
> -------------------------------------------------------------------
>
>                 Key: TEZ-1734
>                 URL: https://issues.apache.org/jira/browse/TEZ-1734
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.1
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1734-2.patch, TEZ-1734.patch
>
>
> When a vertex is recovered from NEW to FAILED/KILLED, its taskNum may be -1; in 
> this case, we don't need to recover its tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
