[
https://issues.apache.org/jira/browse/TEZ-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200411#comment-14200411
]
Hitesh Shah commented on TEZ-1734:
----------------------------------
bq. Because I think it is possible for a vertex to go from NEW to FAILED with
recovered events not empty (it gets the RootInputFormation from the
InputInitializer and then fails before it is inited); otherwise
TestVertexRecovery.testRecovery_RecoveringFromNew2Failed will fail.
Does this mean that in certain scenarios, such as FAILED and KILLED, we should
ignore recovered events? What about other states, such as NEW?
bq. The reason is that we cannot move a vertex to running before its parents
move to running. So in the patch I check whether recoveryStartEventSeen is
true. If it is, the vertex was started and its parents must also have
started; in this case we can move the vertex to running and recover its tasks.
The vertex should only move to RUNNING if recoveryStartEventSeen is set to true
(and the parent vertices have recovered). I think this case may already be
handled in the recovery transition for non-root vertices, where the parent
vertex states are checked. The invalid state transition is interesting: did
that happen only in a unit test? For this to be reproducible in a real-world
scenario, the parent vertex and child vertex would both have to be in a running state
when the first AM got killed. In the second attempt, assuming the child vertex
recovered to running, the above would only occur if the parent vertex somehow
recovered to new/inited instead of running (and later moved to running),
or if it sent a source vertex started event to the child vertex a second time.
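
To make the condition above concrete, here is a minimal sketch of the guard
being discussed. It is not the code in the patch or in Tez itself: the class,
enum, and method names are hypothetical, only recoveryStartEventSeen comes
from this thread, and falling back to INITED when the guard fails is an
assumption made purely for illustration.

```java
// Hypothetical sketch: apart from the recoveryStartEventSeen flag mentioned
// in this thread, none of these names are taken from the Tez code base.
final class VertexRecoverySketch {

    enum State { NEW, INITED, RUNNING, FAILED, KILLED }

    // Assumption: -1 marks a task count that was never determined,
    // e.g. a vertex that failed before it was inited.
    static final int UNKNOWN_TASK_COUNT = -1;

    /**
     * Picks the state to recover to. A vertex recorded as RUNNING is only
     * moved back to RUNNING if its start event was seen and its parents
     * have started; otherwise it is held in INITED (an assumption).
     */
    static State targetState(State recorded,
                             boolean recoveryStartEventSeen,
                             boolean parentsStarted) {
        if (recorded != State.RUNNING) {
            // NEW, INITED, FAILED and KILLED are returned unchanged.
            return recorded;
        }
        return (recoveryStartEventSeen && parentsStarted)
                ? State.RUNNING
                : State.INITED;
    }

    /** Task recovery is skipped when the task count was never set. */
    static boolean shouldRecoverTasks(int numTasks) {
        return numTasks != UNKNOWN_TASK_COUNT;
    }

    public static void main(String[] args) {
        // Child recorded as RUNNING, start event seen, parents started.
        System.out.println(targetState(State.RUNNING, true, true));
        // Same child, but the parents have not started yet.
        System.out.println(targetState(State.RUNNING, true, false));
        // Vertex that went from NEW to FAILED with taskNum still -1.
        System.out.println(shouldRecoverTasks(UNKNOWN_TASK_COUNT));
    }
}
```

With a guard of this shape, a second source vertex started event arriving at
a child that is already RUNNING would still need to be handled separately by
the state machine.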
> Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED
> -------------------------------------------------------------------
>
> Key: TEZ-1734
> URL: https://issues.apache.org/jira/browse/TEZ-1734
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.1
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-1734-2.patch, TEZ-1734.patch
>
>
> When a vertex is recovered from NEW to FAILED/KILLED, its taskNum may be -1.
> In this case, we don't need to recover its tasks.