[
https://issues.apache.org/jira/browse/TEZ-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200151#comment-14200151
]
Jeff Zhang edited comment on TEZ-1734 at 11/6/14 2:07 PM:
----------------------------------------------------------
[~hitesh]
bq. Could you provide more details on why this is removed?
Because I think it is possible for a vertex to go from NEW to FAILED with a
non-empty list of recovered events (it gets RootInputFormation from the
InputInitializer and then fails before it is inited); otherwise
TestVertexRecovery.testRecovery_RecoveringFromNew2Failed will fail.
bq. Bikas's test case issue.
The error message is
{code}
13:49:36,749 - Thread(AsyncDispatcher event handler) - (VertexImpl.java:1532) -
Can't handle Invalid event V_SOURCE_VERTEX_STARTED on vertex vertex3 with
vertexId vertex_1415252975653_0001_1_02 at current state RUNNING
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
V_SOURCE_VERTEX_STARTED at RUNNING
{code}
The reason is that we cannot move a vertex to RUNNING before its parents move
to RUNNING. So in the patch I check whether recoveryStartEventSeen is true. If
it is, the vertex was started, so its parents must also have started; in this
case we can move the vertex to RUNNING and recover its tasks.
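As a rough illustration of the check described above (this is not the code in the patch): the class and method names below are hypothetical and are not part of the Tez VertexImpl API. Only the flag name recoveryStartEventSeen and the taskNum value of -1 are taken from this issue.

```java
// Hypothetical sketch; none of these types exist in Tez.
class VertexRecoverySketch {
    /** Value this issue reports for a vertex whose task count was never set. */
    static final int UNSET_TASK_NUM = -1;

    /** Assumed snapshot of what recovery replayed for one vertex. */
    static final class Recovered {
        final boolean recoveryStartEventSeen; // flag named in the comment above
        final int taskNum;                    // -1 if the vertex never got that far

        Recovered(boolean recoveryStartEventSeen, int taskNum) {
            this.recoveryStartEventSeen = recoveryStartEventSeen;
            this.taskNum = taskNum;
        }
    }

    /** A recovered start event implies the parents had started as well. */
    static boolean canMoveToRunning(Recovered v) {
        return v.recoveryStartEventSeen;
    }

    /** With taskNum == -1 there are no tasks to recover. */
    static boolean shouldRecoverTasks(Recovered v) {
        return v.taskNum != UNSET_TASK_NUM;
    }

    public static void main(String[] args) {
        Recovered started = new Recovered(true, 4);
        Recovered failedFromNew = new Recovered(false, UNSET_TASK_NUM);
        System.out.println(canMoveToRunning(started));         // true
        System.out.println(canMoveToRunning(failedFromNew));   // false
        System.out.println(shouldRecoverTasks(failedFromNew)); // false
    }
}
```

What a vertex does when the start event was not recovered is left out on purpose, since the comment only describes the case where the flag is true.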
BTW, the recovery process still looks complicated to me; I plan to do more
refactoring to make it clean and easy to maintain.
was (Author: zjffdu):
[~hitesh]
bq. Could you provide more details on why this is removed?
Because I think it is possible for a vertex to go from NEW to FAILED with a
non-empty list of recovered events (it gets RootInputFormation from the
InputInitializer and then fails before it is inited); otherwise
TestVertexRecovery.testRecovery_RecoveringFromNew2Failed will fail.
bq. Bikas's test case issue.
The reason is that we cannot move a vertex to RUNNING before its parents move
to RUNNING. So in the patch I check whether recoveryStartEventSeen is true. If
it is, the vertex was started, so its parents must also have started; in this
case we can move the vertex to RUNNING and recover its tasks.
BTW, the recovery process still looks complicated to me; I plan to do more
refactoring to make it clean and easy to maintain.
> Vertex's taskNum may be -1 when recovered from NEW to FAILED/KILLED
> -------------------------------------------------------------------
>
> Key: TEZ-1734
> URL: https://issues.apache.org/jira/browse/TEZ-1734
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.1
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-1734-2.patch, TEZ-1734.patch
>
>
> When a vertex is recovered from NEW to FAILED/KILLED, its taskNum may be -1;
> in this case, we don't need to recover its tasks.