[ https://issues.apache.org/jira/browse/TEZ-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209292#comment-14209292 ]
Jeff Zhang commented on TEZ-1772:
---------------------------------
bq. Is there a reason why numTasks is not written to both summary and
non-summary?
No special reason; I just wanted to keep the change small.
bq. A vertexFinished event in recovery log is only expected to be seen after
all tasks/task attempts completions are seen. A summary event is seen out of
order and therefore when trying to recover a completed vertex, using just a
summary event will only give partial info. The summary event is mainly to catch
commit in progress errors.
Then, in the case where the summary VertexFinished event is seen but the
non-summary VertexFinished event is not, should we recover all of its tasks to
their desired states rather than recovering the vertex to failed, as in the
above code?
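To make the two options concrete, here is a minimal Java sketch of the decision point. All names in it (VertexRecoverySketch, Outcome, decide, the boolean flags) are hypothetical and for illustration only; this is not the actual VertexImpl recovery code, and the flag recoverTasksOnPartialInfo simply stands in for whichever policy we settle on.

```java
public class VertexRecoverySketch {

    // Hypothetical outcomes for illustration; these are NOT Tez's real vertex states.
    enum Outcome {
        RESTORE_AS_FINISHED, // full information is available in the recovery log
        RECOVER_TASKS,       // proposed: recover each task to its desired state
        FAIL_VERTEX,         // current behaviour described above: recover to failed
        RECOVER_NORMALLY     // no VertexFinished event of either kind was seen
    }

    static Outcome decide(boolean summaryFinishedSeen,
                          boolean nonSummaryFinishedSeen,
                          boolean recoverTasksOnPartialInfo) {
        // The non-summary VertexFinished event is only expected after all
        // task/task attempt completions, so seeing it implies complete info.
        if (nonSummaryFinishedSeen) {
            return Outcome.RESTORE_AS_FINISHED;
        }
        // Only the out-of-order summary event was seen: partial info.
        // This is the case the question is about.
        if (summaryFinishedSeen) {
            return recoverTasksOnPartialInfo ? Outcome.RECOVER_TASKS
                                             : Outcome.FAIL_VERTEX;
        }
        return Outcome.RECOVER_NORMALLY;
    }

    public static void main(String[] args) {
        // Summary seen, non-summary missing, under each policy.
        System.out.println(decide(true, false, false));
        System.out.println(decide(true, false, true));
    }
}
```

The only branch in dispute is the middle one; the other two branches are the same under either policy.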
> Failing tests post TEZ-1737
> ---------------------------
>
> Key: TEZ-1772
> URL: https://issues.apache.org/jira/browse/TEZ-1772
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Priority: Blocker
> Attachments: TEZ-1772.patch
>
>
> org.apache.tez.test.TestAMRecovery.testVertexCompletelyFinished_One2One
> org.apache.tez.test.TestAMRecovery.testVertexCompletelyFinished_Broadcast
> org.apache.tez.test.TestDAGRecovery.testBasicRecovery
> {code}
> 2014-11-13 08:30:58,720 ERROR [AsyncDispatcher event handler] impl.VertexImpl: Exception in VertexManager, vertex=vertex_1415838634393_0001_1_01 [v2]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: org.apache.tez.dag.api.TezUncheckedException: Managed task number must equal 1-1 source task number, oneToOneSrcTaskCount =0,numManagedTasks=2
> at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:368)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.recoveryCodeSimulatingStart(VertexImpl.java:2417)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.access$9(VertexImpl.java:2416)
> at org.apache.tez.dag.app.dag.impl.VertexImpl$RecoverTransition.transition(VertexImpl.java:2721)
> at org.apache.tez.dag.app.dag.impl.VertexImpl$RecoverTransition.transition(VertexImpl.java:1)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1526)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
> at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1741)
> at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tez.dag.api.TezUncheckedException: Vertex=v2Managed task number must equal 1-1 source task number, oneToOneSrcTaskCount =0,numManagedTasks=2
> at org.apache.tez.dag.library.vertexmanager.InputReadyVertexManager.onVertexStarted(InputReadyVertexManager.java:114)
> at org.apache.tez.test.TestAMRecovery$ControlledInputReadyVertexManager.onVertexStarted(TestAMRecovery.java:520)
> at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:365)
> ... 16 more
> {code}