[
https://issues.apache.org/jira/browse/TEZ-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209937#comment-14209937
]
Hitesh Shah commented on TEZ-1772:
----------------------------------
bq. But even if the non-summary VertexFinishedEvent is seen, its
VertexRecoverableEventsGeneratedEvent may still be lost. I think there's no
guarantee that VertexRecoverableEventsGeneratedEvent is logged before
VertexFinishedEvent.
The expectation was that all tasks are completed before a vertex has finished.
Also, a TaskFinishedEvent is only seen after all of its data movement events
are generated and therefore logged.
The handling is for the general case, where a lot of data movement events are
generated and the commit starts and then ends. In a scenario where the commit
starts but does not end, the summary log helps catch the problem. In a scenario
where the commit finished successfully, the AM could still have crashed before
all data movement events were stored to recovery. In that case we cannot do
anything: the commit has already been done, but we have no idea what was lost.
The crux of the answer to your question is that a committer cannot be invoked
twice.
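To make the three commit scenarios above concrete, here is a rough sketch of
how they could be told apart from the summary log. It is not the actual Tez
recovery code: the class, enum, and method names are hypothetical, and the two
booleans stand in for "a commit-started entry was seen" and "a commit-finished
entry was seen" in the summary log.

```java
// Hypothetical sketch; none of these names exist in Tez.
class CommitStateSketch {

    // The three commit scenarios described in the comment.
    enum CommitState { NOT_STARTED, STARTED_NOT_FINISHED, FINISHED }

    // Derive the commit state from what was seen in the summary log.
    static CommitState fromSummary(boolean commitStartedSeen, boolean commitFinishedSeen) {
        if (commitFinishedSeen) {
            return CommitState.FINISHED;
        }
        if (commitStartedSeen) {
            return CommitState.STARTED_NOT_FINISHED;
        }
        return CommitState.NOT_STARTED;
    }

    // A committer cannot be invoked twice, so it may only be invoked
    // during recovery if no commit was ever started.
    static boolean mayInvokeCommitter(CommitState state) {
        return state == CommitState.NOT_STARTED;
    }

    public static void main(String[] args) {
        CommitState state = fromSummary(true, false);
        System.out.println(state + " mayInvokeCommitter=" + mayInvokeCommitter(state));
    }
}
```

Note that the FINISHED state says nothing about whether all data movement
events reached the recovery log; that is exactly the part that cannot be
determined after the fact.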
Agreed that VertexRecoverableEventsGeneratedEvent is a different problem. I
believe that if a VertexRecoverableEventsGeneratedEvent is not seen before a
VertexFinished is seen, there needs to be some additional handling for that
scenario too. If a VertexRecoverableEventsGeneratedEvent is always guaranteed
to be generated for a vertex and it is not seen, then that is a potentially
non-recoverable case when the vertex itself was seen to have been completed.
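The additional handling could start from a check along these lines. This is
only a sketch under assumptions: the class and method are hypothetical, the
recovery log for one vertex is modeled as an ordered list of event type names,
and it assumes a VertexRecoverableEventsGeneratedEvent is always generated for
the vertex (which still needs to be confirmed).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; this class is not part of Tez.
class VertexEventOrderSketch {

    // Returns true if a VertexFinishedEvent appears in the ordered log
    // without a VertexRecoverableEventsGeneratedEvent before it, which is
    // the potentially non-recoverable case.
    static boolean finishedWithoutRecoverableEvents(List<String> eventTypes) {
        boolean sawRecoverableEvents = false;
        for (String type : eventTypes) {
            if (type.equals("VertexRecoverableEventsGeneratedEvent")) {
                sawRecoverableEvents = true;
            } else if (type.equals("VertexFinishedEvent")) {
                return !sawRecoverableEvents;
            }
        }
        // The vertex was never seen as finished, so this case does not apply.
        return false;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("VertexStartedEvent", "VertexFinishedEvent");
        System.out.println(finishedWithoutRecoverableEvents(log));
    }
}
```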
> Failing tests post TEZ-1737
> ---------------------------
>
> Key: TEZ-1772
> URL: https://issues.apache.org/jira/browse/TEZ-1772
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Priority: Blocker
> Attachments: TEZ-1772-2.patch, TEZ-1772.patch
>
>
> org.apache.tez.test.TestAMRecovery.testVertexCompletelyFinished_One2One
> org.apache.tez.test.TestAMRecovery.testVertexCompletelyFinished_Broadcast
> org.apache.tez.test.TestDAGRecovery.testBasicRecovery
> {code}
> 2014-11-13 08:30:58,720 ERROR [AsyncDispatcher event handler]
> impl.VertexImpl: Exception in VertexManager,
> vertex=vertex_1415838634393_0001_1_01 [v2]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException:
> org.apache.tez.dag.api.TezUncheckedException: Managed task number must equal
> 1-1 source task number, oneToOneSrcTaskCount =0,numManagedTasks=2
> at
> org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:368)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl.recoveryCodeSimulatingStart(VertexImpl.java:2417)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl.access$9(VertexImpl.java:2416)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl$RecoverTransition.transition(VertexImpl.java:2721)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl$RecoverTransition.transition(VertexImpl.java:1)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
> at
> org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1526)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
> at
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1741)
> at
> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tez.dag.api.TezUncheckedException: Vertex=v2Managed
> task number must equal 1-1 source task number, oneToOneSrcTaskCount
> =0,numManagedTasks=2
> at
> org.apache.tez.dag.library.vertexmanager.InputReadyVertexManager.onVertexStarted(InputReadyVertexManager.java:114)
> at
> org.apache.tez.test.TestAMRecovery$ControlledInputReadyVertexManager.onVertexStarted(TestAMRecovery.java:520)
> at
> org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:365)
> ... 16 more
> {code}