[
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237608#comment-14237608
]
Jeff Zhang commented on TEZ-1019:
---------------------------------
Uploaded a new patch. [~hitesh], please help review it.
* The new patch changes a lot in vertex recovery. I removed the RECOVERING
state and now trigger recovery from the root vertex. A downstream vertex should
be able to start its own recovery automatically from the upstream events, as in
the normal flow. I moved the recovery work into the normal transitions (mainly
InitTransition & StartTransition). I simply treat the recovery events as redo
logs and use them to init and start the vertex.
* I have only made it pass TestAMRecovery and manually tested some examples in
tez-examples. (TestVertexRecovery doesn't pass yet; please just help review
whether this approach works and whether I have missed any cases.)
* Besides this, I have 2 questions about vertex recovery:
** In the existing code, we recover the tasks when the vertex's recovered state
is inited. I am not sure why, so I removed that in the new patch.
** When the vertex's recoveredState is RUNNING, we still check numTasks.
As I understand it, numTasks cannot be 0 when the vertex is RUNNING; otherwise
it would mean that init has not completed.
{code}
assert vertex.tasks.size() == vertex.numTasks;
if (vertex.tasks != null && vertex.numTasks != 0) {
  for (Task task : vertex.tasks.values()) {
    vertex.eventHandler.handle(
        new TaskEventRecoverTask(task.getTaskId()));
  }
  try {
    vertex.recoveryCodeSimulatingStart();
    endState = VertexState.RUNNING;
  } catch (AMUserCodeException e) {
    String msg = "Exception in " + e.getSource() + ", vertex:" +
        vertex.getLogIdentifier();
    LOG.error(msg, e);
    vertex.finished(VertexState.FAILED,
        VertexTerminationCause.AM_USERCODE_FAILURE,
        msg + ", " + ExceptionUtils.getStackTrace(e.getCause()));
    endState = VertexState.FAILED;
  }
} else {
  // why succeeded here
  endState = VertexState.SUCCEEDED;
  vertex.finished(endState);
}
{code}
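To make the "recovery events as redo logs" idea from the first bullet concrete, here is a minimal, self-contained sketch. It is not the patch itself and does not use the real Tez classes: RedoLogRecoverySketch, RecoveryEvent, RecoveryEventType and the toy Vertex are all hypothetical names chosen for illustration. It only shows recovery replaying logged events through the same init/start code path that the normal flow uses, rather than through a separate RECOVERING state.

```java
import java.util.Arrays;
import java.util.List;

public class RedoLogRecoverySketch {

    // Simplified vertex states for the sketch; there is no RECOVERING state.
    enum VertexState { NEW, INITED, RUNNING }

    // Hypothetical kinds of entries in the recovery log (not the real Tez events).
    enum RecoveryEventType { VERTEX_INITIALIZED, VERTEX_STARTED }

    // Hypothetical recovery-log entry.
    static final class RecoveryEvent {
        final RecoveryEventType type;
        final int numTasks; // only meaningful for VERTEX_INITIALIZED

        RecoveryEvent(RecoveryEventType type, int numTasks) {
            this.type = type;
            this.numTasks = numTasks;
        }
    }

    // Toy vertex: init() and start() stand in for InitTransition and StartTransition.
    static final class Vertex {
        VertexState state = VertexState.NEW;
        int numTasks = -1;

        // Same code path for the normal flow and for recovery.
        void init(int numTasks) {
            if (state != VertexState.NEW) {
                throw new IllegalStateException("init from " + state);
            }
            this.numTasks = numTasks;
            state = VertexState.INITED;
        }

        // Same code path for the normal flow and for recovery.
        void start() {
            if (state != VertexState.INITED) {
                throw new IllegalStateException("start from " + state);
            }
            state = VertexState.RUNNING;
        }

        // Recovery: replay the logged events, in order, through the normal transitions.
        void recover(List<RecoveryEvent> redoLog) {
            for (RecoveryEvent event : redoLog) {
                switch (event.type) {
                    case VERTEX_INITIALIZED:
                        init(event.numTasks);
                        break;
                    case VERTEX_STARTED:
                        start();
                        break;
                    default:
                        throw new IllegalArgumentException("unknown event " + event.type);
                }
            }
        }
    }

    public static void main(String[] args) {
        Vertex vertex = new Vertex();
        vertex.recover(Arrays.asList(
            new RecoveryEvent(RecoveryEventType.VERTEX_INITIALIZED, 3),
            new RecoveryEvent(RecoveryEventType.VERTEX_STARTED, 0)));
        System.out.println("state=" + vertex.state + " numTasks=" + vertex.numTasks);
    }
}
```

In this toy model a vertex can only reach RUNNING after init() has set numTasks, which is the invariant the second question above relies on. Whether the real vertex state machine guarantees the same is exactly what needs review.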
> Re-factor routing of events to use common code path for normal and recovery
> flow.
> ---------------------------------------------------------------------------------
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)