[
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237608#comment-14237608
]
Jeff Zhang edited comment on TEZ-1019 at 12/8/14 8:30 AM:
----------------------------------------------------------
Uploaded a new patch. [~hitesh], please help review the main approach (the unit
tests are not complete yet).
* The new patch changes vertex recovery substantially. I removed the
RECOVERING state and now trigger recovery from the root vertex. A downstream
vertex should be able to start its own recovery automatically from the events
sent by its upstream vertices, as in the normal flow. I moved the recovery work
into the normal transitions (mainly InitTransition and StartTransition). The
recovery events are treated as redo logs and are used to init and start the
vertex.
* So far I have only made TestAMRecovery pass and manually tested some of the
examples in tez-examples. (TestVertexRecovery does not pass yet; please just
review whether this approach works and whether I have missed any cases.)
* Besides this, I have 2 questions about vertex recovery:
** In the existing code, we recover tasks when the vertex's recovered state is
INITED. I am not sure why, so I removed this in the new patch. As I understand
it, a vertex in INITED should have no running tasks, so there is no task to
recover.
** When the vertex's recoveredState is RUNNING, we still check numTasks. As I
understand it, numTasks cannot be 0 when the vertex is RUNNING; otherwise init
would not have completed.
{code}
assert vertex.tasks.size() == vertex.numTasks;
if (vertex.tasks != null && vertex.numTasks != 0) {
  for (Task task : vertex.tasks.values()) {
    vertex.eventHandler.handle(
        new TaskEventRecoverTask(task.getTaskId()));
  }
  try {
    vertex.recoveryCodeSimulatingStart();
    endState = VertexState.RUNNING;
  } catch (AMUserCodeException e) {
    String msg = "Exception in " + e.getSource() + ", vertex:" +
        vertex.getLogIdentifier();
    LOG.error(msg, e);
    vertex.finished(VertexState.FAILED,
        VertexTerminationCause.AM_USERCODE_FAILURE,
        msg + ", " + ExceptionUtils.getStackTrace(e.getCause()));
    endState = VertexState.FAILED;
  }
} else {
  // why succeeded here
  endState = VertexState.SUCCEEDED;
  vertex.finished(endState);
}
{code}
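To illustrate the redo-log idea described in the first bullet, here is a minimal, self-contained sketch. It is not code from the patch: the class `RecoverySketch`, the `LoggedType` enum, the `redoLog` map and the `trace` list are all hypothetical names, and the graph is simplified to a chain in which a vertex is inited and started as soon as its single upstream vertex starts. The point it tries to show is that there is no RECOVERING state: recovery is triggered only at the root, each vertex goes through the same init and start steps as in a fresh run, and logged recovery events are consulted inside those steps.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class RecoverySketch {
    // Hypothetical vertex states; note that there is no RECOVERING state.
    enum State { NEW, INITED, RUNNING }

    // Hypothetical kinds of recovery events written by a previous AM attempt.
    enum LoggedType { VERTEX_INITED, VERTEX_STARTED }

    static final class Vertex {
        final String name;
        final int freshNumTasks; // what a fresh (non-recovered) init would compute
        // Redo log for this vertex: event type -> logged numTasks (assumption).
        final Map<LoggedType, Integer> redoLog = new EnumMap<>(LoggedType.class);
        final List<Vertex> downstream = new ArrayList<>();
        final List<String> trace = new ArrayList<>();
        State state = State.NEW;
        int numTasks;

        Vertex(String name, int freshNumTasks) {
            this.name = name;
            this.freshNumTasks = freshNumTasks;
        }

        // Normal init step: replay the logged result if one exists.
        void init() {
            Integer logged = redoLog.get(LoggedType.VERTEX_INITED);
            if (logged != null) {
                numTasks = logged;
                trace.add("init:replayed");
            } else {
                numTasks = freshNumTasks;
                trace.add("init:fresh");
            }
            state = State.INITED;
        }

        // Normal start step: downstream vertices are driven by this vertex's
        // start, exactly as in a non-recovery run.
        void start() {
            trace.add(redoLog.containsKey(LoggedType.VERTEX_STARTED)
                    ? "start:replayed" : "start:fresh");
            state = State.RUNNING;
            for (Vertex v : downstream) {
                v.init();
                v.start();
            }
        }
    }

    // Recovery is triggered only at the root vertex.
    static void recover(Vertex root) {
        root.init();
        root.start();
    }

    public static void main(String[] args) {
        Vertex map = new Vertex("map", 4);
        Vertex reduce = new Vertex("reduce", 2);
        map.downstream.add(reduce);
        // Only the root vertex has logged events from the previous attempt.
        map.redoLog.put(LoggedType.VERTEX_INITED, 8);
        map.redoLog.put(LoggedType.VERTEX_STARTED, 8);
        recover(map);
        System.out.println(map.name + " " + map.state + " numTasks=" + map.numTasks + " " + map.trace);
        System.out.println(reduce.name + " " + reduce.state + " numTasks=" + reduce.numTasks + " " + reduce.trace);
    }
}
```

In this sketch the root vertex takes its numTasks from the log, while the downstream vertex, which has no logged events, runs the same two steps with freshly computed values. Both end in RUNNING without any recovery-specific state.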
> Re-factor routing of events to use common code path for normal and recovery
> flow.
> ---------------------------------------------------------------------------------
>
> Key: TEZ-1019
> URL: https://issues.apache.org/jira/browse/TEZ-1019
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>