[ 
https://issues.apache.org/jira/browse/TEZ-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237608#comment-14237608
 ] 

Jeff Zhang edited comment on TEZ-1019 at 12/8/14 8:28 AM:
----------------------------------------------------------

Uploaded a new patch. [~hitesh], please help review the main approach (the unit 
tests are not complete yet).


* The new patch changes vertex recovery a lot. I removed the RECOVERING state 
and now trigger recovery from the root vertex. A downstream vertex should be 
able to start its own recovery automatically from upstream events, as in the 
normal flow. I moved the recovery work into the normal transitions (mainly 
InitTransition and StartTransition). I treat the recovery events as redo logs 
and use them to init and start the vertex.
* So far I have only made it pass TestAMRecovery and manually tested some 
examples in tez-examples. (TestVertexRecovery doesn't pass yet; please just 
review whether this approach works and whether I have missed any cases.)
* Besides this, I have 2 questions about vertex recovery:
** In the existing code, we recover tasks when the vertex's recovered state is 
INITED. I'm not sure why, so I removed that in the new patch.
** When the vertex's recoveredState is RUNNING, we still check numTasks. As I 
understand it, numTasks can't be 0 when the vertex is RUNNING; otherwise init 
has not completed.

{code}
          assert vertex.tasks.size() == vertex.numTasks;
          if (vertex.tasks != null && vertex.numTasks != 0) {
            for (Task task : vertex.tasks.values()) {
              vertex.eventHandler.handle(
                  new TaskEventRecoverTask(task.getTaskId()));
            }
            try {
              vertex.recoveryCodeSimulatingStart();
              endState = VertexState.RUNNING;
            } catch (AMUserCodeException e) {
              String msg = "Exception in " + e.getSource() + ", vertex:"
                  + vertex.getLogIdentifier();
              LOG.error(msg, e);
              vertex.finished(VertexState.FAILED,
                  VertexTerminationCause.AM_USERCODE_FAILURE,
                  msg + ", " + ExceptionUtils.getStackTrace(e.getCause()));
              endState = VertexState.FAILED;
            }
          } else {
            // why succeeded here
            endState = VertexState.SUCCEEDED;
            vertex.finished(endState);
          }
{code}
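
To illustrate the redo-log idea from the first bullet, here is a minimal sketch. It is not the patch itself: every name in it (RedoLogRecoverySketch, InitRecord, plannedNumTasks, onUpstreamStarted, recoverFrom) is hypothetical and none of them are Tez classes. It only assumes that init can reuse a recorded result when one exists and that a downstream vertex reacts to an upstream event exactly as it would in a fresh run.

```java
import java.util.ArrayList;
import java.util.List;

class RedoLogRecoverySketch {

  // Simplified vertex states; there is deliberately no RECOVERING state.
  enum State { NEW, INITED, RUNNING }

  // Hypothetical stand-in for a recorded "vertex initialized" recovery event.
  static final class InitRecord {
    final int numTasks;

    InitRecord(int numTasks) {
      this.numTasks = numTasks;
    }
  }

  static final class Vertex {
    final String name;
    final int plannedNumTasks;      // used when nothing was recorded
    final InitRecord recordedInit;  // redo-log entry, or null if none
    final List<Vertex> downstream = new ArrayList<>();
    State state = State.NEW;
    int numTasks;

    Vertex(String name, int plannedNumTasks, InitRecord recordedInit) {
      this.name = name;
      this.plannedNumTasks = plannedNumTasks;
      this.recordedInit = recordedInit;
    }

    // Normal init transition: replay the recorded result if one exists.
    void init() {
      numTasks = (recordedInit != null) ? recordedInit.numTasks : plannedNumTasks;
      state = State.INITED;
    }

    // Normal start transition: notify downstream vertices as in a fresh run.
    void start() {
      state = State.RUNNING;
      for (Vertex v : downstream) {
        v.onUpstreamStarted();
      }
    }

    // A downstream vertex inits and starts itself once, on the upstream event.
    void onUpstreamStarted() {
      if (state == State.NEW) {
        init();
        start();
      }
    }
  }

  // Recovery is triggered only at the root vertex.
  static void recoverFrom(Vertex root) {
    root.init();
    root.start();
  }

  public static void main(String[] args) {
    Vertex root = new Vertex("root", 1, new InitRecord(3));
    Vertex child = new Vertex("child", 2, null);
    root.downstream.add(child);
    recoverFrom(root);
    System.out.println(root.name + " " + root.state + " numTasks=" + root.numTasks);
    System.out.println(child.name + " " + child.state + " numTasks=" + child.numTasks);
  }
}
```

In this sketch the root takes numTasks from its recorded event, while the child has no record and falls back to its planned value, yet both go through the same init() and start() code path.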


was (Author: zjffdu):
Uploaded a new patch. [~hitesh], please help review it.


> Re-factor routing of events to use common code path for normal and recovery 
> flow.
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-1019
>                 URL: https://issues.apache.org/jira/browse/TEZ-1019
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Hitesh Shah
>         Attachments: TEZ-1019-2.patch, Tez-1019.patch
>
>




