[
https://issues.apache.org/jira/browse/TEZ-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555353#comment-14555353
]
Jeff Zhang commented on TEZ-1273:
---------------------------------
bq. Should there be 2 events - RECOVER and RECOVER_FAILED to handle recovery
errors?
The recovery scenario here is a little different from vertex initialization,
because recovery happens in the same thread as the dispatcher, while vertex
initialization runs in a separate thread. After RecoverTransition we already
know whether recovery succeeded, so we can move to the next state directly
rather than sending out RECOVER or RECOVER_FAILED indirectly.
{code}
try {
  appMaster.recover();
  // start the sessionTimeoutChecker after the recover is done to avoid
  // the session timeout during recovering.
  appMaster.startSessiontimeoutChecker();
  return DAGAppMasterState.RUNNING;
} catch (Exception e) {
  LOG.error("Error occurred when trying to recover data from previous attempt."
      + " Shutting down AM", e);
  appMaster.taskSchedulerEventHandler.setShouldUnregisterFlag();
  appMaster.shutdownHandler.shutdown();
  return DAGAppMasterState.ERROR;
}
{code}
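To illustrate the point about returning the next state directly, here is a minimal, self-contained sketch. It is not the Tez implementation: the class name RecoverTransitionSketch, the State enum, and the RecoveryAction interface are hypothetical stand-ins for DAGAppMasterState and appMaster.recover(). It assumes only that recovery runs synchronously on the calling thread, so the transition itself can pick the post-state.

```java
// Hypothetical, simplified model of a transition that chooses its own
// post-state. The names here are illustrative and not taken from Tez.
public class RecoverTransitionSketch {

    // Stand-in for the relevant DAGAppMasterState values.
    enum State { RECOVERING, RUNNING, ERROR }

    // Stand-in for the recovery work; it may throw on failure.
    interface RecoveryAction {
        void recover() throws Exception;
    }

    // Runs recovery on the calling thread and returns the next state
    // directly, so no RECOVER / RECOVER_FAILED follow-up event is needed.
    static State recoverTransition(RecoveryAction action) {
        try {
            action.recover();
            return State.RUNNING;
        } catch (Exception e) {
            // A real AM would also log the error and start shutdown here.
            return State.ERROR;
        }
    }

    public static void main(String[] args) {
        // Successful recovery moves straight to RUNNING.
        System.out.println(recoverTransition(() -> { }));
        // Failed recovery moves straight to ERROR.
        System.out.println(recoverTransition(() -> {
            throw new Exception("simulated recovery failure");
        }));
    }
}
```

Because the outcome is known as soon as the call returns, a separate success or failure event would only add an extra hop through the dispatcher.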
bq. No dag cleanup event handling in failed?
DAG cleanup only happens when the AM can move to IDLE. If the AM moves to
failed, that means we have already started the shutdown, so DAG cleanup is not
necessary. This is just like the non-session mode scenario: in non-session mode
we don't have DAG cleanup, because the AM is shut down after DAG completion.
bq. register and unregister with RM are not states. Should they be?
What would be the purpose of creating new states for registering and
unregistering with the RM?
bq. Which services should be active and non active in the recovering state? e.g
DagClientHandler?
All the services are active in the recovering state. For TEZ-2375, after this
ticket we will know whether the AM is in the recovering state, so we can return
a recovering response to the client.
bq. running remains in running state on events such as internal error and
shutdown - should a new terminating state be introduced?
Makes sense. The terminating state will wait for the current DAG to complete.
Documentation will be added later.
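As a rough sketch of how such a terminating state could behave, assuming it simply waits for the running DAG to finish before the AM stops: all state and event names below (TERMINATING, STOPPED, SHUTDOWN_REQUESTED, and so on) are hypothetical and not taken from the patch.

```java
// Hypothetical transition table for a TERMINATING state; the state and
// event names are illustrative only and do not come from the Tez patch.
public class TerminatingStateSketch {

    enum State { RUNNING, TERMINATING, IDLE, STOPPED }

    enum Event { SHUTDOWN_REQUESTED, INTERNAL_ERROR, DAG_COMPLETED }

    // Returns the state the AM would move to when it receives the event.
    static State next(State current, Event event) {
        switch (current) {
            case RUNNING:
                // Normal completion goes to IDLE; shutdown or an internal
                // error moves to TERMINATING instead of staying in RUNNING.
                return event == Event.DAG_COMPLETED ? State.IDLE : State.TERMINATING;
            case TERMINATING:
                // Keep waiting until the current DAG completes, then stop.
                return event == Event.DAG_COMPLETED ? State.STOPPED : State.TERMINATING;
            default:
                // Other states are left unchanged in this sketch.
                return current;
        }
    }

    public static void main(String[] args) {
        State s = next(State.RUNNING, Event.SHUTDOWN_REQUESTED);
        System.out.println(s);
        s = next(s, Event.DAG_COMPLETED);
        System.out.println(s);
    }
}
```

The separate state makes it explicit that a shutdown has been requested while a DAG is still in flight, which a RUNNING-to-RUNNING self-transition would hide.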
> Refactor DAGAppMaster to state machine based
> --------------------------------------------
>
> Key: TEZ-1273
> URL: https://issues.apache.org/jira/browse/TEZ-1273
> Project: Apache Tez
> Issue Type: Improvement
> Affects Versions: 0.4.0
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: DAGAppMaster_3.pdf, DAGAppMaster_4.pdf,
> TEZ-1273-3.patch, TEZ-1273-4.patch, TEZ-1273-5.patch, TEZ-1273-6.patch,
> TEZ-1273-7.patch, Tez-1273-2.patch, Tez-1273.patch, dag_app_master.pdf,
> dag_app_master2.pdf
>
>
> Almost all our entities (Vertex, Task, etc.) are state machine based and
> written using a formal state machine. DAGAppMaster, however, is not written
> on a formal state machine even though its behavior is state machine based.
> This jira is for refactoring it to be state machine based.