[ https://issues.apache.org/jira/browse/TEZ-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070856#comment-14070856 ]
Hitesh Shah commented on TEZ-1273: ---------------------------------- Initial set of comments: - whitespaces need to be cleaned - lot of lines have trailing whitespaces. - DAGAppMasterEventType - has start_shutdown, final_shutdown and exit. What are the differences? Why are 3 shutdown events needed? Could you add more docs. - DAGAppMasterState - what about session states? Can a session not go into a failed state? You may want to look at TEZ-1117? - DAGAppMasterEventDAGFinished.java - extra license in the same file? - TaskSchedulerEventHandler.java - might be better to move the logic of converting AM state to YARN state into the DagAppMaster class? Does that make more sense? - A better name than TIME_OUT to indicate what the event really means. - State machine questions - there is a transition from idle to idle on dag submitted. Why would that ever be the case? - where is the exit event used? - Is there a reason why dag killed, dag failed, dag succeeded states are needed? It seems like these are just different final states but the actual work to be done should be the same in a transition from dag finished to terminating? - From state recovering, if there is no dag to recover, should it not go to session idle? - what if the client tries to submit a dag when the AM is shutting down or in a non-idle state? - could you clarify which events are sent async and which ones are handled in sync? The mix of the 2 could be confusing and create problems. - how and where is AM relocalization being handled? - initAndStartAppMaster() - it calls Service related functions instead of sending state machine events? Furthermore, what if the handling of the init event fails ? this code still invokes start() regardless of what happened in init(). Could you confirm that you have a unit test for the scenarios where init fails, start fails, etc. {code} //notify the eventhandler of state change if (oldState != getState()) { LOG.info("DAGAppMaster transitioned from " + oldState + " to " + getState()); } {code} - better to also log what event type caused the state change to happen. i.e. DAGAppMaster transitioned from oldState to newState due to eventType. - Spelling mistakes - DAGSUBMIT_TRANSISION - DAG_SUBMITED - submition - RecoverTransition needs fixiing - in non-session state, if no recovery data is found, the dag submitted via the application submission context should be started - in session state, if there is no dag found, it should go into idle state - where is the dag submit timeout timer being set up or reset? - Question on startPreWarmContainers() - error handling if dag submission fails? - for submitDAGToAppMaster() - maybe check for session running instead of checking for session idle - it is safer to check for session running in case the state machine changes to move to other states on a submit failure - internalStart() - what if some services fail to start? - StartShutdownTransition - why is the unregister flag being set? Especially in a case where the AM was asked to reboot? - DAGAppMasterShutdownHook - why is the AM being unregistered? This hook can be invoked in multiple scenarios - OOM, etc. - NonSessionDAGFinishTransition - there is a log message which says that the AM will unregister. However, the RM unregister flag is not being set. It might be good to list out the various test cases that need to be written to ensure that existing behavior is not broken. > Refactor DAGAppMaster to state machine based > -------------------------------------------- > > Key: TEZ-1273 > URL: https://issues.apache.org/jira/browse/TEZ-1273 > Project: Apache Tez > Issue Type: Improvement > Affects Versions: 0.4.0 > Reporter: Jeff Zhang > Assignee: Jeff Zhang > Attachments: Tez-1273-2.patch, Tez-1273.patch, dag_app_master.pdf, > dag_app_master2.pdf > > > Almost all our entities (Vertex, Task etc) are state machine based and > written using a formal state machine. But DAGAppMaster is not written on a > formal state machine even though it has a state machine based behavior. This > jira is for refactoring it into state machine based -- This message was sent by Atlassian JIRA (v6.2#6252)