[
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687286#comment-17687286
]
Mudit Sharma commented on TEZ-4474:
-----------------------------------
[~hitesh] / [~abstractdog] Please review
> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
> Key: TEZ-4474
> URL: https://issues.apache.org/jira/browse/TEZ-4474
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Mudit Sharma
> Priority: Major
> Attachments:
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>
> Summary of the Issue:
> When Tez DAG recovery fails for some reason in the second retry of a Tez AM,
> then in a corner-case scenario the Tez job sets the DAG state to IDLE.
> Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout()
> the Tez AM tries to shut down the DAG, and since recovery failed there are
> no running DAGs.
> If there are no RUNNING DAGs and the state of the DAG is IDLE, then by default
> the AM sets the status to SUCCEEDED, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
>   public void shutdownTezAM(String dagKillmessage) throws TezException {
>     if (!sessionStopped.compareAndSet(false, true)) {
>       // No need to shutdown twice.
>       // Return with a no-op if shutdownTezAM has been invoked earlier.
>       return;
>     }
>     synchronized (this) {
>       this.taskSchedulerManager.setShouldUnregisterFlag();
>       if (currentDAG != null
>           && !currentDAG.isComplete()) {
>         //send a DAG_TERMINATE message
>         LOG.info("Sending a kill event to the current DAG"
>             + ", dagId=" + currentDAG.getID());
>         tryKillDAG(currentDAG, dagKillmessage);
>       } else {
>         LOG.info("No current running DAG, shutting down the AM");
>         if (isSession && !state.equals(DAGAppMasterState.ERROR)) {
>           state = DAGAppMasterState.SUCCEEDED;
>         }
>         shutdownHandler.shutdown();
>       }
>     }
>   }
>
> This can cause problems in dependent systems such as Hive, which will move
> ahead with the other tasks in the pipeline on the assumption that the DAG
> succeeded; this can result in Hive moving empty data.
> As part of this JIRA, we propose a patch to Tez that introduces a config.
> When it is set and the AM shuts down with no currently running DAG, the Tez
> status will always be marked as FAILED instead of SUCCEEDED, provided the
> state at that time is not ERROR.
>
> This is the patch; please review and let us know your thoughts:
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
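
To make the proposed behaviour easier to discuss, here is a minimal, self-contained sketch of the status decision in the "no running DAG" branch. It is not the attached patch and does not use Tez classes: the enum AmState is a simplified stand-in for DAGAppMasterState, and the flag name failOnShutdownWithoutRunningDag is hypothetical (the real config key is defined in the patch). The sketch models only which final state is chosen, assuming the kill path for a running DAG leaves the state unchanged at this point.

```java
public class ShutdownStatusSketch {

    /** Simplified stand-in for DAGAppMasterState (assumption, not the Tez enum). */
    public enum AmState { IDLE, RUNNING, SUCCEEDED, FAILED, ERROR }

    /**
     * Returns the AM state after the shutdown decision.
     *
     * @param isSession                        whether the AM runs in session mode
     * @param state                            AM state when shutdown is requested
     * @param hasRunningDag                    true if a current DAG exists and is not complete
     * @param failOnShutdownWithoutRunningDag  hypothetical flag modelling the proposed config
     */
    public static AmState stateAfterShutdown(boolean isSession,
                                             AmState state,
                                             boolean hasRunningDag,
                                             boolean failOnShutdownWithoutRunningDag) {
        if (hasRunningDag) {
            // Kill path: the DAG is terminated and the state is not decided here.
            return state;
        }
        if (isSession && state != AmState.ERROR) {
            // Existing behaviour: SUCCEEDED. Proposed behaviour with the flag set: FAILED.
            return failOnShutdownWithoutRunningDag ? AmState.FAILED : AmState.SUCCEEDED;
        }
        // ERROR (or a non-session AM) keeps its state.
        return state;
    }

    public static void main(String[] args) {
        // Recovery failed, AM is IDLE, no running DAG.
        System.out.println(stateAfterShutdown(true, AmState.IDLE, false, false));
        System.out.println(stateAfterShutdown(true, AmState.IDLE, false, true));
    }
}
```

With the flag unset, the IDLE case resolves to SUCCEEDED as it does today; with it set, the same case resolves to FAILED, while an AM already in ERROR is left untouched in both modes.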