[
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687810#comment-17687810
]
Mudit Sharma commented on TEZ-4474:
-----------------------------------
[~srahman] , there are several checks there
It should be in Session + No DAG submitted + Config is set
Basically for some scenarios, marking it as SUCCEEDED also is not correct. If
by any chance, due to timeout etc shutdown is called, we should ideally mark it
as FAILED and retry once more but still since I was not sure of all the
scenarios, so I am proposing it to be behind a config
> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
> Key: TEZ-4474
> URL: https://issues.apache.org/jira/browse/TEZ-4474
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2
> Reporter: Mudit Sharma
> Priority: Critical
> Attachments:
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Summary of the Issue:
> When Tez DAG recovery is failed because of some reason in the second retry of
> any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE
> Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(),
> Tez AM will try to shutdown the DAG, and since recovery was failed so there
> will not be any running DAGs
> If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM
> sets the status to SUCCEEDED, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> {code:java}
> public void shutdownTezAM(String dagKillmessage) throws TezException {
> if (!sessionStopped.compareAndSet(false, true))
> { // No need to shutdown twice. // Return with a no-op if
> shutdownTezAM has been invoked earlier. return; }
> synchronized (this) {
> this.taskSchedulerManager.setShouldUnregisterFlag();
> if (currentDAG != null
> && !currentDAG.isComplete())
> { //send a DAG_TERMINATE message LOG.info("Sending a kill
> event to the current DAG" + ", dagId=" + currentDAG.getID());
> tryKillDAG(currentDAG, dagKillmessage); }
> else {
> LOG.info("No current running DAG, shutting down the AM");
> if (isSession && !state.equals(DAGAppMasterState.ERROR))
> { state = DAGAppMasterState.SUCCEEDED; }
> shutdownHandler.shutdown();
> }
> }
> }
> {code}
>
> This can result in issues in dependent systems like Hive which will move
> ahead with other tasks in pipeline assuming the DAG was success, this can
> result in moving empty data in Hive
> As part of this JIRA, we are proposing to introduce a patch in TEZ, which
> introduces a config, which when set, then in case of shutdown with no current
> running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED
> in case DAG state at that time was not ERROR
>
> PR: [https://github.com/apache/tez/pull/266]
> This is the patch, please review and let us know about your thoughts:
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)