[
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mudit Sharma updated TEZ-4474:
------------------------------
Affects Version/s: 0.10.2
0.10.1
0.10.0
0.9.2
> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
> Key: TEZ-4474
> URL: https://issues.apache.org/jira/browse/TEZ-4474
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2
> Reporter: Mudit Sharma
> Priority: Critical
> Attachments:
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>
> Summary of the Issue:
> When Tez DAG recovery is failed because of some reason in the second retry of
> any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE
> Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(),
> Tez AM will try to shutdown the DAG, and since recovery was failed so there
> will not be any running DAGs
> If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM
> sets the status to SUCCEEDED, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> public void shutdownTezAM(String dagKillmessage) throws TezException {
> if (!sessionStopped.compareAndSet(false, true))
> { // No need to shutdown twice. // Return with a no-op if
> shutdownTezAM has been invoked earlier. return; }
> synchronized (this) {
> this.taskSchedulerManager.setShouldUnregisterFlag();
> if (currentDAG != null
> && !currentDAG.isComplete())
> { //send a DAG_TERMINATE message LOG.info("Sending a kill
> event to the current DAG" + ", dagId=" + currentDAG.getID());
> tryKillDAG(currentDAG, dagKillmessage); }
> else {
> LOG.info("No current running DAG, shutting down the AM");
> if (isSession && !state.equals(DAGAppMasterState.ERROR))
> { state = DAGAppMasterState.SUCCEEDED; }
> shutdownHandler.shutdown();
> }
> }
> }
>
> This can result in issues in dependent systems like Hive which will move
> ahead with other tasks in pipeline assuming the DAG was success, this can
> result in moving empty data in Hive
> As part of this JIRA, we are proposing to introduce a patch in TEZ, which
> introduces a config, which when set, then in case of shutdown with no current
> running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED
> in case DAG state at that time was not ERROR
>
> This is the patch, please review and let us know about your thoughts:
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)