[ 
https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688552#comment-17688552
 ] 

Syed Shameerur Rahman commented on TEZ-4474:
--------------------------------------------

I still feel that the patch tries to fix the symptom rather than fixing the 
root cause. As per your comment " the recovery was not failing, the code is 
such that if recovery is not able to parse anything, it returns null as of now" 
It looks like the actual problem is DAG recovery not able to parse, Shouldn't 
we try to fix it rather than suppressing the symptom?

 

 [~abstractdog] what are your thoughts on this?

> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
>                 Key: TEZ-4474
>                 URL: https://issues.apache.org/jira/browse/TEZ-4474
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2
>            Reporter: Mudit Sharma
>            Priority: Critical
>         Attachments: 
> 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Summary of the Issue:
> When Tez DAG recovery is failed because of some reason in the second retry of 
> any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE
> Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(), 
> Tez AM will try to shutdown the DAG, and since recovery was failed so there 
> will not be any running DAGs
> If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM 
> sets the status to SUCCEEDED, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> {code:java}
> public void shutdownTezAM(String dagKillmessage) throws TezException {
>     if (!sessionStopped.compareAndSet(false, true))
> {       // No need to shutdown twice.       // Return with a no-op if 
> shutdownTezAM has been invoked earlier.       return;     }
>     synchronized (this) {
>       this.taskSchedulerManager.setShouldUnregisterFlag();
>       if (currentDAG != null
>           && !currentDAG.isComplete())
> {         //send a DAG_TERMINATE message         LOG.info("Sending a kill 
> event to the current DAG"             + ", dagId=" + currentDAG.getID());     
>     tryKillDAG(currentDAG, dagKillmessage);       }
> else {
>         LOG.info("No current running DAG, shutting down the AM");
>         if (isSession && !state.equals(DAGAppMasterState.ERROR))
> {           state = DAGAppMasterState.SUCCEEDED;         }
>         shutdownHandler.shutdown();
>       }
>     }
>   }
> {code}
>  
> This can result in issues in dependent systems like Hive which will move 
> ahead with other tasks in pipeline assuming the DAG was success, this can 
> result in moving empty data in Hive
> As part of this JIRA, we are proposing to introduce a patch in TEZ, which 
> introduces a config, which when set, then in case of shutdown with no current 
> running DAGs, Tez status will always be marked as FAILED instead of SUCCEEDED 
> in case DAG state at that time was not ERROR
>  
> PR: [https://github.com/apache/tez/pull/266] 
> This is the patch, please review and let us know about your thoughts: 
> [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to