[ 
https://issues.apache.org/jira/browse/TEZ-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373411#comment-15373411
 ] 

Jason Lowe commented on TEZ-3335:
---------------------------------

I thought about fixing this on the YARN side.  The YarnClient currently 
auto-redirects to the AHS when the RM doesn't know about an app.  It could 
detect that the AHS report doesn't contain a status, which means the app is 
essentially lost at that point: the RM doesn't know about it, and the AHS 
never got a completion event for it.  However, I didn't want the AHS client to 
throw an exception for that case since the app report does contain _some_ 
useful information about the lost app, such as user, queue, start time, app 
name, etc.  Throwing an exception means the user gets no details about the app, 
so returning what we do know seemed more prudent.
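
As a rough illustration of that trade-off, here is a minimal Java sketch of a 
caller that treats a missing status as "lost" while still reporting the details 
that are known.  The Report class and every name in it are hypothetical 
stand-ins for illustration, not the actual YARN report class or client API.

```java
final class LostAppSketch {
    /** Hypothetical stand-in for an application report; not the YARN class. */
    static final class Report {
        final String user;
        final String queue;
        final String name;
        final long startTime;
        final String state; // null when no completion event was ever recorded

        Report(String user, String queue, String name, long startTime, String state) {
            this.user = user;
            this.queue = queue;
            this.name = name;
            this.startTime = startTime;
            this.state = state;
        }
    }

    /** Assumption: a report without a status means the app is lost. */
    static boolean isLost(Report r) {
        return r.state == null;
    }

    /** Return what is known about the app instead of throwing for a lost app. */
    static String describe(Report r) {
        String state = isLost(r) ? "<unknown, app lost>" : r.state;
        return r.name + " user=" + r.user + " queue=" + r.queue
            + " start=" + r.startTime + " state=" + state;
    }
}
```

With this shape the caller still sees user, queue, start time and app name for 
a lost app, which an exception would have hidden.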

The problem with the AHS or client trying to fix this on the YARN side is that 
we don't know what the final status of the application was.  It could be any of 
FAILED, KILLED, or SUCCEEDED if the completion event tried to get posted to the 
AHS but was dropped for some reason.  Therefore it seems a bit dangerous to 
assume one of those three.  We could always add a new status like LOST or 
UNKNOWN, etc., but of course that requires app frameworks to update themselves 
to detect and react properly to the new state.
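
To make the framework-side cost concrete, here is a minimal Java sketch.  The 
AppState enum, its UNKNOWN value and both method names are hypothetical and 
only stand in for the idea of a new LOST/UNKNOWN status; they are not existing 
YARN or Tez identifiers.

```java
final class PollingSketch {
    /** Hypothetical states; UNKNOWN stands for the proposed LOST/UNKNOWN status. */
    enum AppState { RUNNING, FAILED, KILLED, SUCCEEDED, UNKNOWN }

    /** A framework check written before the new state existed. */
    static boolean legacyKeepPolling(AppState s) {
        switch (s) {
            case FAILED:
            case KILLED:
            case SUCCEEDED:
                return false; // terminal state, stop asking for status
            default:
                return true;  // anything else is assumed to still be running
        }
    }

    /** The same check after the framework is updated for the new state. */
    static boolean updatedKeepPolling(AppState s) {
        return s != AppState.UNKNOWN && legacyKeepPolling(s);
    }
}
```

A framework that only knows the three original final states keeps polling when 
it sees the new one, so it behaves no better than it does today until it is 
updated.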


> DAG client thinks app is still running when app status is null
> --------------------------------------------------------------
>
>                 Key: TEZ-3335
>                 URL: https://issues.apache.org/jira/browse/TEZ-3335
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>
> When an RM restarts without recovering apps (i.e.: either work-preserving is 
> not enabled or state store was removed) and the YARN application history is 
> enabled then YarnClient can return an application report with the app status 
> as null.  The RM doesn't know about the application, so the client redirects 
> to the AHS.  The AHS knows the app started at some point but will never 
> receive a finished event, hence the null app status.
> The DAG client fails to detect this scenario and believes the app is still 
> running, so for example Hive clients will continue to hammer for status on an 
> app that doesn't exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
