[ https://issues.apache.org/jira/browse/TEZ-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373411#comment-15373411 ]
Jason Lowe commented on TEZ-3335:
---------------------------------
I thought about fixing this on the YARN side. The YarnClient currently
auto-redirects to the AHS when the RM doesn't know about an app. It could
detect that the AHS report doesn't contain a status, which means the app is
essentially lost at that point. The RM doesn't know about it, and the AHS
never got a completion event for it. However, I didn't want the AHS client to
throw an exception for that case since the app report does contain _some_
useful information about the lost app, such as user, queue, start time, app
name, etc. Throwing an exception means the user gets no details about the app,
so returning what we do know seemed more prudent.

The problem with the AHS or client trying to fix this on the YARN side is that
we don't know what the final status of the application was. It could be any of
FAILED, KILLED, or SUCCEEDED if the completion event was sent to the AHS but
was dropped for some reason. Therefore it seems a bit dangerous to
assume one of those three. We could always add a new status like LOST or
UNKNOWN, etc., but of course that requires app frameworks to update themselves
to detect and react properly to the new state.
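
As a rough illustration of the client-side alternative, here is a minimal
sketch of a polling client that treats a missing status as its own condition
and stops polling, without guessing which of FAILED, KILLED, or SUCCEEDED
actually happened. All types here (Report, AppState, Verdict) are hypothetical
stand-ins written for this example. They are not the real YARN or Tez classes,
and this is not the actual fix.

```java
// Hypothetical stand-ins only: the real types would be YARN's application
// report and state classes, which are deliberately not used here.
final class LostAppSketch {

    /** Simplified application states (an assumption, not the YARN enum). */
    enum AppState { RUNNING, FINISHED, FAILED, KILLED }

    /** What a polling client concludes from a report (hypothetical). */
    enum Verdict { KEEP_POLLING, TERMINAL, LOST }

    /** Minimal report; state is null when no completion event was recorded. */
    static final class Report {
        final String user;
        final String queue;
        final String name;
        final AppState state;

        Report(String user, String queue, String name, AppState state) {
            this.user = user;
            this.queue = queue;
            this.name = name;
            this.state = state;
        }
    }

    /** Classify a report without assuming a final status for a lost app. */
    static Verdict classify(Report report) {
        if (report.state == null) {
            // Neither the RM nor the AHS can say how the app ended.
            return Verdict.LOST;
        }
        switch (report.state) {
            case RUNNING:
                return Verdict.KEEP_POLLING;
            default:
                // FINISHED, FAILED and KILLED are all terminal.
                return Verdict.TERMINAL;
        }
    }

    public static void main(String[] args) {
        Report lost = new Report("alice", "default", "hive-query", null);
        // The partial details (user, queue, name) remain available to report.
        System.out.println(lost.name + " -> " + classify(lost));
    }
}
```

The point of the separate LOST verdict is that the client can stop hammering
for status while still showing the user the details the report does contain.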
> DAG client thinks app is still running when app status is null
> --------------------------------------------------------------
>
> Key: TEZ-3335
> URL: https://issues.apache.org/jira/browse/TEZ-3335
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
>
> When an RM restarts without recovering apps (i.e., either work-preserving is
> not enabled or the state store was removed) and the YARN application history is
> enabled, then YarnClient can return an application report with the app status
> as null. The RM doesn't know about the application, so the client redirects
> to the AHS. The AHS knows the app started at some point but will never
> receive a finished event, hence the null app status.
>
> The DAG client fails to detect this scenario and believes the app is still
> running, so for example Hive clients will continue to hammer for status on an
> app that doesn't exist.