[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645195#comment-14645195
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
bq. why is the dag kill requested being logged in DAGImpl and not in
DAGAppMaster when the request was first received? i.e. as part of either
sessionStop or the tryKillDAG call.
bq. The fix does not really solve the case where the user tried to stop the
session but the AM crashed before the RM unregister happened and some recovery
logs were not fully synced. Logging the event when the kill request was
received by the initial handler could also add a flag as to whether a session
stop was requested.
Right, it seems that logging the event in DAGAppMaster would be the only way to
handle this case.
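To make the idea concrete, here is a minimal sketch of journaling the kill request at the point where it is first received, together with a flag for whether a session stop was requested. All class and method names below (KillRequestRecord, KillRequestJournal, KillRequestHandler) are hypothetical stand-ins and not Tez APIs; the in-memory list only illustrates where the record would be written, not how the real recovery log is synced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical record of a kill request; not a Tez class.
final class KillRequestRecord {
    final String dagId;
    final boolean sessionStopRequested;

    KillRequestRecord(String dagId, boolean sessionStopRequested) {
        this.dagId = dagId;
        this.sessionStopRequested = sessionStopRequested;
    }
}

// Hypothetical in-memory stand-in for the recovery log.
final class KillRequestJournal {
    private final List<KillRequestRecord> records = new ArrayList<>();

    void append(KillRequestRecord record) {
        records.add(record);
    }

    List<KillRequestRecord> records() {
        return Collections.unmodifiableList(records);
    }
}

// Hypothetical stand-in for the handler that first receives the request.
final class KillRequestHandler {
    private final KillRequestJournal journal;

    KillRequestHandler(KillRequestJournal journal) {
        this.journal = journal;
    }

    // Kill of a single DAG: journal first, then (not shown) dispatch the kill.
    void tryKillDag(String dagId) {
        journal.append(new KillRequestRecord(dagId, false));
    }

    // Session stop: journal the kill of the current DAG with the flag set.
    void sessionStop(String currentDagId) {
        journal.append(new KillRequestRecord(currentDagId, true));
    }
}
```

With this shape, a later attempt that replays the journal could tell a plain DAG kill from a session stop, even if the DAG-level events were never written.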
bq. Any reason why the full process of enacting a dag kill is being implemented
as compared to just using the same approach as recovering a dag finished
summary event with killed state? Wouldn't that be simpler? There is probably no
need to try and recover all kinds of information for a killed dag but simply a
case of moving all objects to a final desired state? This obviously should only
be done if the dag had not completed (i.e., not in a successful or failed state,
as DAG_KILL is ignored in a successful or failed state).
The purpose is to reuse the code in
KillNewJobTransition/KillInitedJobTransition/DAGKilledTransition, such as
setTerminationCause. It could instead be done by recovering directly to the
desired state.
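A minimal sketch of the "recover directly to the desired state" alternative, assuming a much simplified state model. DagState, RecoveredDag and KillRecovery are hypothetical names for illustration only, and the termination cause is a plain string rather than whatever setTerminationCause actually takes in Tez.

```java
// Hypothetical, simplified DAG states; not the real DAGImpl state machine.
enum DagState { NEW, INITED, RUNNING, SUCCEEDED, FAILED, KILLED }

// Hypothetical holder for the state rebuilt from the recovery log.
final class RecoveredDag {
    DagState state;
    String terminationCause; // stays null unless a kill is applied

    RecoveredDag(DagState state) {
        this.state = state;
    }
}

final class KillRecovery {
    // Moves a non-terminal DAG straight to KILLED and records the cause.
    // Returns false and changes nothing if the DAG already finished,
    // mirroring the point that DAG_KILL is ignored in those states.
    static boolean applyKill(RecoveredDag dag) {
        if (dag.state == DagState.SUCCEEDED
                || dag.state == DagState.FAILED
                || dag.state == DagState.KILLED) {
            return false;
        }
        dag.terminationCause = "DAG_KILL";
        dag.state = DagState.KILLED;
        return true;
    }
}
```

The trade-off discussed above is visible here: this path is short, but it duplicates the cause-setting that the existing kill transitions already perform.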
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.