[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644672#comment-14644672
]
Hitesh Shah commented on TEZ-2311:
----------------------------------
Comments:
- Why is the DAG kill request being logged in DAGImpl and not in
DAGAppMaster, where the request is first received, i.e. as part of either
the sessionStop or the tryKillDAG call?
- The fix does not really solve the case where the user tried to stop the
session but the AM crashed before the RM unregister happened and some recovery
logs were not fully synced. Logging the event when the initial handler
receives the kill request would also allow adding a flag that records whether
a session stop was requested.
- Is there any reason why the full process of enacting a DAG kill is being
implemented, as compared to just using the same approach as recovering a DAG
finished summary event with killed state? Wouldn't that be simpler? There is
probably no need to try to recover all kinds of information for a killed DAG;
it is simply a case of moving all objects to a final desired state. This
obviously should only be done if the DAG had not completed (i.e. not in a
successful or failed state, as DAG_KILL is ignored in a successful or failed
state).
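
The last two points could be sketched roughly as below. This is only an
illustration of the idea, not Tez code: KillRequestedEvent, RecoveryResult,
DagState and recover() are hypothetical names made up for this sketch, and the
real recovery path tracks far more state than a single enum.

```java
public class KillRecoverySketch {

    /** Simplified DAG states; an assumption for this sketch only. */
    enum DagState { RUNNING, SUCCEEDED, FAILED, KILLED }

    /**
     * Hypothetical recovery-log record written by the handler that first
     * receives the kill request. The flag records whether the kill came
     * from a session stop.
     */
    static final class KillRequestedEvent {
        final String dagId;
        final boolean sessionStopRequested;

        KillRequestedEvent(String dagId, boolean sessionStopRequested) {
            this.dagId = dagId;
            this.sessionStopRequested = sessionStopRequested;
        }
    }

    /** What the new AM attempt should do after reading the recovery log. */
    static final class RecoveryResult {
        final DagState dagState;
        final boolean shutdownSession;

        RecoveryResult(DagState dagState, boolean shutdownSession) {
            this.dagState = dagState;
            this.shutdownSession = shutdownSession;
        }
    }

    /**
     * If a kill request was logged and the DAG had not already succeeded or
     * failed, move it straight to KILLED instead of replaying the kill.
     * A null event means no kill request was found in the log.
     */
    static RecoveryResult recover(DagState lastRecoveredState, KillRequestedEvent kill) {
        if (kill == null) {
            return new RecoveryResult(lastRecoveredState, false);
        }
        boolean alreadyCompleted = lastRecoveredState == DagState.SUCCEEDED
                || lastRecoveredState == DagState.FAILED;
        // DAG_KILL is ignored once the DAG has succeeded or failed.
        DagState finalState = alreadyCompleted ? lastRecoveredState : DagState.KILLED;
        return new RecoveryResult(finalState, kill.sessionStopRequested);
    }
}
```

With this shape, a DAG that was still running when the AM crashed is recovered
directly as killed, and the session-stop flag tells the new attempt whether it
should also shut down rather than wait for more DAGs.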
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch, TEZ-2311-3.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.