[ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639413#comment-14639413
 ] 

Hitesh Shah commented on TEZ-2311:
----------------------------------

Comments: 
   - The DAGImpl change does not seem right. When vertex A is killed, why does 
the DAG kill vertex B? The DAG should trigger a kill for all vertices, or a 
subset of them, only under certain conditions. Adding this code creates a 
loop of events. 

Consider the case where there are 5 vertices, A to E, and A fails for some 
reason. The DAG will trigger a kill on B to E. When B reports back to the DAG 
as completed with state KILLED, the DAG will re-trigger KILL on all vertices. 
This seems wrong. 
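
To make the loop concrete, here is a rough, self-contained model of the scenario 
above. It is not Tez code: KillLoopSketch, simulate and the State enum are 
made-up names, and the assumption is that a kill sent to an already-terminal 
vertex is simply ignored. It only counts how many kill requests the DAG sends 
when it reacts to every KILLED completion versus only to the original failure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of the A-to-E scenario; none of this is Tez API.
public class KillLoopSketch {
    enum State { RUNNING, FAILED, KILLED }

    /** Returns the number of kill requests the DAG sends to vertices. */
    static int simulate(boolean killOnEveryKilledVertex) {
        Map<String, State> vertices = new LinkedHashMap<>();
        for (String name : new String[] {"A", "B", "C", "D", "E"}) {
            vertices.put(name, State.RUNNING);
        }
        // Queue of "vertex completed" events delivered to the DAG.
        Deque<String> completions = new ArrayDeque<>();
        vertices.put("A", State.FAILED);
        completions.add("A");

        int killsSent = 0;
        while (!completions.isEmpty()) {
            String done = completions.poll();
            State state = vertices.get(done);
            boolean shouldKill = state == State.FAILED
                    || (state == State.KILLED && killOnEveryKilledVertex);
            if (!shouldKill) {
                continue;
            }
            // The DAG sends a kill to every other vertex.
            for (Map.Entry<String, State> e : vertices.entrySet()) {
                if (e.getKey().equals(done)) {
                    continue;
                }
                killsSent++;
                // Only a running vertex transitions and reports back.
                if (e.getValue() == State.RUNNING) {
                    e.setValue(State.KILLED);
                    completions.add(e.getKey());
                }
            }
        }
        return killsSent;
    }

    public static void main(String[] args) {
        System.out.println("kill only on failure: " + simulate(false));
        System.out.println("kill on every KILLED vertex: " + simulate(true));
    }
}
```

In this toy model the extra kills are only redundant because terminal vertices 
ignore them; the point is that each KILLED completion fans out another round of 
events that the DAG did not need to send.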

VertexImpl changes:

{code}
for (Task task : vertex.tasks.values()) {
{code}
   - The new code additions need "(vertex.tasks != null && 
vertex.numTasks != 0)" checks to ensure that the NEW to KILLED transition 
does not cause an NPE. 
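
A minimal sketch of the guard, assuming a vertex whose task map is still null 
in the NEW state. The Vertex and Task classes and the killTasks helper below 
are stand-ins invented for illustration, not the real VertexImpl types; only 
the tasks and numTasks field names come from the check quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for the real vertex and task classes.
public class VertexKillGuardSketch {
    static class Task {
        boolean killed;
        void kill() { killed = true; }
    }

    static class Vertex {
        Map<Integer, Task> tasks; // assumed to stay null until the vertex is initialized
        int numTasks;
    }

    /** Kills all tasks of the vertex and returns how many were killed. */
    static int killTasks(Vertex vertex) {
        // Guard: a vertex killed straight from NEW has no tasks to iterate.
        if (vertex.tasks == null || vertex.numTasks == 0) {
            return 0;
        }
        int killed = 0;
        for (Task task : vertex.tasks.values()) {
            task.kill();
            killed++;
        }
        return killed;
    }

    public static void main(String[] args) {
        Vertex fresh = new Vertex(); // simulates NEW: tasks == null
        System.out.println("killed in NEW: " + killTasks(fresh));

        Vertex running = new Vertex();
        running.tasks = new LinkedHashMap<>();
        running.tasks.put(0, new Task());
        running.tasks.put(1, new Task());
        running.numTasks = 2;
        System.out.println("killed when running: " + killTasks(running));
    }
}
```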


   


> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
