[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Jeff Zhang (JIRA) Thu, 23 Jul 2015 14:30:18 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639538#comment-14639538
 ]


Jeff Zhang commented on TEZ-2311:
---------------------------------

I thought about adding a recovery log entry for the DAG kill operation, but it 
seemed a little heavy here. Thinking about it again, it does not seem difficult 
(I will post another patch). The change in VertexImpl doesn't resolve the hang 
issue completely. Consider a case where all but one of the vertices are 
recovered to KILLED, and the remaining vertex is recovered to RUNNING with a 
new task attempt scheduled. That new task attempt may wait indefinitely for 
data movement events from its upstream. Or, if no task attempt is scheduled, 
its VertexManager may wait for something from upstream. 
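
The scenario above can be sketched with a toy model. This is not Tez code: 
the State enum, the vertex names, and both helper functions are hypothetical, 
and the sketch assumes that replaying a logged DAG kill would move every 
still-running vertex to KILLED.

```python
from enum import Enum


class State(Enum):
    # Simplified vertex states; real Tez vertices have more states than this.
    RUNNING = "RUNNING"
    KILLED = "KILLED"
    SUCCEEDED = "SUCCEEDED"


def stuck_vertices(states, upstream):
    """Return RUNNING vertices that have at least one KILLED upstream vertex.

    In this model such a vertex can wait forever, because a killed upstream
    vertex will never send it the events it is waiting for.
    """
    return sorted(
        v
        for v, s in states.items()
        if s is State.RUNNING
        and any(states[u] is State.KILLED for u in upstream.get(v, []))
    )


def apply_recovered_kill(states):
    """Model replaying a logged DAG kill: RUNNING vertices become KILLED."""
    return {
        v: (State.KILLED if s is State.RUNNING else s)
        for v, s in states.items()
    }


# Hypothetical recovered DAG: v1 and v2 feed v3.
states = {"v1": State.KILLED, "v2": State.KILLED, "v3": State.RUNNING}
upstream = {"v3": ["v1", "v2"]}

print(stuck_vertices(states, upstream))                        # ['v3']
print(stuck_vertices(apply_recovered_kill(states), upstream))  # []
```

In this model, fixing individual vertex state alone leaves v3 waiting, while 
recording the kill at the DAG level and replaying it on recovery leaves no 
vertex waiting.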

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>            Assignee: Jeff Zhang
>              Labels: Recovery
>         Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
