[jira] [Commented] (TEZ-2311) AM can hang if kill received while recovering from previous attempt

Jason Lowe (JIRA) Thu, 23 Jul 2015 08:13:38 -0700

    [ https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638970#comment-14638970 ]


Jason Lowe commented on TEZ-2311:
---------------------------------

Ideally we would like this fixed in a 0.7 patch release, since 0.8 is probably 
a ways out, at least in terms of when we would be able to deploy it.  Could you 
elaborate on what's not clear from the above analysis or what context is 
missing?  It seems wrong to me that VertexImpl recorded that it wanted to 
recover into the KILLED state but then ignored that when it later executed the 
recovery of its tasks.  Here's the breakdown in more detail:

# We recover the fact that VertexImpl is supposed to recover into the KILLED 
state.
# That causes it to generate TaskRecoverEvents to try to recover the tasks into 
the KILLED state.  However, after sending those events to all the tasks, the 
vertex itself recovers into the RUNNING state to wait for all tasks to finish 
recovering.
# The task recovery code explicitly ignores the desired recovery state because 
taskEventRecoverTask.recoverData() is true.
# The tasks get an event with recoverData = true because of the first code 
block in the above analysis: when it generates the task recover events, it 
calls the event constructor form that implicitly defaults recoverData to true.
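
To make the sequence above concrete, here's a rough Java sketch of the 
interaction as I understand it.  This is not Tez source code: the class, enum, 
and method names (RecoverySketch, State, TaskRecoverEvent, recoverTask) are 
simplified stand-ins I made up for illustration, and the only behavior it 
models is the defaulted recoverData flag overriding the desired state.

```java
// Simplified, hypothetical model of the flow described above. Not Tez source.
public class RecoverySketch {
    enum State { RUNNING, SUCCEEDED, KILLED }

    // Stand-in for a task recover event: a desired state plus a recoverData flag.
    static final class TaskRecoverEvent {
        final State desiredState;
        final boolean recoverData;

        // Short constructor form: recoverData implicitly defaults to true.
        TaskRecoverEvent(State desiredState) {
            this(desiredState, true);
        }

        TaskRecoverEvent(State desiredState, boolean recoverData) {
            this.desiredState = desiredState;
            this.recoverData = recoverData;
        }
    }

    // Stand-in for the task-side recovery transition.
    static State recoverTask(TaskRecoverEvent event, State stateFromRecoveryLog) {
        if (event.recoverData) {
            // The desired state is ignored; the task resumes from recorded data.
            return stateFromRecoveryLog;
        }
        return event.desiredState;
    }

    public static void main(String[] args) {
        // The vertex wants KILLED but uses the short constructor form.
        TaskRecoverEvent event = new TaskRecoverEvent(State.KILLED);
        State result = recoverTask(event, State.RUNNING);
        System.out.println(result); // prints RUNNING
    }
}
```

In this model the task comes back RUNNING even though the event asked for 
KILLED, which matches the hang: the vertex waits on tasks that never honor the 
kill.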

It looks like we need a fix similar to the last patch hunk in TEZ-1011.  I 
don't think we should be passing recoverData as true in the task recover event 
for this scenario, but I could be mistaken since I'm a bit unclear on when 
recoverData is valid.  Maybe the real bug is that we should still try to 
recover data for the tasks but must not forget that we're trying to recover 
them into the KILLED state.
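
As a rough illustration of the second idea (keep recovering data, but don't 
lose the kill), here's a minimal Java sketch.  Again, this is a hypothetical 
model, not Tez code or the TEZ-1011 patch: RecoveryFixSketch, State, and 
recoverHonoringKill are names invented for the example.  It also assumes every 
task should end up KILLED when the vertex is recovering into KILLED; whether 
tasks that already finished should keep their recorded state is something I 
haven't worked out.

```java
// Hypothetical sketch of one possible fix shape. Not Tez source.
public class RecoveryFixSketch {
    enum State { RUNNING, SUCCEEDED, KILLED }

    // Task-side recovery that honors a desired KILLED state even when
    // recoverData is true (assumption: a kill always wins).
    static State recoverHonoringKill(State desiredState, boolean recoverData,
                                     State stateFromRecoveryLog) {
        if (desiredState == State.KILLED) {
            // Recorded data could still be read here; the final state is KILLED.
            return State.KILLED;
        }
        // Otherwise keep the existing behavior.
        return recoverData ? stateFromRecoveryLog : desiredState;
    }

    public static void main(String[] args) {
        // Vertex recovering into KILLED, event still carries recoverData = true.
        System.out.println(
            recoverHonoringKill(State.KILLED, true, State.RUNNING)); // prints KILLED
    }
}
```

The other option discussed above, passing recoverData as false for this 
scenario, would leave the task-side logic alone and change only the call site 
that builds the event.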

> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
>                 Key: TEZ-2311
>                 URL: https://issues.apache.org/jira/browse/TEZ-2311
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>              Labels: Recovery
>
> We saw an instance of a Tez job hanging despite receiving multiple kill 
> requests from clients.  The AM was recovering from a prior attempt when the 
> first kill request arrived.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
