[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638970#comment-14638970
]
Jason Lowe commented on TEZ-2311:
---------------------------------
Ideally we would like this fixed in a 0.7 patch release since 0.8 is probably a
ways off, at least in terms of when we would be able to deploy it. Could you
elaborate on what's not clear from the above analysis or what context is
missing? It seems wrong to me that the VertexImpl recorded the fact that it
wanted to recover into the KILLED state but then ignored that fact when it
later executed the recovery of tasks. Here's the breakdown in more detail:
# We recover the fact that VertexImpl is supposed to recover into the KILLED
state
# That causes the vertex to send TaskRecoverEvents to all of its tasks,
intending to recover into the KILLED state, but the vertex itself recovers
into the RUNNING state to wait for all tasks to finish recovering
# In the task recovering code, it explicitly ignores the desired recovering
state because taskEventRecoverTask.recoverData() is true.
# The tasks get an event with recoverData = true because of the first code
block in the above analysis. When the vertex generates the task recover events
it calls the event constructor form that implicitly defaults recoverData to
true.

It looks like we need a fix similar to the last patch hunk in TEZ-1011. I
don't think we should be passing recoverData as true in the task recover event
for this scenario, but I could be mistaken since I'm a bit unclear on when
recoverData is valid. Maybe the real bug is that we should still try to recover
data for the tasks without forgetting that we're trying to recover them into
the KILLED state.
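
To make the breakdown above concrete, here is a rough sketch of the behavior as I understand it. It is a simplified model, not the actual Tez code: RecoverySketch, RecoverEvent, State and both recover methods are hypothetical stand-ins, and the second method is only one possible shape for a fix, assuming it is valid to recover data while still honoring the desired KILLED state.

```java
// Hypothetical, simplified model of the recovery flow described above.
// None of these types are the real Tez classes.
public class RecoverySketch {
    enum State { RUNNING, SUCCEEDED, KILLED }

    /** Stand-in for a task recover event. */
    static final class RecoverEvent {
        final State desiredState;
        final boolean recoverData;

        // Convenience constructor that silently defaults recoverData to true.
        RecoverEvent(State desiredState) {
            this(desiredState, true);
        }

        RecoverEvent(State desiredState, boolean recoverData) {
            this.desiredState = desiredState;
            this.recoverData = recoverData;
        }
    }

    /** Problem behavior: the desired state is ignored whenever recoverData is true. */
    static State recoverIgnoringDesiredState(RecoverEvent event, State stateFromLog) {
        if (event.recoverData) {
            return stateFromLog;
        }
        return event.desiredState;
    }

    /** One possible fix: still recover data, but never forget a desired KILLED state. */
    static State recoverHonoringKill(RecoverEvent event, State stateFromLog) {
        if (event.desiredState == State.KILLED) {
            return State.KILLED;
        }
        return event.recoverData ? stateFromLog : event.desiredState;
    }

    public static void main(String[] args) {
        // The vertex wants KILLED but uses the one-argument constructor.
        RecoverEvent event = new RecoverEvent(State.KILLED);
        System.out.println(recoverIgnoringDesiredState(event, State.RUNNING)); // prints RUNNING
        System.out.println(recoverHonoringKill(event, State.RUNNING));         // prints KILLED
    }
}
```

In this model the task in the first case stays RUNNING even though a kill was requested, which mirrors the hang: the vertex waits on tasks that never move to the killed state.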
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Labels: Recovery
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.