[
https://issues.apache.org/jira/browse/TEZ-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639432#comment-14639432
]
Jeff Zhang commented on TEZ-2311:
---------------------------------
bq. The DAGImpl change does not seem right. When vertex A is killed, why is
Vertex B being killed by the DAG? The DAG should be triggering a kill for all
vertices or a sub-set of them on certain conditions. Adding this code creates a
loop of events.
The kill would happen only once, because we check the DAG's
terminationCause before killing the vertices.
{code}
void enactKill(DAGTerminationCause dagTerminationCause,
    VertexTerminationCause vertexTerminationCause) {
  if (trySetTerminationCause(dagTerminationCause)) {
    for (Vertex v : vertices.values()) {
      eventHandler.handle(
          new VertexEventTermination(v.getVertexId(), vertexTerminationCause)
      );
    }
  }
}
{code}
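To illustrate why this guard prevents an event loop, here is a minimal standalone sketch. It is not the Tez code: KillOnceSketch, its TerminationCause enum, and the sentEvents list are hypothetical stand-ins, and it assumes trySetTerminationCause succeeds only for the first caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KillOnceSketch {
    // Hypothetical stand-in for the DAG termination causes.
    enum TerminationCause { DAG_KILL, VERTEX_FAILURE }

    // null means no termination has been requested yet.
    private TerminationCause terminationCause = null;
    private final List<String> vertexIds;
    // Records each termination event that would be sent to a vertex.
    final List<String> sentEvents = new ArrayList<>();

    KillOnceSketch(List<String> vertexIds) {
        this.vertexIds = vertexIds;
    }

    // Assumed behavior: only the first call sets the cause and returns true.
    boolean trySetTerminationCause(TerminationCause cause) {
        if (terminationCause == null) {
            terminationCause = cause;
            return true;
        }
        return false;
    }

    TerminationCause getTerminationCause() {
        return terminationCause;
    }

    // Same shape as enactKill above: vertices are signalled only when
    // the termination cause was newly set by this call.
    void enactKill(TerminationCause cause) {
        if (trySetTerminationCause(cause)) {
            for (String id : vertexIds) {
                sentEvents.add("TERMINATE " + id);
            }
        }
    }

    public static void main(String[] args) {
        KillOnceSketch dag = new KillOnceSketch(Arrays.asList("A", "B"));
        dag.enactKill(TerminationCause.DAG_KILL);
        // A second kill, e.g. triggered by a vertex reporting it was killed,
        // finds the cause already set and sends nothing.
        dag.enactKill(TerminationCause.VERTEX_FAILURE);
        System.out.println(dag.sentEvents.size());
        System.out.println(dag.getTerminationCause());
    }
}
```

Under that assumption, a vertex reporting back that it was killed can re-enter enactKill without producing any further termination events.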
Posting another patch to fix the null check on vertex.tasks.
> AM can hang if kill received while recovering from previous attempt
> -------------------------------------------------------------------
>
> Key: TEZ-2311
> URL: https://issues.apache.org/jira/browse/TEZ-2311
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Labels: Recovery
> Attachments: TEZ-2311-1.patch, TEZ-2311-2.patch
>
>
> We saw an instance of a Tez job hanging despite receiving multiple kill
> requests from clients. The AM was recovering from a prior attempt when the
> first kill request arrived.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)