[ 
https://issues.apache.org/jira/browse/TEZ-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987603#comment-15987603
 ] 

Eric Badger commented on TEZ-3696:
----------------------------------

{noformat}
diff --git a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
index ec7db614a..1a29978a2 100644
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
@@ -852,9 +852,7 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
   private void handleTaskAttemptCompletion(TezTaskAttemptID attemptId,
       TaskAttemptStateInternal attemptState) {
     this.sendTaskAttemptCompletionEvent(attemptId, attemptState);
-    if (getInternalState() != TaskStateInternal.SUCCEEDED) {
-      sendDAGSchedulerFinishedEvent(attemptId); // not a retro active action
-    }
+    sendDAGSchedulerFinishedEvent(attemptId); // not a retro active action
{noformat}

[~bikassaha], could you review the patch, especially the change quoted above? I 
couldn't figure out why TaskImpl wouldn't send the TA_COMPLETED event to the 
scheduler regardless of the state of the Task. However, that check was added 
explicitly in TEZ-2914, so I want to make sure that I'm not breaking something 
you had in mind when you wrote your original patch for concurrency. 

> Jobs can hang when both concurrency and speculation are enabled
> ---------------------------------------------------------------
>
>                 Key: TEZ-3696
>                 URL: https://issues.apache.org/jira/browse/TEZ-3696
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: TEZ-3696.001.patch, TEZ-3696.002.patch
>
>
> We can reproduce the hung job by doing the following: 
> 1. Run a sleep job with a concurrency of 1, speculation enabled, and 3 tasks 
> {noformat}
> HADOOP_CLASSPATH="$TEZ_HOME/*:$TEZ_HOME/lib/*:$TEZ_CONF_DIR" yarn jar 
> $TEZ_HOME/tez-tests-*.jar mrrsleep -Dtez.am.vertex.max-task-concurrency=1 
> -Dtez.am.speculation.enabled=true -Dtez.task.timeout-ms=60000 -m 3 -mt 60000 
> -ir 0 -irt 0 -r 0 -rt 0
> {noformat}
> 2. Let the 1st task run to completion and then stop the 2nd task so that a 
> speculative attempt is scheduled. Once the speculative attempt is scheduled 
> for the 2nd task, continue the original attempt and let it complete.
> {noformat}
> kill -STOP <pid>
> // wait a few seconds for a speculative attempt to kick off
> kill -CONT <pid>
> {noformat}
> 3. Kill the 3rd task, which will create a 2nd attempt
> {noformat}
> kill -9 <pid> 
> {noformat}
> 4. The next thing to be drawn off of the queue will be the speculative 
> attempt of the 2nd task. However, it is already completed, so it will just 
> sit in the final state and the job will hang. 
> Basically, for the failure to happen, the number of speculative tasks that 
> are scheduled but have not yet run has to be >= the concurrency of the job, 
> and there has to be at least 1 task failure. 
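
The sequence above can be illustrated with a toy model. This is only a sketch
under assumptions, not Tez's scheduler code: ToyScheduler, retryGetsSlot, and
the attempt names (t2_a1 and so on) are hypothetical, and the model assumes
the scheduler hands out slots in FIFO order and only forgets an attempt when
it receives a "finished" notification for it. The flag mimics the
getInternalState() != SUCCEEDED guard from the diff.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy model only: none of these names come from the Tez code base.
class ConcurrencyHangSketch {

    /** At most `limit` attempts may hold a slot at any one time. */
    static final class ToyScheduler {
        private final int limit;
        private final Deque<String> pending = new ArrayDeque<>();
        private final Set<String> holdingSlot = new HashSet<>();

        ToyScheduler(int limit) { this.limit = limit; }

        void request(String attempt) {
            pending.addLast(attempt);
            drain();
        }

        /** The "finished" notification: forget the attempt, free its slot. */
        void finished(String attempt) {
            pending.remove(attempt);
            holdingSlot.remove(attempt);
            drain();
        }

        private void drain() {
            while (holdingSlot.size() < limit && !pending.isEmpty()) {
                holdingSlot.add(pending.pollFirst());
            }
        }

        boolean holdsSlot(String attempt) { return holdingSlot.contains(attempt); }
    }

    /**
     * Replays the reproduction steps with a concurrency of 1.
     * skipWhenTaskSucceeded == true models the guard removed by the patch.
     */
    static boolean retryGetsSlot(boolean skipWhenTaskSucceeded) {
        ToyScheduler s = new ToyScheduler(1);
        s.request("t1_a0");
        s.request("t2_a0");
        s.request("t3_a0");
        s.finished("t1_a0");   // task 1 completes, t2_a0 takes the slot
        s.request("t2_a1");    // speculative attempt queues behind t3_a0
        s.finished("t2_a0");   // original completes, task 2 is now SUCCEEDED
        // The speculative attempt ends after its task already succeeded.
        if (!skipWhenTaskSucceeded) {
            s.finished("t2_a1");
        }
        s.finished("t3_a0");   // kill -9: the attempt fails and frees its slot
        s.request("t3_a1");    // retry of task 3
        return s.holdsSlot("t3_a1");
    }

    public static void main(String[] args) {
        System.out.println("with guard, retry runs:    " + retryGetsSlot(true));
        System.out.println("without guard, retry runs: " + retryGetsSlot(false));
    }
}
```

In this model, with the guard in place the stale speculative attempt takes the
only slot and never releases it, so the retry of task 3 never runs. Without
the guard, the stale attempt is dropped from the queue and the retry gets the
slot.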



