[ https://issues.apache.org/jira/browse/MAPREDUCE-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798797#comment-16798797 ]
Akira Ajisaka commented on MAPREDUCE-6190:
------------------------------------------
Nice catch [~bibinchundatt].
{code}
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java
index 456f2a66c8f..e987e7e97d1 100644
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java
@@ -192,7 +192,8 @@ private void checkRunning(long currentTime) {
           (currentTime > (entry.getValue().getLastProgress() + taskTimeOut));
       // when container in NM not started in a long time,
       // we think the taskAttempt is stuck
-      boolean taskStuck = (!entry.getValue().isReported()) &&
+      boolean taskStuck = (taskTimeOut > 0) &&
+          (!entry.getValue().isReported()) &&
           (currentTime >
               (entry.getValue().getLastProgress() + taskStuckTimeOut));
{code}
Does the above change work for you? If it works, I'll file a separate JIRA and
upload this patch.
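To make the intent of the guard explicit, here is a minimal, self-contained sketch
of the resulting condition. ReportTime and isTaskStuck below are illustrative
stand-ins rather than the actual TaskHeartbeatHandler code; only the field and
method names are taken from the diff above.
{code}
// Minimal sketch of the guarded check above (not the real Hadoop classes).
public class StuckCheckSketch {

  /** Stand-in for the per-attempt progress record used in checkRunning(). */
  static class ReportTime {
    private final long lastProgress;  // last progress/heartbeat timestamp (ms)
    private final boolean reported;   // true once the attempt has reported to the AM

    ReportTime(long lastProgress, boolean reported) {
      this.lastProgress = lastProgress;
      this.reported = reported;
    }

    long getLastProgress() { return lastProgress; }
    boolean isReported()   { return reported; }
  }

  /**
   * The attempt is considered stuck only if the task timeout is enabled
   * (taskTimeOut > 0), the attempt has never reported, and the stuck
   * timeout has elapsed since its last recorded progress.
   */
  static boolean isTaskStuck(ReportTime value, long currentTime,
                             long taskTimeOut, long taskStuckTimeOut) {
    return (taskTimeOut > 0)
        && !value.isReported()
        && (currentTime > value.getLastProgress() + taskStuckTimeOut);
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    ReportTime neverReported = new ReportTime(now - 900_000L, false);

    // Timeout enabled: the unreported attempt is flagged as stuck.
    System.out.println(isTaskStuck(neverReported, now, 600_000L, 600_000L)); // true
    // Timeout disabled (taskTimeOut <= 0): the guard skips the stuck check.
    System.out.println(isTaskStuck(neverReported, now, 0L, 600_000L));       // false
  }
}
{code}
If I read the intent correctly, the extra (taskTimeOut > 0) term simply keeps the
new stuck-task check from firing at all when the task timeout has been disabled
(a non-positive taskTimeOut).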
> If a task gets stuck before its first heartbeat, it never times out and the MR
> job becomes stuck
> -------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6190
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6190
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.1
> Reporter: Ankit Malhotra
> Assignee: Zhaohui Xin
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: MAPREDUCE-6190.001.patch, MAPREDUCE-6190.002.patch,
> MAPREDUCE-6190.003.patch, MAPREDUCE-6190.004.patch, MAPREDUCE-6190.005.patch
>
>
> We are trying to figure out a weird issue we started seeing on our CDH5.1.0
> cluster with MapReduce jobs on YARN.
> We had a job stuck for hours because one of the mappers never started up
> fully. Basically, the map task had 2 attempts: the first one failed, the AM
> tried to schedule a second one, and that second attempt was stuck at STATE:
> STARTING, STATUS: NEW. A node never got assigned, and the task, along with
> the job, was stuck indefinitely.
> The AM logs showed this being logged again and again:
> {code}
> 2014-12-09 19:25:12,347 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1408745633994_450952_02_003807
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce preemption successful attempt_1408745633994_450952_r_000048_1000
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1408745633994_450952_r_000050_1000
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: completedMapPercent 0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 finalReduceMemLimit:1720320 netScheduledMapMem:2560 netScheduledReduceMem:1722880
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
> 2014-12-09 19:25:14,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:78 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
> 2014-12-09 19:25:14,359 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
> {code}
> On killing the task manually, the AM started the task again, scheduled it, and
> ran it to successful completion, finishing the task and the job with it.
> Some quick code grepping led us here:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-app/2.3.0/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java#397
> But we still don't quite understand why this would happen once in a while, or
> why the job would suddenly be OK once the stuck task is manually killed.
> Note: Other jobs succeed on the cluster while this job is stuck.