Ankit Malhotra created MAPREDUCE-6190:
-----------------------------------------

             Summary: MR Job is stuck because of one mapper stuck in STARTING
                 Key: MAPREDUCE-6190
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6190
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Ankit Malhotra


We are trying to figure out a strange issue we started seeing on our CDH 5.1.0 
cluster with MapReduce jobs on YARN.

We had a job stuck for hours because one of its mappers never fully started. 
The map task had two attempts: the first failed, and when the AM tried to 
schedule a second one, that attempt sat at STATE: STARTING, STATUS: NEW. A node 
was never assigned, and the task, and with it the whole job, hung indefinitely.
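(We first spotted this on the AM web UI; from the client side the same task 
presumably just sits at zero progress forever. A minimal sketch using the 
MapReduce client API with our job id, in case it helps anyone reproduce the 
observation:)

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskReport;
import org.apache.hadoop.mapreduce.TaskType;

public class ListMapTaskStates {
  public static void main(String[] args) throws Exception {
    Cluster cluster = new Cluster(new Configuration());
    Job job = cluster.getJob(JobID.forName("job_1408745633994_450952"));
    if (job == null) return;  // job already retired from the RM
    // The stuck map shows up as the one report pinned at 0.0 progress.
    for (TaskReport report : job.getTaskReports(TaskType.MAP)) {
      System.out.println(report.getCurrentStatus() + " " + report.getProgress());
    }
  }
}
{code}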

The AM kept logging the following over and over:

{code}
2014-12-09 19:25:12,347 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed 
container container_1408745633994_450952_02_003807
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce preemption 
successful attempt_1408745633994_450952_r_000048_1000
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all 
scheduled reduces:0
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting 
attempt_1408745633994_450952_r_000050_1000
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating 
schedule, headroom=0
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: completedMapPercent 
0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 finalReduceMemLimit:1720320 
netScheduledMapMem:2560 netScheduledReduceMem:1722880
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 
CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 
RackLocal:155
2014-12-09 19:25:14,353 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:78 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 
CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 
RackLocal:155
2014-12-09 19:25:14,359 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating 
schedule, headroom=0
{code}
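Decoding the numbers on the completedMapPercent line (all values in MB; both 
map and reduce containers appear to be 2560 MB), the allocator reserves room 
for exactly one map container, and the reduces are exactly one container over 
their limit:

{code}
totalMemLimit         = 1722880  = 673 * 2560   (headroom is 0)
netScheduledReduceMem = 1722880  = 673 * 2560   (all 673 assigned reduces)
finalReduceMemLimit   = 1720320  = 672 * 2560   -> reduces one container over the limit
finalMapMemLimit      =    2560  = 1 * 2560     -> room for exactly one map
completedMapPercent   = 0.99968  = 3124 / 3125  (so totalMaps is presumably 3125)
{code}

That would explain the loop: every heartbeat one reducer is preempted and its 
completed container comes back, yet ScheduledMaps stays at 1 and AssignedMaps 
at 0, so the freed space never reaches the map.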

When we killed the task manually, the AM started the task again, scheduled it, 
and ran it successfully, completing the task and with it the job.
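
(For the record, the manual kill was the equivalent of 
"mapred job -kill-task <attempt-id>". Programmatically it would look roughly 
like the sketch below; the attempt id is a placeholder, we took the real one 
from the AM web UI.)

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskAttemptID;

public class KillStuckAttempt {
  public static void main(String[] args) throws Exception {
    Cluster cluster = new Cluster(new Configuration());
    Job job = cluster.getJob(JobID.forName("job_1408745633994_450952"));
    if (job == null) return;  // job already retired from the RM
    // Placeholder: pass the stuck attempt id on the command line.
    TaskAttemptID stuck = TaskAttemptID.forName(args[0]);
    job.killTask(stuck);  // the AM then schedules a fresh attempt
  }
}
{code}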

Some quick code grepping led us here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-app/2.3.0/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java#397
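
For anyone reading along, here is a simplified, self-contained paraphrase of 
the preemption path around that line. The names follow the 2.3.0 
RMContainerAllocator as far as we could check, but treat it as a sketch, not 
the real implementation. Plugging in the numbers from our logs reproduces the 
messages above:

{code}
// Paraphrase (not a copy) of RMContainerAllocator#preemptReducesIfNeeded, 2.3.0.
class PreemptionSketch {
  int scheduledMaps = 1;          // the one stuck map from our logs
  int scheduledReduces = 0;
  int assignedMaps = 0;
  int assignedReduces = 673;
  int mapResourceReqt = 2560;     // MB per map container (inferred from the logs)
  int reduceResourceReqt = 2560;  // MB per reduce container
  int headroom = 0;               // RM-reported headroom, 0 in our logs
  float maxReducePreemptionLimit = 0.5f;

  int getMemLimit() {             // headroom + everything already assigned to the AM
    return headroom + assignedMaps * mapResourceReqt
                    + assignedReduces * reduceResourceReqt;
  }

  void preemptReducesIfNeeded() {
    if (scheduledMaps == 0) {
      return;                     // no maps waiting, nothing to do
    }
    int memLimit = getMemLimit();
    int availableMemForMap = memLimit - assignedReduces * reduceResourceReqt;
    if (availableMemForMap < mapResourceReqt) {
      // Not even one map fits: ramp down scheduled reduces, preempt assigned ones.
      System.out.println("Ramping down all scheduled reduces:" + scheduledReduces);
      int preemptionLimit = Math.max(mapResourceReqt,
          (int) (maxReducePreemptionLimit * memLimit));
      int preemptMem = Math.min(scheduledMaps * mapResourceReqt, preemptionLimit);
      int toPreempt = (int) Math.ceil((float) preemptMem / reduceResourceReqt);
      System.out.println("Going to preempt " + Math.min(toPreempt, assignedReduces));
    }
  }

  public static void main(String[] args) {
    new PreemptionSketch().preemptReducesIfNeeded();
  }
}
{code}

With our numbers this prints exactly one preemption per call, matching the 
logs, so the preemption decision itself looks sound.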

But we still don't quite understand why this happens only once in a while, or 
why the job suddenly becomes healthy once the stuck task is killed manually.

Note: Other jobs succeed on the cluster while this job is stuck.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
