[ https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877128#comment-14877128 ]
Bob commented on MAPREDUCE-6485: -------------------------------- @[~varun_saxena] Thanks for your checking this issue. Below are some related logs you can refer to. *1. All logs in AM about first attempt:* {code} 03:30:32,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from NEW to UNASSIGNED 03:33:22,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_003425 to attempt_1439640536651_4710_m_002807_0 03:33:22,038 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED 03:33:22,044 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1439640536651_4710_01_003425 taskAttempt attempt_1439640536651_4710_m_002807_0 03:33:22,044 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1439640536651_4710_m_002807_0 03:33:22,071 INFO [ContainerLauncher #27] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1439640536651_4710_m_002807_0 : 26008 03:33:22,071 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1439640536651_4710_m_002807_0] using containerId: [container_1439640536651_4710_01_003425 on NM: [SCCHDPHIV02129:26009] 03:33:22,071 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from ASSIGNED to RUNNING 03:33:31,481 INFO [IPC Server handler 24 on 27102] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1439640536651_4710_m_003425 given task: attempt_1439640536651_4710_m_002807_0 03:43:31,473 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs 03:43:31,473 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs 03:43:31,474 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP 03:43:31,474 INFO [ContainerLauncher #23] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1439640536651_4710_01_003425 taskAttempt attempt_1439640536651_4710_m_002807_0 03:43:31,474 INFO [ContainerLauncher #23] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1439640536651_4710_m_002807_0 03:43:31,478 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP 03:43:31,478 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED 03:43:32,701 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1439640536651_4710_m_002807_0: Container killed by the ApplicationMaster. {code} *2. All logs in AM about second attempt (only have one split):* {code} 03:39:55,339 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1439640536651_4710_m_002807_1 TaskAttempt Transitioned from NEW to UNASSIGNED {code} *3. Checking the log with time_stamp we can see later available resource are all allocated to reduces not the last map attempt after the second attempt started :* {code} 03:39:55,978 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015149 to attempt_1439640536651_4710_r_000669_0 03:39:55,978 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:330 AssignedMaps:1 AssignedReds:669 CompletedMaps:14257 CompletedReds:0 ContAlloc:14958 ContRel:30 HostLocal:14221 RackLocal:38 03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1 03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce 03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015150 to attempt_1439640536651_4710_r_000670_0 03:40:01,004 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:329 AssignedMaps:1 AssignedReds:670 CompletedMaps:14257 CompletedReds:0 ContAlloc:14959 ContRel:30 HostLocal:14221 RackLocal:38 03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1 03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce 03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015151 to attempt_1439640536651_4710_r_000671_0 03:40:05,025 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:328 AssignedMaps:1 AssignedReds:671 CompletedMaps:14257 CompletedReds:0 ContAlloc:14960 ContRel:30 HostLocal:14221 RackLocal:38 03:40:15,078 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1 03:40:15,078 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce 03:40:15,140 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015152 to attempt_1439640536651_4710_r_000672_0 03:40:15,140 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:327 AssignedMaps:1 AssignedReds:672 CompletedMaps:14257 CompletedReds:0 ContAlloc:14961 ContRel:30 HostLocal:14221 RackLocal:38 03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1 03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce 03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015153 to attempt_1439640536651_4710_r_000673_0 03:40:19,161 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:326 AssignedMaps:1 AssignedReds:673 CompletedMaps:14257 CompletedReds:0 ContAlloc:14962 ContRel:30 HostLocal:14221 RackLocal:38 03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1439640536651_4710_01_003425 03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1 03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce 03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1439640536651_4710_01_015154 to attempt_1439640536651_4710_r_000674_0 03:43:32,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:325 AssignedMaps:0 AssignedReds:674 CompletedMaps:14257 CompletedReds:0 ContAlloc:14963 ContRel:30 HostLocal:14221 RackLocal:38 .... {code} > MR job hanged forever because all resources are taken up by reducers and the > last map attempt never get resource to run > ----------------------------------------------------------------------------------------------------------------------- > > Key: MAPREDUCE-6485 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster > Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1 > Reporter: Bob > Priority: Critical > > The scenarios is like this: > With configuring mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces > will take resource and start to run when all the map have not finished. > But It could happened that when all the resources are taken up by running > reduces, there is still one map not finished. > Under this condition , the last map have two task attempts . > As for the first attempt was killed due to timeout(mapreduce.task.timeout), > and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so failed > map attempt would not be started. > As for the second attempt which was started due to having enable map task > speculative is pending at UNASSINGED state because of no resource available. > But the second map attempt request have lower priority than reduces, so > preemption would not happened. > As a result all reduces would not finished because of there is one map left. > and the last map hanged there because of no resource available. so, the job > would never finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)