[jira] [Commented] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

Bob (JIRA) Sat, 19 Sep 2015 06:55:49 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877128#comment-14877128
 ]


Bob commented on MAPREDUCE-6485:
--------------------------------

@[~varun_saxena] Thanks for your checking this issue. Below are some related 
logs you can refer to.
*1. All logs in AM about first attempt:*
{code}
03:30:32,457 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED
03:33:22,037 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_003425 to attempt_1439640536651_4710_m_002807_0
03:33:22,038 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from UNASSIGNED 
to ASSIGNED
03:33:22,044 INFO [ContainerLauncher #27] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
the event EventType: CONTAINER_REMOTE_LAUNCH for container 
container_1439640536651_4710_01_003425 taskAttempt 
attempt_1439640536651_4710_m_002807_0
03:33:22,044 INFO [ContainerLauncher #27] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching 
attempt_1439640536651_4710_m_002807_0
03:33:22,071 INFO [ContainerLauncher #27] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port 
returned by ContainerManager for attempt_1439640536651_4710_m_002807_0 : 26008
03:33:22,071 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: 
[attempt_1439640536651_4710_m_002807_0] using containerId: 
[container_1439640536651_4710_01_003425 on NM: [SCCHDPHIV02129:26009]
03:33:22,071 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from ASSIGNED to 
RUNNING
03:33:31,481 INFO [IPC Server handler 24 on 27102] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: 
jvm_1439640536651_4710_m_003425 given task: 
attempt_1439640536651_4710_m_002807_0
03:43:31,473 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1439640536651_4710_m_002807_0: 
AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
03:43:31,473 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1439640536651_4710_m_002807_0: 
AttemptID:attempt_1439640536651_4710_m_002807_0 Timed out after 600 secs
03:43:31,474 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from RUNNING to 
FAIL_CONTAINER_CLEANUP
03:43:31,474 INFO [ContainerLauncher #23] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing 
the event EventType: CONTAINER_REMOTE_CLEANUP for container 
container_1439640536651_4710_01_003425 taskAttempt 
attempt_1439640536651_4710_m_002807_0
03:43:31,474 INFO [ContainerLauncher #23] 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING 
attempt_1439640536651_4710_m_002807_0
03:43:31,478 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from 
FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
03:43:31,478 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_0 TaskAttempt Transitioned from 
FAIL_TASK_CLEANUP to FAILED
03:43:32,701 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report 
from attempt_1439640536651_4710_m_002807_0: Container killed by the 
ApplicationMaster.
{code}
*2. All logs in AM about second  attempt (only have one split):*
{code}
03:39:55,339 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1439640536651_4710_m_002807_1 TaskAttempt Transitioned from NEW to 
UNASSIGNED
{code}
*3. Checking the log with  time_stamp we can see later available resource are 
all allocated to reduces not the last map attempt after the second attempt 
started :*
{code}
03:39:55,978 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015149 to attempt_1439640536651_4710_r_000669_0
03:39:55,978 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:330 AssignedMaps:1 AssignedReds:669 
CompletedMaps:14257 CompletedReds:0 ContAlloc:14958 ContRel:30 HostLocal:14221 
RackLocal:38
03:40:01,004 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
03:40:01,004 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
03:40:01,004 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015150 to attempt_1439640536651_4710_r_000670_0
03:40:01,004 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:329 AssignedMaps:1 AssignedReds:670 
CompletedMaps:14257 CompletedReds:0 ContAlloc:14959 ContRel:30 HostLocal:14221 
RackLocal:38
03:40:05,025 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
03:40:05,025 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
03:40:05,025 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015151 to attempt_1439640536651_4710_r_000671_0
03:40:05,025 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:328 AssignedMaps:1 AssignedReds:671 
CompletedMaps:14257 CompletedReds:0 ContAlloc:14960 ContRel:30 HostLocal:14221 
RackLocal:38
03:40:15,078 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
03:40:15,078 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
03:40:15,140 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015152 to attempt_1439640536651_4710_r_000672_0
03:40:15,140 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:327 AssignedMaps:1 AssignedReds:672 
CompletedMaps:14257 CompletedReds:0 ContAlloc:14961 ContRel:30 HostLocal:14221 
RackLocal:38
03:40:19,161 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
03:40:19,161 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
03:40:19,161 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015153 to attempt_1439640536651_4710_r_000673_0
03:40:19,161 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:326 AssignedMaps:1 AssignedReds:673 
CompletedMaps:14257 CompletedReds:0 ContAlloc:14962 ContRel:30 HostLocal:14221 
RackLocal:38
03:43:32,701 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed 
container container_1439640536651_4710_01_003425
03:43:32,701 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
03:43:32,701 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
03:43:32,701 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1439640536651_4710_01_015154 to attempt_1439640536651_4710_r_000674_0
03:43:32,701 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:1 ScheduledReds:325 AssignedMaps:0 AssignedReds:674 
CompletedMaps:14257 CompletedReds:0 ContAlloc:14963 ContRel:30 HostLocal:14221 
RackLocal:38
....
{code}



> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Priority: Critical
>
> The scenarios is like this:
> With configuring mapreduce.job.reduce.slowstart.completedmaps=0.8, reduces 
> will take resource and  start to run when all the map have not finished. 
> But It could happened that when all the resources are taken up by running 
> reduces, there is still one map not finished. 
> Under this condition , the last map have two task attempts .
> As for the first attempt was killed due to timeout(mapreduce.task.timeout), 
> and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP, so failed 
> map attempt would not be started. 
> As for the second attempt which was started due to having enable map task 
> speculative is pending at UNASSINGED state because of no resource available. 
> But the second map attempt request have lower priority than reduces, so 
> preemption would not happened.
> As a result all reduces would not finished because of there is one map left. 
> and the last map hanged there because of no resource available. so, the job 
> would never finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

Reply via email to