[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960468#comment-14960468
 ] 

Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------

Took logs for analysis from Bob offline. The scenario is as under :
1. All the maps have completed.
{panel}
2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: *After Scheduling:* 
PendingReds:0 {color:red}ScheduledMaps:0{color} ScheduledReds:651 
{color:red}AssignedMaps:0{color} AssignedReds:0 
{color:red}CompletedMaps:78{color} CompletedReds:0 ContAlloc:79 ContRel:1 
HostLocal:64 RackLocal:14
{panel}
2. One node becomes unstable and hence some of the succeeded map tasks which 
ran on that node are killed
{noformat}
2015-10-13 04:53:41,127 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because 
it ran on unusable node hdszzdcxdat6g05u06p:26009. 
AttemptId:attempt_1437451211867_1485_m_000077_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because 
it ran on unusable node hdszzdcxdat6g05u06p:26009. 
AttemptId:attempt_1437451211867_1485_m_000026_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because 
it ran on unusable node hdszzdcxdat6g05u06p:26009. 
AttemptId:attempt_1437451211867_1485_m_000007_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because 
it ran on unusable node hdszzdcxdat6g05u06p:26009. 
AttemptId:attempt_1437451211867_1485_m_000034_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because 
it ran on unusable node hdszzdcxdat6g05u06p:26009. 
AttemptId:attempt_1437451211867_1485_m_000015_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because 
it ran on unusable node hdszzdcxdat6g05u06p:26009. 
AttemptId:attempt_1437451211867_1485_m_000036_0
{noformat}

3. As can be seen below 16 maps are now scheduled
{panel}
2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: *Before 
Scheduling:* PendingReds:0 {color:red}ScheduledMaps:16{color} ScheduledReds:651 
{color:red}AssignedMaps:0{color} AssignedReds:0 
{color:red}CompletedMaps:62{color} CompletedReds:0 ContAlloc:79 ContRel:1 
HostLocal:64 RackLocal:14
{panel}

4. Node comes back up again after a while.

5. After this we keep on seeing that reducers keep on getting preempted, 
scheduled and this goes on and on in a cycle. And mappers are never 
assigned(due to lower priority).
{noformat}
2015-10-13 04:38:40,219 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:2 AssignedReds:0 
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 
RackLocal:14
2015-10-13 04:38:40,223 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:1 AssignedReds:0 
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 
RackLocal:14
2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 
RackLocal:14
2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:0 ScheduledMaps:16 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 
RackLocal:14
2015-10-13 04:53:42,132 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 
RackLocal:14
2015-10-13 04:54:49,433 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:84 ContRel:6 HostLocal:64 
RackLocal:14
2015-10-13 04:54:50,451 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:90 ContRel:12 HostLocal:64 
RackLocal:14
2015-10-13 04:54:51,470 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:95 ContRel:17 HostLocal:64 
RackLocal:14
2015-10-13 04:54:52,501 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:114 ContRel:36 HostLocal:64 
RackLocal:14
2015-10-13 04:54:53,553 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:129 ContRel:51 HostLocal:64 
RackLocal:14
2015-10-13 04:54:54,657 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:147 ContRel:69 HostLocal:64 
RackLocal:14
2015-10-13 04:54:55,708 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:155 ContRel:77 HostLocal:64 
RackLocal:14
.......

2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:641 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:0 
CompletedMaps:62 CompletedReds:0 ContAlloc:216 ContRel:138 HostLocal:64 
RackLocal:14
2015-10-13 04:55:05,923 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:5 
CompletedMaps:62 CompletedReds:0 ContAlloc:221 ContRel:138 HostLocal:64 
RackLocal:14
2015-10-13 04:55:06,929 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:631 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:11 
CompletedMaps:62 CompletedReds:0 ContAlloc:227 ContRel:138 HostLocal:64 
RackLocal:14
2015-10-13 04:55:07,945 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:630 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:17 
CompletedMaps:62 CompletedReds:0 ContAlloc:233 ContRel:138 HostLocal:64 
RackLocal:14
2015-10-13 04:55:08,967 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:630 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21 
CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 
RackLocal:14
2015-10-13 04:55:09,967 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21 
CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 
RackLocal:14
2015-10-13 04:55:09,979 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:16 
CompletedMaps:62 CompletedReds:0 ContAlloc:253 ContRel:154 HostLocal:64 
RackLocal:14
2015-10-13 04:55:11,013 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10 
CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 
RackLocal:14
2015-10-13 04:55:12,013 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:646 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10 
CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 
RackLocal:14
2015-10-13 04:55:12,031 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:8 
CompletedMaps:62 CompletedReds:0 ContAlloc:267 ContRel:168 HostLocal:64 
RackLocal:14
2015-10-13 04:55:13,053 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:624 ScheduledMaps:16 ScheduledReds:15 AssignedMaps:0 
AssignedReds:12 CompletedMaps:62 CompletedReds:0 ContAlloc:274 ContRel:168 
HostLocal:64 RackLocal:14
2015-10-13 04:55:14,061 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:614 ScheduledMaps:16 ScheduledReds:18 AssignedMaps:0 
AssignedReds:19 CompletedMaps:62 CompletedReds:0 ContAlloc:281 ContRel:168 
HostLocal:64 RackLocal:14
....

2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44 
CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 
RackLocal:14
2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43 
CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 
RackLocal:14
2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32 
CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 
RackLocal:14
2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38 
CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 
RackLocal:14
2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38 
CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 
RackLocal:14
2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43 
CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 
RackLocal:14
...

2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: 
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44 
CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 
RackLocal:14
2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43 
CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 
RackLocal:14
2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32 
CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 
RackLocal:14
2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38 
CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 
RackLocal:14
2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38 
CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 
RackLocal:14
2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: 
PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43 
CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 
RackLocal:14
{noformat}

                                                                                
                                                

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> when job is in-progress which is having more tasks,one node became unstable 
> due to some OS issue.After the node became unstable, the map on this node 
> status changed to KILLED state. 
> Currently maps which were running on unstable node are rescheduled, and all 
> are in scheduled state and wait for RM assign container.Seen ask requests for 
> map till Node is good (all those failed), there are no ask request after 
> this. But AM keeps on preempting the reducers (it's recycling).
> Finally reducers are waiting for complete mappers and mappers did n't get 
> container..
> My Question Is:
> ============
> why map requests did not sent AM ,once after node recovery.?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to