[
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960468#comment-14960468
]
Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------
Took logs for analysis from Bob offline. The scenario is as under :
1. All the maps have completed.
{panel}
2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: *After Scheduling:*
PendingReds:0 {color:red}ScheduledMaps:0{color} ScheduledReds:651
{color:red}AssignedMaps:0{color} AssignedReds:0
{color:red}CompletedMaps:78{color} CompletedReds:0 ContAlloc:79 ContRel:1
HostLocal:64 RackLocal:14
{panel}
2. One node becomes unstable and hence some of the succeeded map tasks which
ran on that node are killed
{noformat}
2015-10-13 04:53:41,127 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because
it ran on unusable node hdszzdcxdat6g05u06p:26009.
AttemptId:attempt_1437451211867_1485_m_000077_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because
it ran on unusable node hdszzdcxdat6g05u06p:26009.
AttemptId:attempt_1437451211867_1485_m_000026_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because
it ran on unusable node hdszzdcxdat6g05u06p:26009.
AttemptId:attempt_1437451211867_1485_m_000007_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because
it ran on unusable node hdszzdcxdat6g05u06p:26009.
AttemptId:attempt_1437451211867_1485_m_000034_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because
it ran on unusable node hdszzdcxdat6g05u06p:26009.
AttemptId:attempt_1437451211867_1485_m_000015_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because
it ran on unusable node hdszzdcxdat6g05u06p:26009.
AttemptId:attempt_1437451211867_1485_m_000036_0
{noformat}
3. As can be seen below 16 maps are now scheduled
{panel}
2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: *Before
Scheduling:* PendingReds:0 {color:red}ScheduledMaps:16{color} ScheduledReds:651
{color:red}AssignedMaps:0{color} AssignedReds:0
{color:red}CompletedMaps:62{color} CompletedReds:0 ContAlloc:79 ContRel:1
HostLocal:64 RackLocal:14
{panel}
4. Node comes back up again after a while.
5. After this we keep on seeing that reducers keep on getting preempted,
scheduled and this goes on and on in a cycle. And mappers are never
assigned(due to lower priority).
{noformat}
2015-10-13 04:38:40,219 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling:
PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:2 AssignedReds:0
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
2015-10-13 04:38:40,223 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:1 AssignedReds:0
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:0 AssignedReds:0
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling:
PendingReds:0 ScheduledMaps:16 ScheduledReds:651 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
2015-10-13 04:53:42,132 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
2015-10-13 04:54:49,433 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:84 ContRel:6 HostLocal:64
RackLocal:14
2015-10-13 04:54:50,451 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:90 ContRel:12 HostLocal:64
RackLocal:14
2015-10-13 04:54:51,470 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:95 ContRel:17 HostLocal:64
RackLocal:14
2015-10-13 04:54:52,501 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:114 ContRel:36 HostLocal:64
RackLocal:14
2015-10-13 04:54:53,553 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:129 ContRel:51 HostLocal:64
RackLocal:14
2015-10-13 04:54:54,657 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:147 ContRel:69 HostLocal:64
RackLocal:14
2015-10-13 04:54:55,708 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:155 ContRel:77 HostLocal:64
RackLocal:14
.......
2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:641 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:216 ContRel:138 HostLocal:64
RackLocal:14
2015-10-13 04:55:05,923 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:5
CompletedMaps:62 CompletedReds:0 ContAlloc:221 ContRel:138 HostLocal:64
RackLocal:14
2015-10-13 04:55:06,929 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:631 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:11
CompletedMaps:62 CompletedReds:0 ContAlloc:227 ContRel:138 HostLocal:64
RackLocal:14
2015-10-13 04:55:07,945 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:630 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:17
CompletedMaps:62 CompletedReds:0 ContAlloc:233 ContRel:138 HostLocal:64
RackLocal:14
2015-10-13 04:55:08,967 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:630 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21
CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64
RackLocal:14
2015-10-13 04:55:09,967 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling:
PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21
CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64
RackLocal:14
2015-10-13 04:55:09,979 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:16
CompletedMaps:62 CompletedReds:0 ContAlloc:253 ContRel:154 HostLocal:64
RackLocal:14
2015-10-13 04:55:11,013 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10
CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64
RackLocal:14
2015-10-13 04:55:12,013 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling:
PendingReds:646 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10
CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64
RackLocal:14
2015-10-13 04:55:12,031 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:8
CompletedMaps:62 CompletedReds:0 ContAlloc:267 ContRel:168 HostLocal:64
RackLocal:14
2015-10-13 04:55:13,053 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:624 ScheduledMaps:16 ScheduledReds:15 AssignedMaps:0
AssignedReds:12 CompletedMaps:62 CompletedReds:0 ContAlloc:274 ContRel:168
HostLocal:64 RackLocal:14
2015-10-13 04:55:14,061 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:614 ScheduledMaps:16 ScheduledReds:18 AssignedMaps:0
AssignedReds:19 CompletedMaps:62 CompletedReds:0 ContAlloc:281 ContRel:168
HostLocal:64 RackLocal:14
....
2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling:
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44
CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64
RackLocal:14
2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64
RackLocal:14
2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32
CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64
RackLocal:14
2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64
RackLocal:14
2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64
RackLocal:14
2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64
RackLocal:14
...
2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling:
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44
CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64
RackLocal:14
2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64
RackLocal:14
2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32
CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64
RackLocal:14
2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64
RackLocal:14
2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64
RackLocal:14
2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64
RackLocal:14
{noformat}
> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
> Key: MAPREDUCE-6513
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, resourcemanager
> Affects Versions: 2.7.0
> Reporter: Bob
> Assignee: Varun Saxena
> Priority: Critical
>
> when job is in-progress which is having more tasks,one node became unstable
> due to some OS issue.After the node became unstable, the map on this node
> status changed to KILLED state.
> Currently maps which were running on unstable node are rescheduled, and all
> are in scheduled state and wait for RM assign container.Seen ask requests for
> map till Node is good (all those failed), there are no ask request after
> this. But AM keeps on preempting the reducers (it's recycling).
> Finally reducers are waiting for complete mappers and mappers did n't get
> container..
> My Question Is:
> ============
> why map requests did not sent AM ,once after node recovery.?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)