[
https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451623#comment-17451623
]
zhangyangyang commented on MAPREDUCE-6944:
------------------------------------------
[~daemon] [~luxianghao] for 2.6, there isn't function applyRequestLimits() ,
and when i kill the stuck taskAttempt manually,the new attempt can be assigned
& run .
> MR job got hanged forever when some NMs unstable for some time
> --------------------------------------------------------------
>
> Key: MAPREDUCE-6944
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6944
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, resourcemanager
> Affects Versions: 2.7.2
> Reporter: YunFan Zhou
> Priority: Critical
> Attachments: screenshot-1.png
>
>
> We encountered several jobs in the production environment due to the fact
> that some of the NM unstable cause one *MAP* of the job to be stuck there,
> and the job can't finish properly.
> However, the problems we encountered were different from those mentioned in
> [https://issues.apache.org/jira/browse/MAPREDUCE-6513]. Because in our
> scenario, all of *MR REDUCEs* does not start executing.
> But when I manually kill the hanged *MAP*, the job will be finished normally.
> {noformat}
> 2017-08-17 12:25:06,548 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 12:25:07,555 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
> completed container container_e84_1502793246072_73922_01_015700
> 2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating
> schedule, headroom=<memory:2218677, vCores:2225>
> 2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
> PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0
> AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15723 ContRel:26
> HostLocal:4575 RackLocal:8121
> {noformat}
> {noformat}
> 2017-08-17 14:49:41,793 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
> Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:1
> AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15724 ContRel:26
> HostLocal:4575 RackLocal:8121
> 2017-08-17 14:49:41,794 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Applying ask
> limit of 1 for priority:5 and capability:<memory:1024, vCores:1>
> 2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources()
> for application_1502793246072_73922: ask=1 release= 0 newContainers=0
> finishedContainers=0 resourcelimit=<memory:1711989, vCores:1688> knownNMs=4236
> 2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating
> schedule, headroom=<memory:1711989, vCores:1688>
> 2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated
> containers 1
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning
> container Container: [ContainerId:
> container_e84_1502793246072_73922_01_015726, NodeId:
> bigdata-hdp-apache1960.xg01.diditaxi.com:8041, NodeHttpAddress:
> bigdata-hdp-apache1960.xg01.diditaxi.com:8042, Resource: <memory:1024,
> vCores:1>, Priority: 5, Token: Token { kind: ContainerToken, service:
> 10.93.111.36:8041 }, ] to fast fail map
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned from
> earlierFailedMaps
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
> container container_e84_1502793246072_73922_01_015726 to
> attempt_1502793246072_73922_m_012103_5
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating
> schedule, headroom=<memory:1727349, vCores:1703>
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start
> threshold not met. completedMapsForReduceSlowstart 15564
> 2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling:
> PendingReds:1009 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:2
> AssignedReds:0 CompletedMaps:15563 CompletedReds:0 ContAlloc:15725 ContRel:26
> HostLocal:4575 RackLocal:8121
> {noformat}
> {noformat}
> !screenshot-1.png!
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]