[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131973#comment-16131973
 ] 

YunFan Zhou commented on MAPREDUCE-6485:
----------------------------------------

[~varun_impala_149e] [~rohithsharma] Hi Rohith Sharma K S and Karthik Kambatla,
could you please help me with a problem?
A job I ran hung, but in a different way from the scenario you encountered. 
The job has 15564 maps in total, and only one map is pending. All reduces are 
pending because that map has not finished, yet the cluster and queue resources 
are largely idle. The job stayed pending for about 12 hours until I used the 
MR CLI to fail the map manually, after which the job finished normally.
The AM log reports the following:

{noformat}
2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:814012, vCores:-1>
2017-08-17 07:58:44,401 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
{noformat}
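
The last log line reports a threshold of 15564, which equals the total number 
of maps in the job. Below is a minimal sketch of that arithmetic. It assumes 
the threshold is the ceiling of the slowstart fraction times the total map 
count, which is my understanding of RMContainerAllocator and is not confirmed 
in this thread. The class and method names are hypothetical, and the slowstart 
value of 1.0 is inferred from the log, not stated in the comment.

```java
public class SlowStartThreshold {

    // Assumption: threshold = ceil(slowstart * totalMaps).
    public static int completedMapsForReduceSlowstart(float slowstart, int totalMaps) {
        return (int) Math.ceil(slowstart * totalMaps);
    }

    // Reduces may be scheduled only once enough maps have completed.
    public static boolean thresholdMet(float slowstart, int totalMaps, int completedMaps) {
        return completedMaps >= completedMapsForReduceSlowstart(slowstart, totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 15564; // total maps reported in the comment

        // With slowstart = 1.0 the threshold equals the total map count.
        System.out.println(completedMapsForReduceSlowstart(1.0f, totalMaps));

        // With slowstart = 0.8 (the value in the issue description) it is lower.
        System.out.println(completedMapsForReduceSlowstart(0.8f, totalMaps));

        // One map still pending: the threshold is not met when slowstart = 1.0.
        System.out.println(thresholdMet(1.0f, totalMaps, totalMaps - 1));
    }
}
```

Under that assumption, a threshold of 15564 for 15564 maps means no reduce can 
be scheduled until the last map completes, which would be consistent with 
"Reduce slow start threshold not met" being logged while one map is pending.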


> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.4.1, 2.6.0, 2.7.1, 3.0.0-alpha1
>            Reporter: Bob.zhao
>            Assignee: Xianyin Xin
>            Priority: Critical
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6485.004.patch, 
> MAPREDUCE-6485.005.patch, MAPREDUCE-6485.006.patch, MAPREDUCE-6845.002.patch, 
> MAPREDUCE-6845.003.patch
>
>
> The scenario is as follows:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces 
> take resources and start to run before all the maps have finished. 
> It can then happen that all the resources are taken up by running reduces 
> while one map is still unfinished. 
> In this situation, the last map has two task attempts.
> The first attempt was killed due to a timeout (mapreduce.task.timeout), and 
> its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to 
> FAILED. The failed map attempt is not restarted because a speculative map 
> attempt is still in progress. 
> The second attempt, which was started because map task speculation is 
> enabled, is pending in the UNASSIGNED state because no resources are available. 
> However, the request for the second map attempt has a lower priority than the 
> reduces, so preemption does not happen.
> As a result, none of the reduces can finish because one map is left, 
> and the last map hangs because no resources are available, so the job 
> never finishes.
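
The starvation described above can be illustrated with a toy model. This is 
only a sketch under simplifying assumptions, not Hadoop code: the cluster is a 
fixed number of container slots, a map takes one tick to run, and the 
hypothetical flag preemptReduceForMap stands in for whether the allocator is 
willing to preempt a reduce on behalf of the pending map attempt. All names 
are invented for the illustration.

```java
public class LastMapStarvation {

    /**
     * Returns the tick at which the last map finishes, or -1 if it is
     * still waiting for a slot after maxTicks.
     */
    public static int simulate(int slots, int runningReduces,
                               boolean preemptReduceForMap, int maxTicks) {
        boolean mapRunning = false;
        for (int tick = 1; tick <= maxTicks; tick++) {
            if (mapRunning) {
                // The map got a slot on the previous tick and finishes now.
                return tick;
            }
            if (runningReduces < slots) {
                // A free slot exists, so the pending map attempt is assigned.
                mapRunning = true;
            } else if (preemptReduceForMap && runningReduces > 0) {
                // Preempt one reduce; its slot becomes free on the next tick.
                runningReduces--;
            }
            // Otherwise nothing changes: reduces hold every slot and wait
            // for the map, while the map waits for a slot.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Reduces hold all 4 slots and no preemption happens: the map never runs.
        System.out.println(simulate(4, 4, false, 1000));
        // Same setup, but a reduce is preempted for the map.
        System.out.println(simulate(4, 4, true, 1000));
    }
}
```

In this model the job only makes progress when a slot is freed for the map, 
either by preemption or by spare capacity. That matches the description: 
without preemption on behalf of the last map attempt, the reduces and the map 
wait on each other indefinitely.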



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
