[jira] [Commented] (MAPREDUCE-6485) MR job hanged forever because all resources are taken up by reducers and the last map attempt never get resource to run

Rohith Sharma K S (JIRA) Thu, 24 Sep 2015 23:38:39 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907693#comment-14907693
 ]


Rohith Sharma K S commented on MAPREDUCE-6485:
----------------------------------------------

TaskAttemptStateInternal is exposed only for testing, but it is being used in 
TaskImpl. I am not sure that is a good approach. 
How about adding a new method to TaskAttemptImpl that checks for the NEW and 
UNASSIGNED states and returns a boolean, as below? The newly exposed method can 
then be used in TaskImpl.
{noformat}
// "Scheduled" in the life cycle of a task means its ResourceRequests have been
// sent to the RM but are not yet assigned.
private static final EnumSet<TaskAttemptStateInternal> SCHEDULED_TASK_STATES =
    EnumSet.of(TaskAttemptStateInternal.NEW, TaskAttemptStateInternal.UNASSIGNED);

public boolean isScheduledTask() {
  return SCHEDULED_TASK_STATES.contains(getInternalState());
}
{noformat}
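
To make the suggestion concrete, here is a minimal self-contained sketch of the same idea. The enum AttemptState and the class Attempt are hypothetical stand-ins for TaskAttemptStateInternal and TaskAttemptImpl, with only a few states listed; this is not the actual Hadoop code, just an illustration of keeping the internal state private and exposing a boolean check.

```java
import java.util.EnumSet;

public class ScheduledStateSketch {
    // Hypothetical stand-in for TaskAttemptStateInternal (subset of states only).
    enum AttemptState { NEW, UNASSIGNED, ASSIGNED, RUNNING, FAILED }

    // States in which a request has been made but no container is assigned yet.
    private static final EnumSet<AttemptState> SCHEDULED_STATES =
        EnumSet.of(AttemptState.NEW, AttemptState.UNASSIGNED);

    // Hypothetical stand-in for TaskAttemptImpl.
    static class Attempt {
        private final AttemptState state;

        Attempt(AttemptState state) {
            this.state = state;
        }

        // Callers get a boolean and never see the internal state enum.
        boolean isScheduledTask() {
            return SCHEDULED_STATES.contains(state);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Attempt(AttemptState.UNASSIGNED).isScheduledTask());
        System.out.println(new Attempt(AttemptState.RUNNING).isScheduledTask());
    }
}
```

With this shape, a caller such as TaskImpl would only need to ask each attempt whether it is still scheduled, without importing the internal state type.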

> MR job hanged forever because all resources are taken up by reducers and the 
> last map attempt never get resource to run
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6485
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.0.0, 2.4.1, 2.6.0, 2.7.1
>            Reporter: Bob
>            Assignee: Xianyin Xin
>            Priority: Critical
>         Attachments: MAPREDUCE-6485.001.patch, MAPREDUCE-6845.002.patch
>
>
> The scenario is like this:
> With mapreduce.job.reduce.slowstart.completedmaps=0.8 configured, reduces 
> take resources and start to run before all the maps have finished. 
> It can happen that when all the resources are taken up by running 
> reduces, there is still one map not finished. 
> Under this condition, the last map has two task attempts.
> The first attempt was killed due to timeout (mapreduce.task.timeout), 
> and its state transitioned from RUNNING to FAIL_CONTAINER_CLEANUP and then to 
> FAILED, but the failed map attempt is not restarted because there is still one 
> speculative map attempt in progress. 
> The second attempt, which was started because map task speculation is 
> enabled, is pending in the UNASSIGNED state because no resources are available. 
> But the second map attempt's request has lower priority than the reduces, so 
> preemption does not happen.
> As a result, the reduces can never finish because there is one map left, 
> and the last map hangs because no resources are available, so the job 
> never finishes.
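
The slow-start condition in the quoted description can be illustrated with a simplified model. The class SlowStartSketch and the method reducesMayStart are hypothetical names, and the threshold arithmetic is an assumption about how such a fraction is typically applied, not a copy of the application master's logic.

```java
public class SlowStartSketch {
    // Simplified model (assumption): reduces become eligible once the number of
    // completed maps reaches ceil(slowstart * totalMaps).
    static boolean reducesMayStart(int completedMaps, int totalMaps, double slowstart) {
        int threshold = (int) Math.ceil(slowstart * totalMaps);
        return completedMaps >= threshold;
    }

    public static void main(String[] args) {
        // With slowstart = 0.8 and 100 maps, reduces may start while up to
        // 20 maps are still unfinished.
        System.out.println(reducesMayStart(79, 100, 0.8));
        System.out.println(reducesMayStart(80, 100, 0.8));
    }
}
```

Under this model, reduces can occupy every available container while a map is still outstanding, which is the precondition for the hang described above.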



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
