[https://issues.apache.org/jira/browse/MAPREDUCE-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
Subroto Sanyal updated MAPREDUCE-2129:
--------------------------------------
Labels: hadoop (was: )
Status: Patch Available (was: Open)
Aaah... got the problem :-)
The root cause of the problem is this:
Before scheduling reduces, *boolean
org.apache.hadoop.mapred.JobInProgress.scheduleReduces()* makes the
following check:
*finishedMapTasks >= completedMapsForReduceSlowstart*
This check is valid when *mapred.max.map.failures.percent* is 0 (the default),
but once that property is set to a nonzero value, the check above is no longer
correct.
For example, say a job spawns 100 map tasks,
*mapred.reduce.slowstart.completed.maps* is 1 (100%), so
completedMapsForReduceSlowstart is 100, and *mapred.max.map.failures.percent*
is set to 5. In this scenario, reducers should still be scheduled when only 95
maps succeed and the other 5 fail. Looking back at the check above, though, the
condition can never be satisfied once 5 map tasks have failed, because
95 >= 100 is always false.
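To make the arithmetic concrete, here is a minimal Java model of the check. It is a sketch, not the JobTracker source: it assumes the threshold is computed as ceil(slowstart × numMapTasks), and the failure-aware variant (counting permanently failed but tolerated maps toward the threshold) is a hypothetical illustration of one possible fix, not necessarily what the attached patch does. The class and method names are made up for the example.

```java
/**
 * Simplified model of the reduce-slowstart check in JobInProgress.scheduleReduces().
 * ASSUMPTION: threshold = ceil(slowstart * numMapTasks). Sketch only, not Hadoop code.
 */
public class SlowstartCheck {

    /** Finished maps required before reduces may be scheduled (assumed formula). */
    static int completedMapsForReduceSlowstart(float slowstart, int numMapTasks) {
        return (int) Math.ceil(slowstart * numMapTasks);
    }

    /** The check as described above: only successful maps count. */
    static boolean scheduleReducesCurrent(int finishedMapTasks, int threshold) {
        return finishedMapTasks >= threshold;
    }

    /**
     * HYPOTHETICAL failure-aware variant: maps that failed permanently, but within
     * the tolerated failure percentage, also count toward the threshold.
     */
    static boolean scheduleReducesFailureAware(int finishedMapTasks, int failedMapTasks,
                                               int numMapTasks, int maxFailuresPercent,
                                               int threshold) {
        boolean withinTolerance = failedMapTasks * 100 <= maxFailuresPercent * numMapTasks;
        return withinTolerance && finishedMapTasks + failedMapTasks >= threshold;
    }

    public static void main(String[] args) {
        int numMaps = 100;
        int threshold = completedMapsForReduceSlowstart(1.0f, numMaps); // 100
        int finished = 95;
        int failed = 5;
        System.out.println("current check:       "
                + scheduleReducesCurrent(finished, threshold));          // false
        System.out.println("failure-aware check: "
                + scheduleReducesFailureAware(finished, failed, numMaps, 5, threshold)); // true
    }
}
```

With 5 of 100 maps failed at a 5% limit, the current check stays false forever, while the hypothetical variant lets the reduces start.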
As far as I understand, the issue has nothing to do with
*mapreduce.job.committer.setup.cleanup.needed*.
Kang,
We can't call *void org.apache.hadoop.mapred.JobInProgress.jobComplete()*
from *void org.apache.hadoop.mapred.JobInProgress.failedTask(TaskInProgress
tip, TaskAttemptID taskid, TaskStatus status, TaskTracker taskTracker, boolean
wasRunning, boolean wasComplete, boolean wasAttemptRunning)*: failedTask() is
called whenever a task fails, but at that point we still need to wait until
*completedMapsForReduceSlowstart* is reached so that the reduces get spawned.
In my scenario, a wordcount job ran with 111 mappers.
*mapred.reduce.slowstart.completed.maps* was set to 1 (100%), and
*mapreduce.map.failures.maxpercent* was set to 5 (5%). The mapper
implementation was tweaked so that 4 mappers failed.
After some time, 107 mappers had completed, but no reduces were running and the
job stayed stuck indefinitely.
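Replaying the scenario's numbers shows why neither outcome fires. This is a simplified Java model, not JobTracker code; it assumes the slowstart threshold is ceil(slowstart × numMapTasks) and that the job is failed only once failedMaps × 100 exceeds maxpercent × numMaps. The class and method names are hypothetical.

```java
/**
 * Replays the wordcount scenario: 111 maps, 4 failures, slowstart 1.0, maxpercent 5.
 * ASSUMED formulas, not JobTracker source:
 *   threshold  = ceil(slowstart * numMaps)
 *   job failed = failedMaps * 100 > maxFailuresPercent * numMaps
 */
public class WordcountScenario {

    static int threshold(float slowstart, int numMaps) {
        return (int) Math.ceil(slowstart * numMaps);
    }

    static boolean jobFailsOnMapFailures(int failedMaps, int numMaps, int maxFailuresPercent) {
        return failedMaps * 100 > maxFailuresPercent * numMaps;
    }

    /** True when reduces are never scheduled and the job is never failed either. */
    static boolean hangs(int numMaps, int failedMaps, float slowstart, int maxFailuresPercent) {
        int finishedMaps = numMaps - failedMaps; // assume every other map succeeded
        boolean reducesScheduled = finishedMaps >= threshold(slowstart, numMaps);
        boolean jobFailed = jobFailsOnMapFailures(failedMaps, numMaps, maxFailuresPercent);
        return !reducesScheduled && !jobFailed;
    }

    public static void main(String[] args) {
        int numMaps = 111;
        int failedMaps = 4;
        int maxPercent = 5;
        float slowstart = 1.0f;
        System.out.println("finished maps: " + (numMaps - failedMaps));        // 107
        System.out.println("threshold:     " + threshold(slowstart, numMaps)); // 111
        System.out.println("job failed:    " + jobFailsOnMapFailures(failedMaps, numMaps, maxPercent));
        System.out.println("hangs:         " + hangs(numMaps, failedMaps, slowstart, maxPercent));
    }
}
```

With 4 of 111 maps failed (about 3.6%, under the 5% limit), the job is never failed, yet 107 finished maps can never reach the threshold of 111, so the job stays in RUNNING.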
> Job may hang if mapreduce.job.committer.setup.cleanup.needed=false and
> mapreduce.map/reduce.failures.maxpercent>0
> -----------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2129
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2129
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.21.0, 0.20.2, 0.20.1, 0.20.3, 0.21.1, 0.22.0
> Reporter: Kang Xiao
> Labels: hadoop
> Attachments: MAPREDUCE-2129.patch, MAPREDUCE-2129.patch
>
>
> Job may hang at RUNNING state if
> mapreduce.job.committer.setup.cleanup.needed=false and
> mapreduce.map/reduce.failures.maxpercent>0. It happens when some tasks fail
> but haven't reached failures.maxpercent.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira