Amar Kamat wrote:
Amareshwari Sriramadasu wrote:
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to higher values.
The first controls when completed jobs are retired by the jobtracker; the second controls
how long task user logs are retained before being discarded.
As Amareshwari pointed out, this might be the cause. Can you increase
this value and try?
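For example, something along these lines in the cluster's hadoop-site.xml might work
(if I remember right the retirejob interval is in milliseconds; the 72-hour values here
are just an example, and the jobtracker/tasktrackers would need a restart to pick them up):

  <property>
    <name>mapred.jobtracker.retirejob.interval</name>
    <!-- keep completed jobs around for 72 hours (value in ms) instead of the default 24 -->
    <value>259200000</value>
  </property>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <!-- retain task user logs for 72 hours instead of the default 24 -->
    <value>72</value>
  </property>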
Amar
By default, both values are 24 hours. They might be the reason for the
failure, though I'm not sure.
Thanks
Amareshwari
Billy Pearson wrote:
On one of my long-running jobs (about 50-60 hours), I am seeing that after
24 hours all active reduce tasks fail with the error message
java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
Is there something in the config that I can change to stop this?
Every time, within a minute of the 24-hour mark, they all fail at the same
time, which wastes a lot of resources re-downloading the map outputs and
merging them again.
What is the state of the reducers (copy or sort)? Check the
jobtracker/tasktracker logs to see what state these reducers are in and
whether a kill signal was issued. Either the jobtracker/tasktracker is
issuing a kill signal or the reducers are committing suicide. Were there
any failures on the reducer side while pulling the map output? Also, what
is the nature of the job? How fast do the maps finish?
Amar
Billy