We didn't customize this value, to my knowledge, so I'd suspect it's the default.
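(For reference: assuming the value in question is mapred.task.timeout — the usual suspect for "failed to report status" kills in Hadoop of this era — the stock setting is 600000 ms, i.e. 10 minutes. A sketch of what an explicit override in hadoop-site.xml would look like:)

```xml
<!-- Sketch, assuming the timeout being discussed is mapred.task.timeout:
     milliseconds a task may go without reporting status before the
     TaskTracker kills it. 600000 ms (10 min) is the shipped default. -->
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>
```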
-Bryan
On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote:
How often do your reduce tasks report status?
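(Context for Ted's question: a task that goes mapred.task.timeout milliseconds without reporting status gets killed exactly as in the log below. Long per-key reduce work needs to call Reporter.progress() periodically to reset that clock. A minimal sketch of the pattern — the stub Reporter interface here is a hypothetical stand-in for org.apache.hadoop.mapred.Reporter, and the every-1000-values threshold is arbitrary:)

```java
import java.util.List;

// Hypothetical stand-in for org.apache.hadoop.mapred.Reporter.
interface Reporter {
    void progress(); // heartbeat: resets the task's liveness timer
}

class SlowReducer {
    // Sketch of a reducer doing expensive per-value work: without the
    // periodic progress() call, a big enough value list would blow past
    // the task timeout and get the attempt killed.
    void reduce(String key, List<Integer> values, Reporter reporter) {
        long processed = 0;
        for (int v : values) {
            // ... expensive per-value work here ...
            if (++processed % 1000 == 0) {
                reporter.progress(); // "still alive"
            }
        }
    }
}
```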
On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:
(Repost from the dev list)
I noticed some really odd behavior today while reviewing the job history of some of our jobs. Our Ganglia graphs showed really long periods of inactivity across the entire cluster, which should definitely not be the case - we have a really long string of jobs in our workflow that should execute one after another. I figured out which jobs were running during those periods of inactivity, and discovered that almost all of them had 4-5 failed reduce tasks, with the reason for failure being something like:

Task attempt_200902061117_3382_r_000038_0 failed to report status for 1282 seconds. Killing!
The actual timeout reported varies from 700-5000 seconds. Virtually all of our longer-running jobs were affected by this problem. The period of inactivity on the cluster seems to correspond to the amount of time the job waited for these reduce tasks to fail.
I checked out the tasktracker log for the machines with timed-out reduce tasks looking for something that might explain the problem, but the only thing I came up with that actually referenced the failed task was this log message, which was repeated many times:

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out in any of the configured local directories
I'm not sure what this means; can anyone shed some light on this message? Further confusing the issue, on the affected machines, I looked in logs/userlogs/<task id>, and to my surprise, the directory and log files existed, and the syslog file seemed to contain the logs of a perfectly good reduce task!
Overall, this seems like a pretty critical bug. It's consuming up to 50% of the runtime of our jobs in some instances, killing our throughput. At the very least, it seems like the reduce task timeout period should be MUCH shorter than the current 10-20 minutes.
-Bryan
--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)