I noticed some really odd behavior today while reviewing the history
of some of our jobs. Our Ganglia graphs showed long periods of
inactivity across the entire cluster, which should definitely not be
the case - our workflow is a long chain of jobs that should execute
one after another. I figured out which jobs were running during those
idle periods and discovered that almost all of them had 4-5 failed
reduce tasks, with the reason for failure being something like:
Task attempt_200902061117_3382_r_000038_0 failed to report status for
1282 seconds. Killing!
The actual timeout reported varies from 700-5000 seconds. Virtually
all of our longer-running jobs were affected by this problem. The
period of inactivity on the cluster seems to correspond to the amount
of time the job waited for these reduce tasks to fail.
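For context: I know that a reduce which spends a long time inside a
single call is supposed to keep itself alive by calling progress() on
the Reporter, along the lines of the sketch below (old mapred API; the
class name and the summing logic are just illustrative, not our actual
job):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer: calling reporter.progress() resets the
// tasktracker's "failed to report status" clock while we grind
// through a large group of values.
public class SlowSumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      reporter.progress();  // tell the tasktracker we're still alive
    }
    output.collect(key, new IntWritable((int) sum));
  }
}

As far as I can tell that's not what's going on here, though - judging
by the task logs (more on that below), the reduces themselves ran just
fine.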
I checked the tasktracker logs on the machines with timed-out reduce
tasks, looking for anything that might explain the problem, but the
only entry that actually referenced a failed task was this message,
repeated many times:
2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200902061117_3388/
attempt_200902061117_3388_r_000066_0/output/file.out in any of the
configured local directories
I'm not sure what this means; can anyone shed some light on this
message?
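My best guess is that "the configured local directories" means
whatever is listed in mapred.local.dir on the tasktracker, so I hacked
together a quick check along these lines (the property name is my
assumption about what the message refers to, the class is a throwaway,
and the attempt ID is copied from the error above):

import java.io.File;
import org.apache.hadoop.mapred.JobConf;

// Throwaway check: does the file the error complains about exist
// under any of the directories listed in mapred.local.dir?
public class FindTaskOutput {
  public static void main(String[] args) {
    JobConf conf = new JobConf();  // picks up the site config on the classpath
    String relPath = "taskTracker/jobcache/job_200902061117_3388/"
        + "attempt_200902061117_3388_r_000066_0/output/file.out";
    String dirs = conf.get("mapred.local.dir");
    if (dirs == null) {
      System.err.println("mapred.local.dir is not set in this config");
      return;
    }
    for (String dir : dirs.split(",")) {
      File f = new File(dir.trim(), relPath);
      System.out.println(f + (f.exists() ? " exists" : " is missing"));
    }
  }
}

Is that even the right place to look, or does the tasktracker resolve
those paths some other way?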
Further confusing the issue: when I looked in logs/userlogs/<task id>
on the affected machines, to my surprise the directory and log files
existed, and the syslog file seemed to contain the logs of a perfectly
good reduce task!
Overall, this seems like a pretty critical bug. It's consuming up to
50% of the runtime of our jobs in some instances, killing our
throughput. At the very least, it seems like the reduce task timeout
period should be MUCH shorter than the current 10-20 minutes.
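For what it's worth, the knob I'm assuming controls this is
mapred.task.timeout (in milliseconds; I believe the default is 600000,
i.e. 10 minutes), so as a stopgap we may just lower it per job -
roughly:

import org.apache.hadoop.mapred.JobConf;

// Stopgap sketch: cut the unresponsive-task timeout to 2 minutes for
// a single job. mapred.task.timeout is in milliseconds.
public class JobWithShortTimeout {
  public static void main(String[] args) {
    JobConf conf = new JobConf(JobWithShortTimeout.class);
    conf.setLong("mapred.task.timeout", 2 * 60 * 1000L);
    // ... the rest of the job setup (mapper, reducer, input/output
    // paths, JobClient.runJob(conf)) as usual ...
  }
}

That only shortens the stall, though; it doesn't explain why reduce
tasks that apparently finished their work are timing out in the first
place.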
-Bryan