David Kellogg wrote:
I am running Hadoop streaming. After around 42 jobs on an 18-node cluster, the jobtracker stops responding. This happens on normally-working code. Here are the symptoms.

1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the Jobtracker node:
     a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20 seconds) before failure
     b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 before failure
     c. jvm.metrics__JobTracker__gcCount rises above 1 k before failure


The ticker looks like this.

09/04/06 03:06:28 INFO streaming.StreamJob:  map 24%  reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob:  map 25%  reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.

In the jobtracker log, I see this.

2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...
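As a quick check for how often this warning is recurring, the jobtracker log can be grepped for the recovery message. This is a minimal sketch: the log path is an assumption (adjust it to your `$HADOOP_LOG_DIR`), and the script writes a one-line sample log so it runs standalone for illustration.

```shell
#!/bin/sh
# Count DFSClient block-recovery warnings in the jobtracker log.
# The path below is a placeholder; point it at your real jobtracker log.
log="${1:-/tmp/jobtracker.sample.log}"

# Write a one-line sample so this sketch is self-contained; with a real
# log, delete this heredoc and pass the log path as the first argument.
cat > "$log" <<'EOF'
2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...
EOF

# Print the number of matching lines.
grep -c 'Error Recovery for block' "$log"
```

A steadily climbing count against one datanode address points at that node as the slow/bad one.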

After restarting both dfs and mapreduce on all nodes, the problem goes away, and the formerly non-working job proceeds without failure.
David,
What version are you using?
This can happen because of:
1) The number of tasks in the jobtracker's memory might exceed its limits. What is the total number of tasks in the jobtracker's memory? What is the jobtracker's heap size? Try increasing the heap size, and also try setting the mapred.jobtracker.completeuserjobs.maximum parameter to some low value.
2) Sometimes a slow/bad datanode causes the jobtracker to get stuck. As you have mentioned, this might be the cause. Can you let us know the output of 'kill -3' on the jobtracker process?
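A sketch of suggestion 1) as configuration, assuming Hadoop 0.18/0.19-era property names; the value shown is an example, not a recommendation:

```xml
<!-- mapred-site.xml (hadoop-site.xml on older releases): retain fewer
     completed jobs per user in JobTracker memory. Default is 100;
     the value 5 here is only illustrative. -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>5</value>
</property>
```

The daemon heap is typically raised via HADOOP_HEAPSIZE (in MB) in conf/hadoop-env.sh. For suggestion 2), 'kill -3' sends SIGQUIT, which makes the JVM write a thread dump to the JobTracker's .out log without terminating the process.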

Does anyone else see this problem?

David Kellogg
