David Kellogg wrote:
I am running Hadoop streaming. After around 42 jobs on an 18-node
cluster, the jobtracker stops responding. This happens on
normally-working code. Here are the symptoms:
1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the JobTracker node (a jstat cross-check
sketch follows this list):
a. jvm.metrics__JobTracker__gcTimeMillis rises above 20,000 (20
seconds) before failure
b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 (MB) before
failure
c. jvm.metrics__JobTracker__gcCount rises above 1,000 before failure
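As a direct cross-check on those Ganglia numbers, one can watch the
JobTracker JVM's garbage collection with jstat. A minimal sketch,
assuming the JDK's jps and jstat tools are on the PATH of the
JobTracker node:

  # Find the JobTracker's JVM PID.
  JT_PID=$(jps | awk '/JobTracker/ {print $1}')

  # Sample GC utilization every 10 seconds; FGC (full-GC count) and
  # FGCT (full-GC time) climbing steadily mirrors the gcCount and
  # gcTimeMillis curves above.
  jstat -gcutil "$JT_PID" 10000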
The ticker looks like this.
09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob: map 25% reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.
In the jobtracker log, I see this.
2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-8143535428142072268_95993 failed because
recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will
retry...
After restarting both dfs and mapreduce on all nodes, the problem goes
away, and the formerly failing job runs through without error.
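For reference, the full restart described above can be done with the
stock control scripts in Hadoop's bin/ directory; a sketch, run from
the master node (exact paths depend on the install):

  bin/stop-mapred.sh && bin/stop-dfs.sh
  bin/start-dfs.sh && bin/start-mapred.sh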
David,
What version are you using?
There are a couple of possible causes:
1) The number of tasks held in the jobtracker's memory might be
exceeding what its heap can accommodate. What is the total number of
tasks in the jobtracker's memory? What is the jobtracker's heap size?
Try increasing the heap size, and also try setting the
mapred.jobtracker.completeuserjobs.maximum parameter to some low value
(a config sketch follows this list).
2) Sometimes a slow or bad datanode causes the jobtracker to get stuck.
As you have mentioned, this might be the cause here. Can you send us
the output of 'kill -3' on the jobtracker process? (A thread-dump
sketch also follows below.)
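For 1), a minimal sketch of the two changes, assuming a 0.19-era layout
where conf/hadoop-env.sh sets the daemon heap and conf/hadoop-site.xml
holds job parameters (mapred-site.xml on 0.20+); the values shown are
illustrative, not tuned recommendations:

  # In conf/hadoop-env.sh: raise the daemon heap (value is in MB).
  export HADOOP_HEAPSIZE=2000

  # In conf/hadoop-site.xml: keep fewer completed jobs per user in
  # jobtracker memory (the default is 100).
  <property>
    <name>mapred.jobtracker.completeuserjobs.maximum</name>
    <value>5</value>
  </property>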
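For 2), kill -3 sends SIGQUIT, and the JVM prints its stack traces to
the daemon's stdout, which the Hadoop scripts redirect to the .out file
under the log directory. A sketch (the file name varies with user and
host):

  JT_PID=$(jps | awk '/JobTracker/ {print $1}')
  kill -3 "$JT_PID"
  # The thread dump appears in something like:
  #   logs/hadoop-<user>-jobtracker-<host>.out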
Does anyone else see this problem?
David Kellogg