David Kellogg wrote:
I am running Hadoop streaming. After around 42 jobs on an 18-node cluster, the jobtracker stops responding. This happens on normally-working code. Here are the symptoms.

1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the Jobtracker node:
     a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20 seconds) before failure
     b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 before failure
     c. jvm.metrics__JobTracker__gcCount rises above 1 k before failure


The ticker looks like this.

09/04/06 03:06:28 INFO streaming.StreamJob:  map 24%  reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob:  map 25%  reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.

In the jobtracker log, I see this.

2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...
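As a quick check for how often this warning is recurring, the jobtracker log can be grepped for the recovery message. This is a minimal sketch: the log path is an assumption (adjust it to your `$HADOOP_LOG_DIR`), and the script writes a one-line sample log so it runs standalone for illustration.

```shell
#!/bin/sh
# Count DFSClient block-recovery warnings in the jobtracker log.
# The path below is a placeholder; point it at your real jobtracker log.
log="${1:-/tmp/jobtracker.sample.log}"

# Write a one-line sample so this sketch is self-contained; with a real
# log, delete this heredoc and pass the log path as the first argument.
cat > "$log" <<'EOF'
2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...
EOF

# Print the number of matching lines.
grep -c 'Error Recovery for block' "$log"
```

A steadily climbing count against one datanode address points at that node as the slow/bad one.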

After restarting both dfs and mapreduce on all nodes, the problem goes away, and the formerly non-working job proceeds without failure.
David,
What version are you using?
This can happen because of:
1) The number of tasks in the jobtracker's memory might exceed its limits. What is the total number of tasks in the jobtracker's memory? What is the jobtracker's heap size? Try increasing the heap size, and also try setting the mapred.jobtracker.completeuserjobs.maximum parameter to some low value.
2) Sometimes a slow/bad datanode causes the jobtracker to get stuck. As you have mentioned, this might be the cause. Can you let us know the output of 'kill -3' on the jobtracker process?
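A sketch of suggestion 1) as configuration, assuming Hadoop 0.18/0.19-era property names; the value shown is an example, not a recommendation:

```xml
<!-- mapred-site.xml (hadoop-site.xml on older releases): retain fewer
     completed jobs per user in JobTracker memory. Default is 100;
     the value 5 here is only illustrative. -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>5</value>
</property>
```

The daemon heap is typically raised via HADOOP_HEAPSIZE (in MB) in conf/hadoop-env.sh. For suggestion 2), 'kill -3' sends SIGQUIT, which makes the JVM write a thread dump to the JobTracker's .out log without terminating the process.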

Does anyone else see this problem?

David Kellogg
