Thanks.

It is Hadoop 0.19.0. It looks like a peak of 90 concurrent map tasks and 54 concurrent reduce tasks, not necessarily at the same time. You asked how many are held in memory. I think the number is 100, the default.

It looks like 130 MB peak heap size used on the jobtracker. The default 1000 is allocated, currently.

I set mapred.jobtracker.completeuserjobs.maximum=5
I increased HADOOP_HEAPSIZE to 2000 in hadoop-env.sh.
I included the "kill -3" dump from the jobtracker, far below.

I'll check back in 3 days, if and when the jobtracker fails again.

Dave




On Apr 6, 2009, at 9:42 PM, Amar Kamat wrote:

David Kellogg wrote:
I am running Hadoop streaming. After around 42 jobs on an 18-node cluster, the jobtracker stops responding. This happens on normally- working code. Here are the symptoms.

1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the Jobtracker node:
a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20 seconds) before failure b. jvm.metrics__JobTracker__memHeapUsedM rises above 600 before failure c. jvm.metrics__JobTracker__gcCount rises above 1 k before failure


The ticker looks like this.

09/04/06 03:06:28 INFO streaming.StreamJob:  map 24%  reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob:  map 25%  reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.

In the jobtracker log, I see this.

2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8143535428142072268_95993 failed because recovery from primary datanode 10.1.0.156:50010 failed 4 times. Will retry...

After restarting both dfs and mapreduce on all nodes, the problem goes away, and the formally non-working job proceeds without failure.
David,
What version are you using?
There can be because of :
1) Number of tasks in jobtracker's memory might exceed its limits. What is the total number of tasks in the jobtracker's memory? What is the jobtracker's heap size? Try increasing the heap size and also try setting the mapred.jobtracker.completeuserjobs.maximum parameter to some low value. 2) Sometimes some slow/bad datanode causes the jobtracker to get stuck. As you have mentioned this might be the cause. Can you let us know the output of 'kill -3' on jobtracker process.

Does anyone else see this problem?

David Kellogg


Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b22 mixed mode):

"LeaseChecker" daemon prio=10 tid=0x00007f0c3042a800 nid=0x3ef3 waiting on condition [0x0000000040bf7000..0x0000000040bf7b90]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run (DFSClient.java:979)
        at java.lang.Thread.run(Thread.java:619)

"IPC Server handler 9 on 54311" daemon prio=10 tid=0x00007f0c307f0800 nid=0x3dd9 waiting on condition [0x000000004313f000..0x000000004313fd10]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 8 on 54311" daemon prio=10 tid=0x00007f0c307ef400 nid=0x3dd8 waiting on condition [0x000000004303e000..0x000000004303ed90]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 7 on 54311" daemon prio=10 tid=0x00007f0c307b0400 nid=0x3dd7 waiting on condition [0x0000000042f3d000..0x0000000042f3da10]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 6 on 54311" daemon prio=10 tid=0x00007f0c307af000 nid=0x3dd6 waiting on condition [0x0000000042e3c000..0x0000000042e3ca90]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 5 on 54311" daemon prio=10 tid=0x00007f0c30296800 nid=0x3dd5 waiting on condition [0x0000000042d3b000..0x0000000042d3bb10]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 4 on 54311" daemon prio=10 tid=0x00007f0c302e7c00 nid=0x3dd4 waiting on condition [0x0000000042c3a000..0x0000000042c3ab90]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 3 on 54311" daemon prio=10 tid=0x00007f0c302e6800 nid=0x3dd3 waiting on condition [0x0000000042b39000..0x0000000042b39c10]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 2 on 54311" daemon prio=10 tid=0x00007f0c30337400 nid=0x3dd2 waiting on condition [0x0000000042a38000..0x0000000042a38c90]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 1 on 54311" daemon prio=10 tid=0x00007f0c30336000 nid=0x3dd1 waiting on condition [0x0000000042937000..0x0000000042937d10]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server handler 0 on 54311" daemon prio=10 tid=0x00007f0c302a6000 nid=0x3dd0 waiting on condition [0x0000000042836000..0x0000000042836d90]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take (LinkedBlockingQueue.java:358)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)

"IPC Server listener on 54311" daemon prio=10 tid=0x00007f0c302b6c00 nid=0x3dcf runnable [0x0000000042735000..0x0000000042735a10]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java: 215) at sun.nio.ch.EPollSelectorImpl.doSelect (EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java: 69)
        - locked <0x00007f0c3ca96ac8> (a sun.nio.ch.Util$1)
- locked <0x00007f0c3ca96ab0> (a java.util.Collections $UnmodifiableSet)
        - locked <0x00007f0c3ca96850> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:297)

"IPC Server Responder" daemon prio=10 tid=0x00007f0c302b5800 nid=0x3dce runnable [0x0000000042634000..0x0000000042634a90]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java: 215) at sun.nio.ch.EPollSelectorImpl.doSelect (EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java: 69)
        - locked <0x00007f0c3cba4810> (a sun.nio.ch.Util$1)
- locked <0x00007f0c3cba47f8> (a java.util.Collections $UnmodifiableSet)
        - locked <0x00007f0c3cba4520> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.hadoop.ipc.Server$Responder.run(Server.java:456)

"expireLaunchingTasks" prio=10 tid=0x00007f0c308ba400 nid=0x3dcd waiting on condition [0x0000000042533000..0x0000000042533b10]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobTracker $ExpireLaunchingTasks.run(JobTracker.java:198)
        at java.lang.Thread.run(Thread.java:619)

"retireJobs" prio=10 tid=0x00007f0c303a8800 nid=0x3dcc waiting on condition [0x0000000042432000..0x0000000042432b90]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobTracker$RetireJobs.run (JobTracker.java:351)
        at java.lang.Thread.run(Thread.java:619)

"expireTrackers" prio=10 tid=0x00007f0c36940c00 nid=0x3dcb waiting on condition [0x0000000041c7f000..0x0000000041c7fc10]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobTracker$ExpireTrackers.run (JobTracker.java:281)
        at java.lang.Thread.run(Thread.java:619)

"initJobs" prio=10 tid=0x00007f0c3693fc00 nid=0x3dca in Object.wait() [0x0000000041b7e000..0x0000000041b7ec90]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007f0c3ca91d90> (a java.util.ArrayList)
        at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.mapred.EagerTaskInitializationListener $JobInitThread.run(EagerTaskInitializationListener.java:51)
        - locked <0x00007f0c3ca91d90> (a java.util.ArrayList)
        at java.lang.Thread.run(Thread.java:619)

"Thread-14" prio=10 tid=0x00007f0c3602f400 nid=0x3dc9 waiting on condition [0x0000000041a7d000..0x0000000041a7dd10]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3cf03938> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park (LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer $ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:39)
at org.apache.hadoop.mapred.JobEndNotifier$1.run (JobEndNotifier.java:50)
        at java.lang.Thread.run(Thread.java:619)

"Timer thread for monitoring mapred" daemon prio=10 tid=0x00007f0c36d51000 nid=0x3dc4 in Object.wait() [0x000000004197c000..0x000000004197cb10]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.util.TimerThread.mainLoop(Timer.java:509)
        - locked <0x00007f0c3ceabad8> (a java.util.TaskQueue)
        at java.util.TimerThread.run(Timer.java:462)

"Timer thread for monitoring jvm" daemon prio=10 tid=0x00007f0c36d86000 nid=0x3dc3 in Object.wait() [0x00000000407b3000..0x00000000407b3b90]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.util.TimerThread.mainLoop(Timer.java:509)
        - locked <0x00007f0c3ceaeaf8> (a java.util.TaskQueue)
        at java.util.TimerThread.run(Timer.java:462)

"SocketListener0-1" prio=10 tid=0x00007f0c36a4f000 nid=0x3dc2 in Object.wait() [0x000000004187b000..0x000000004187bc10]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
at org.mortbay.util.ThreadPool$PoolThread.run (ThreadPool.java:522) - locked <0x00007f0c3ceb1760> (a org.mortbay.util.ThreadPool $PoolThread)

"SocketListener0-0" prio=10 tid=0x00007f0c36d69400 nid=0x3dc1 in Object.wait() [0x000000004177a000..0x000000004177ac90]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
at org.mortbay.util.ThreadPool$PoolThread.run (ThreadPool.java:522) - locked <0x00007f0c3ceb1a40> (a org.mortbay.util.ThreadPool $PoolThread)

"Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50030]" prio=10 tid=0x00007f0c36b6a800 nid=0x3dc0 runnable [0x0000000041679000..0x0000000041679d10]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
        - locked <0x00007f0c3cc3b5c8> (a java.net.SocksSocketImpl)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
at org.mortbay.util.ThreadedServer.acceptSocket (ThreadedServer.java:432) at org.mortbay.util.ThreadedServer$Acceptor.run (ThreadedServer.java:631)

"SessionScavenger" daemon prio=10 tid=0x00007f0c303cc000 nid=0x3dbf waiting on condition [0x0000000040e0b000..0x0000000040e0bd90]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager $SessionScavenger.run(AbstractSessionManager.java:587)

"SessionScavenger" daemon prio=10 tid=0x00007f0c30346800 nid=0x3dbe waiting on condition [0x0000000041de8000..0x0000000041de8a10]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager $SessionScavenger.run(AbstractSessionManager.java:587)

"SessionScavenger" daemon prio=10 tid=0x00007f0c3031e400 nid=0x3dbd waiting on condition [0x0000000041578000..0x0000000041578a90]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager $SessionScavenger.run(AbstractSessionManager.java:587)

"Low Memory Detector" daemon prio=10 tid=0x00000000401c9400 nid=0x3d7c runnable [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00000000401c6000 nid=0x3d7b waiting on condition [0x0000000000000000..0x00000000423302b0]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00000000401c4000 nid=0x3d7a waiting on condition [0x0000000000000000..0x000000004222f330]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00000000401c2800 nid=0x3d79 waiting on condition [0x0000000000000000..0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x000000004019d800 nid=0x3d78 in Object.wait() [0x000000004202e000..0x000000004202eb90]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
- locked <0x00007f0c3cbfe4e0> (a java.lang.ref.ReferenceQueue $Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
at java.lang.ref.Finalizer$FinalizerThread.run (Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x0000000040196c00 nid=0x3d77 in Object.wait() [0x0000000041f2d000..0x0000000041f2dc10]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run (Reference.java:116)
        - locked <0x00007f0c3caf1408> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x0000000040111c00 nid=0x3d73 in Object.wait() [0x0000000041477000..0x0000000041477ed0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
- waiting on <0x00007f0c3cba4020> (a org.apache.hadoop.ipc.RPC$Server)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Server.join(Server.java:1021)
- locked <0x00007f0c3cba4020> (a org.apache.hadoop.ipc.RPC $Server) at org.apache.hadoop.mapred.JobTracker.offerService (JobTracker.java:1309) at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java: 2731)

"VM Thread" prio=10 tid=0x0000000040191c00 nid=0x3d76 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004011c400 nid=0x3d74 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004011d800 nid=0x3d75 runnable

"VM Periodic Task Thread" prio=10 tid=0x00000000401cb000 nid=0x3d7d waiting on condition

JNI global references: 964

Heap
PSYoungGen total 198528K, used 68487K [0x00007f0c66540000, 0x00007f0c77830000, 0x00007f0c7b290000) eden space 142144K, 48% used [0x00007f0c66540000,0x00007f0c6a821f70,0x00007f0c6f010000) from space 56384K, 0% used [0x00007f0c6f010000,0x00007f0c6f010000,0x00007f0c72720000) to space 57600K, 0% used [0x00007f0c73ff0000,0x00007f0c73ff0000,0x00007f0c77830000) PSOldGen total 682688K, used 672276K [0x00007f0c3ca90000, 0x00007f0c66540000, 0x00007f0c66540000) object space 682688K, 98% used [0x00007f0c3ca90000,0x00007f0c65b15068,0x00007f0c66540000) PSPermGen total 21248K, used 16555K [0x00007f0c37690000, 0x00007f0c38b50000, 0x00007f0c3ca90000) object space 21248K, 77% used [0x00007f0c37690000,0x00007f0c386bae60,0x00007f0c38b50000)

Reply via email to