Hi hadoop folks,
i'm encountering following problem on a 4 node cluster running hadoop-0.20.2.
Its a map only job reading about 9 GB data from outside of hadoop. 31 map tasks
at all while 12 map tasks running at a time.
The first wave of mappers finishes successfully.
Later on the first tasks are failing and they do that shortly before finishing:
Task attempt_201005061210_0002_m_000001_0 failed to report status for 602
seconds. Killing!
Task attempt_201005061210_0002_m_000001_1 failed to report status for 600
seconds. Killing!
Task attempt_201005061210_0002_m_000001_2 failed to report status for 602
seconds. Killing!
The unusual is that i do not find any signs of the job code in the thread dump
the tasktracker takes automatically:
----------------------------------------------------------------
2010-05-05 00:59:03,515 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201004301437_0050_m_000001_0: Task attempt_201004301437_0050_m_000001_0
failed to report status for 601 seconds. Killing!
2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
34 active threads
Thread 29555 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 29554 (JVM Runner jvm_201004301437_0050_m_1465855495 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.unixproc...@34b56f69
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 29550 (Thread-15193):
State: WAITING
Blocked count: 13
Waited count: 14
Waiting on java.lang.obj...@72c09161
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 29501 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 29500 (JVM Runner jvm_201004301437_0050_m_861348535 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.unixproc...@56175cd4
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 29496 (Thread-15162):
State: WAITING
Blocked count: 13
Waited count: 14
Waiting on java.lang.obj...@3b916de2
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 29225 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 29224 (JVM Runner jvm_201004301437_0050_m_-205889314 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.unixproc...@39874d3b
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 29217 (process reaper):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
java.lang.UNIXProcess.waitForProcessExit(Native Method)
java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 29215 (JVM Runner jvm_201004301437_0050_m_-1797469636 spawned.):
State: WAITING
Blocked count: 1
Waited count: 2
Waiting on java.lang.unixproc...@2c39220f
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
org.apache.hadoop.util.Shell.runCommand(Shell.java:186)
org.apache.hadoop.util.Shell.run(Shell.java:134)
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:335)
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:324)
Thread 29208 (Thread-15014):
State: WAITING
Blocked count: 18
Waited count: 14
Waiting on java.lang.obj...@7ddbbae2
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 29203 (Thread-15011):
State: WAITING
Blocked count: 13
Waited count: 14
Waiting on java.lang.obj...@2dac3f6f
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:409)
Thread 26714 (IPC Client (47) connection to /10.0.11.64:9001 from hadoop):
State: TIMED_WAITING
Blocked count: 10075
Waited count: 10075
Stack:
java.lang.Object.wait(Native Method)
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:403)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:445)
Thread 33 (Directory/File cleanup thread):
State: WAITING
Blocked count: 2
Waited count: 615
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@4d6c68b3
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:89)
Thread 9 (taskCleanup):
State: WAITING
Blocked count: 30
Waited count: 92
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@3298407f
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:317)
java.lang.Thread.run(Thread.java:619)
Thread 32 (TaskLauncher for task):
State: WAITING
Blocked count: 127
Waited count: 123
Waiting on java.util.linkedl...@c33377
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
Thread 31 (TaskLauncher for task):
State: WAITING
Blocked count: 3237
Waited count: 3147
Waiting on java.util.linkedl...@67001629
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1601)
Thread 30 (Map-events fetcher for all reduce tasks on
tracker_node1t:localhost/127.0.0.1:48722):
State: WAITING
Blocked count: 712
Waited count: 804
Waiting on java.util.tree...@1fec8cf1
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:606)
Thread 28 (IPC Server handler 7 on 48722):
State: WAITING
Blocked count: 10
Waited count: 3199
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 27 (IPC Server handler 6 on 48722):
State: WAITING
Blocked count: 265
Waited count: 3207
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 26 (IPC Server handler 5 on 48722):
State: WAITING
Blocked count: 296
Waited count: 3206
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 25 (IPC Server handler 4 on 48722):
State: WAITING
Blocked count: 199
Waited count: 3202
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 24 (IPC Server handler 3 on 48722):
State: WAITING
Blocked count: 216
Waited count: 3201
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 23 (IPC Server handler 2 on 48722):
State: WAITING
Blocked count: 119
Waited count: 3199
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 22 (IPC Server handler 1 on 48722):
State: WAITING
Blocked count: 193
Waited count: 3207
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 21 (IPC Server handler 0 on 48722):
State: WAITING
Blocked count: 285
Waited count: 3203
Waiting on
java.util.concurrent.locks.abstractqueuedsynchronizer$conditionobj...@b4848ae
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:939)
Thread 18 (IPC Server listener on 48722):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
org.apache.hadoop.ipc.Server$Listener.run(Server.java:318)
Thread 20 (IPC Server Responder):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
org.apache.hadoop.ipc.Server$Responder.run(Server.java:478)
Thread 17 (Timer-0):
State: TIMED_WAITING
Blocked count: 1
Waited count: 12764
Stack:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Timer.java:509)
java.util.TimerThread.run(Timer.java:462)
Thread 16 (295726...@qtp0-0 - Acceptor0 [email protected]:50060):
State: RUNNABLE
Blocked count: 1310
Waited count: 1
Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Thread 4 (Signal Dispatcher):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
Thread 3 (Finalizer):
State: WAITING
Blocked count: 1337
Waited count: 1336
Waiting on java.lang.ref.referencequeue$l...@5e77533b
Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 1544
Waited count: 1336
Waiting on java.lang.ref.reference$l...@46efbdf1
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
State: RUNNABLE
Blocked count: 152594
Waited count: 279661
Stack:
sun.management.ThreadImpl.getThreadInfo0(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1323)
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1106)
org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_201004301437_0050_m_000001_0
2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot
: current free slots : 1
2010-05-05 00:59:03,520 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_201004301437_0050_m_000001_0 done; removing files.
2010-05-05 00:59:03,521 INFO org.apache.hadoop.mapred.IndexCache: Map ID
attempt_201004301437_0050_m_000001_0 not found in cache
2010-05-05 00:59:05,012 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201004301437_0050_m_-205889314 exited. Number of tasks it ran: 0
2010-05-05 00:59:06,521 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_201004301437_0050_m_000001_0 done; removing files.
2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201004301437_0050_m_000001_0 task's
state:FAILED_UNCLEAN
2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
launch : attempt_201004301437_0050_m_000001_0
2010-05-05 00:59:06,637 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201004301437_0050_m_000001_0
2010-05-05 00:59:08,606 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID:
jvm_201004301437_0050_m_-1095145752 given task:
attempt_201004301437_0050_m_000001_0
2010-05-05 00:59:09,222 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201004301437_0050_m_000001_0 0.0%
2010-05-05 00:59:09,560 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201004301437_0050_m_000001_0 0.0% cleanup
2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: Task
attempt_201004301437_0050_m_000001_0 is done.
2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskTracker: reported
output size for attempt_201004301437_0050_m_000001_0 was 0
2010-05-05 00:59:09,561 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_201004301437_0050_m_000001_0 done; removing files.
----------------------------------------------------------------
The syslog of the attempt just looks like:
----------------------------------------------------------------
2010-05-04 23:48:03,365 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2010-05-04 23:48:03,674 WARN datameer.dap.sdk.util.ManifestMetaData: Failed to
get version from the manifest file of jar
'file:/data/drive0/mapred/tmp/taskTracker/archive/nurago64.local/das/jobjars/61a2133b7887b9be48d42e49002c85e0/stripped-dap-0.24.dev-job.jar/stripped-dap-0.24.dev-job.jar'
: null
2010-05-04 23:48:04,691 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2010-05-04 23:48:04,872 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded
the native-hadoop library
2010-05-04 23:48:04,873 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
Successfully loaded & initialized native-zlib library
2010-05-04 23:48:04,874 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2010-05-05 00:59:08,753 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=CLEANUP, sessionId=
2010-05-05 00:59:09,223 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
cleanup for the task
2010-05-05 00:59:09,442 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201004301437_0050_m_000001_0 is done. And is in the process of
commiting
2010-05-05 00:59:09,562 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201004301437_0050_m_000001_0' done.
----------------------------------------------------------------
Any ideas ?
best regards
Johannes