Hi Tibor, I'm afraid I don't have time to go over the logs you've attached, but
> just the node failure interfered with another problem, which may be marked by
> "User-directories for the user ec2-user are already initialized on this TT.
> Not doing anything."
>

This log line was marked as INFO. Usually that means it is expected and nothing to worry about. The log is saying that this user has run tasks before on this TaskTracker node, and the temp user/job directories are already set up, so no additional work is needed (in the initializeUserDirs method).

BTW, setting a long timeout for tasks is always a bad idea. It would be good if you could modify your application so that it reports progress, and set the timeout back to close to 10 minutes.

Koji

On 2/1/11 2:07 PM, "Kiss Tibor" <kiss.ti...@gmail.com> wrote:

> Thank you Koji,
>
> I don't think this case has anything to do with CDH specifically. (I usually
> run both kinds of clusters, Apache and CDH, using Whirr.)
>
> I think I extracted the most interesting part of the events.
> I identified the task whose attempts failed 4 times. (I now notice that
> failing 4 times is the precondition for failing the whole job too. Am I
> correct?)
> Then I saved the most interesting logs from the TaskTracker node where it
> failed the first time, and the logs from the namenode + jobtracker instance.
>
> In short, here is what happened with that attempt.
> 2011-01-28 13:45:03,741 was the job initialization time.
> ....
> The four attempts were scheduled to the following trackers:
> 2011-01-28 19:46:15,153
>   tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196
> 2011-01-29 00:46:51,376
>   tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193
> 2011-01-29 05:47:55,312
>   tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557
> 2011-01-29 10:49:35,833
>   tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500
>
> The tracker_ip-10-100-205-180.ec2.internal instance (where the last two
> attempts were scheduled) died so hard that I couldn't connect to it with ssh
> anymore.
>
> The long interval between retries is due to an unusually large timeout we
> set. (We are running statistical algorithms on a relatively large split size
> of biological data, and the running time of a task depends heavily on the
> content; that's why we chose to do some measurements to estimate the optimal
> split size and the running-time distribution. We noticed that 30 minutes
> wasn't enough for some tasks, so we increased the timeout to a value that all
> tasks would fit into, deferring the decision on how to reduce the split size
> so that the running times become efficient.)
>
> After the failed node was marked dead, I reran the job with the same data,
> and the job finished in approximately 2 hours.
> (The algorithm is repeatable; it contains no randomness that would produce
> different running times on a rerun.)
> It looks like we had a severe slowdown before the node died. The real cause
> of the slowdown may be the EC2 instance that finally died; unfortunately we
> weren't monitoring in enough detail to catch such a slowdown. Is it possible
> for a slowdown on one node to have such a huge impact on an entire job
> running on 8 workers?
>
> Looking into the tasktracker log of the first attempt of the failed task,
> searching for the cause of the failure, I see two different things.
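For reference, a minimal sketch of the progress-reporting fix Koji suggests above, written against the old org.apache.hadoop.mapred API that these logs come from; the LongRunningMapper class and its analyze() step are hypothetical placeholders, not the application's actual code:

    // Sketch: a long-running map task that keeps the TaskTracker informed,
    // so the task timeout can stay near its default instead of 5 hours.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LongRunningMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Suppose each record needs many expensive passes whose total
        // running time can exceed any fixed timeout.
        for (int pass = 0; pass < 1000; pass++) {
          analyze(value, pass);               // hypothetical heavy computation
          reporter.progress();                // tell the TT the task is alive
          reporter.incrCounter("app", "passes", 1);  // also visible on the web UI
        }
        output.collect(new Text(key.toString()), value);
      }

      private void analyze(Text record, int pass) {
        // placeholder for the real statistical computation
      }
    }

With the task pinging the TaskTracker on every pass like this, mapred.task.timeout can stay near its 600000 ms (10-minute) default.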
> One of them is at the beginning:
>
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28 -----
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201101261418_0021_m_000193_0 task's state:UNASSIGNED
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories for the user ec2-user are already initialized on this TT. Not doing anything.
> 2011-01-28 19:46:15,169 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM: jvm_201101261418_0021_m_456866840
>
> (Note that I see the same "Killing JVM" message several times with different attempts.)
>
> Then at the timeout:
>
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28 -----
> 2011-01-29 00:46:31,011 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261418_0021_m_000193_0: Task attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
> 2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
> 43 active threads..... (here I have a long stack trace, see below)
>
> The "User-directories for the user ec2-user are already initialized on this TT. Not doing anything." message is very interesting. After doing nothing, a timeout may make some sense, even when it only fires at a 5-hour limit.
>
> The other thing is that an attempt may happen to be rescheduled for the 4th time onto a failing node. In that case the job failover mechanism can no longer be effective. At least I know that the job failover mechanism itself is not buggy; the node failure just interfered with another problem, which may be marked by "User-directories for the user ec2-user are already initialized on this TT. Not doing anything."
>
> Here are the log extracts in detail:
>
> ------------------ jobtracker log ----------------------------
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201101261418_0021_m_000193
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201101261418_0021_m_000193_0' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196'
>
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28 -----
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201101261418_0021_m_000193_0 task's state:UNASSIGNED
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201101261418_0021_m_000193_0 which needs 1 slots
> 2011-01-28 19:46:15,153 INFO org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories for the user ec2-user are already initialized on this TT. Not doing anything.
> 2011-01-28 19:46:15,169 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM: jvm_201101261418_0021_m_456866840
>
> (Note that I see the same "Killing JVM" message several times with different attempts.)
>
> ------------------ jobtracker log ----------------------------
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201101261418_0021_m_000193_0: Task attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_0'
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201101261418_0021_m_000193_0: Task attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
> 2011-01-29 00:46:36,306 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_0'
> 2011-01-29 00:46:36,307 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_0' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196'
> 2011-01-29 00:46:51,376 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201101261418_0021_m_000193
> 2011-01-29 00:46:51,376 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201101261418_0021_m_000193_1' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193'
>
> ----- hadoop-hadoop-tasktracker-ip-10-116-247-252.log.2011-01-28 -----
> 2011-01-29 00:46:31,011 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261418_0021_m_000193_0: Task attempt_201101261418_0021_m_000193_0 failed to report status for 18002 seconds. Killing!
> 2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
> 43 active threads
> Thread 17976 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17975 (JVM Runner jvm_201101261418_0021_m_-646088387 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@19570aa2
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>     org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
> Thread 17947 (Thread-9955):
>   State: WAITING
>   Blocked count: 4
>   Waited count: 6
>   Waiting on java.lang.Object@217b1ead
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 17924 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17922 (JVM Runner jvm_201101261418_0021_m_913485741 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@118b3420
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>     org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
> Thread 17906 (Thread-9934):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 1
>   Waiting on java.lang.Object@4a0b29df
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 17889 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17888 (JVM Runner jvm_201101261418_0021_m_-2031309830 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@762b9c2d
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>     org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
> Thread 17871 (Thread-9914):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 1
>   Waiting on java.lang.Object@291a8af7
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 17283 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 17282 (JVM Runner jvm_201101261418_0021_m_811927398 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@5ff6c2a3
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>     org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
> Thread 17248 (Thread-9561):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 1
>   Waiting on java.lang.Object@41648c8d
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 16493 (process reaper):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.lang.UNIXProcess.waitForProcessExit(Native Method)
>     java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>     java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> Thread 16492 (JVM Runner jvm_201101261418_0021_r_1210798394 spawned.):
>   State: WAITING
>   Blocked count: 1
>   Waited count: 2
>   Waiting on java.lang.UNIXProcess@7994641d
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
>     org.apache.hadoop.util.Shell.runCommand(Shell.java:245)
>     org.apache.hadoop.util.Shell.run(Shell.java:177)
>     org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:376)
>     org.apache.hadoop.mapred.DefaultTaskController.launchTaskJVM(DefaultTaskController.java:70)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:450)
>     org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:439)
> Thread 16491 (Thread-9136):
>   State: WAITING
>   Blocked count: 0
>   Waited count: 1
>   Waiting on java.lang.Object@7d38ea46
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:260)
>     org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:208)
> Thread 7890 (sendParams-3):
>   State: TIMED_WAITING
>   Blocked count: 14
>   Waited count: 50144
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>     java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
>     java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
>     java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
>     java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>     java.lang.Thread.run(Thread.java:619)
> Thread 346 (IPC Client (47) connection to ec2-184-73-96-34.compute-1.amazonaws.com/10.195.194.223:8021 from mapred):
>   State: TIMED_WAITING
>   Blocked count: 128340
>   Waited count: 127868
>   Stack:
>     java.lang.Object.wait(Native Method)
>     org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:645)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:688)
> Thread 10 (taskCleanup):
>   State: WAITING
>   Blocked count: 24
>   Waited count: 32
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4d480ea
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:397)
>     java.lang.Thread.run(Thread.java:619)
> Thread 16 (Thread-7):
>   State: WAITING
>   Blocked count: 2352
>   Waited count: 2782
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1b5b8520
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor(UserLogManager.java:79)
>     org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(UserLogManager.java:46)
> Thread 17 (Thread-8):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 58
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:92)
> Thread 37 (TaskLauncher for REDUCE tasks):
>   State: WAITING
>   Blocked count: 23
>   Waited count: 36
>   Waiting on java.util.LinkedList@158105e8
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2119)
> Thread 36 (TaskLauncher for MAP tasks):
>   State: WAITING
>   Blocked count: 502
>   Waited count: 555
>   Waiting on java.util.LinkedList@49aacd5f
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2119)
> Thread 33 (Map-events fetcher for all reduce tasks on tracker_ip-10-116-247-252.ec2.internal:localhost/127.0.0.1:54196):
>   State: TIMED_WAITING
>   Blocked count: 316394
>   Waited count: 377686
>   Stack:
>     java.lang.Object.wait(Native Method)
>     org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:859)
> Thread 30 (IPC Server handler 7 on 54196):
>   State: WAITING
>   Blocked count: 1140
>   Waited count: 77439
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 29 (IPC Server handler 6 on 54196):
>   State: WAITING
>   Blocked count: 1134
>   Waited count: 77434
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 28 (IPC Server handler 5 on 54196):
>   State: WAITING
>   Blocked count: 1165
>   Waited count: 77436
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 27 (IPC Server handler 4 on 54196):
>   State: WAITING
>   Blocked count: 1131
>   Waited count: 77435
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 26 (IPC Server handler 3 on 54196):
>   State: WAITING
>   Blocked count: 1015
>   Waited count: 77443
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 25 (IPC Server handler 2 on 54196):
>   State: WAITING
>   Blocked count: 1027
>   Waited count: 77437
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 24 (IPC Server handler 1 on 54196):
>   State: WAITING
>   Blocked count: 1083
>   Waited count: 77444
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 23 (IPC Server handler 0 on 54196):
>   State: WAITING
>   Blocked count: 1018
>   Waited count: 77435
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@739abd2b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.ipc.Server$Handler.run(Server.java:1295)
> Thread 20 (IPC Server listener on 54196):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 1
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>     org.apache.hadoop.ipc.Server$Listener.run(Server.java:379)
> Thread 22 (IPC Server Responder):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     org.apache.hadoop.ipc.Server$Responder.run(Server.java:539)
> Thread 21 (Timer thread for monitoring rpc):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 41380
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 19 (Timer thread for monitoring jvm):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 41380
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 18 (Directory/File cleanup thread):
>   State: WAITING
>   Blocked count: 1593
>   Waited count: 2038
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@44aea710
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:130)
> Thread 15 (Timer-0):
>   State: TIMED_WAITING
>   Blocked count: 1
>   Waited count: 6898
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 14 (1720982966@qtp0-0 - Acceptor0 SelectChannelConnector@0.0.0.0:50060):
>   State: RUNNABLE
>   Blocked count: 537
>   Waited count: 43
>   Stack:
>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:429)
>     org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:185)
>     org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>     org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707)
>     org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
> Thread 12 (Timer thread for monitoring mapred):
>   State: TIMED_WAITING
>   Blocked count: 1
>   Waited count: 41382
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.util.TimerThread.mainLoop(Timer.java:509)
>     java.util.TimerThread.run(Timer.java:462)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 1235
>   Waited count: 1226
>   Waiting on java.lang.ref.ReferenceQueue$Lock@309fe84e
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 1344
>   Waited count: 1340
>   Waiting on java.lang.ref.Reference$Lock@524c71d2
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 68838
>   Waited count: 205783
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
>     org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)
>     org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1771)
>     org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1538)
>     org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2280)
>     org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
>
> 2011-01-29 00:46:31,039 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201101261418_0021_m_000193_0
>
> ----------------------------------------------
> 2011-01-29 05:47:09,730 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201101261418_0021_m_000193_1: Task attempt_201101261418_0021_m_000193_1 failed to report status for 18001 seconds. Killing!
> 2011-01-29 05:47:09,731 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_1'
> 2011-01-29 05:47:09,731 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_1' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193'
> 2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201101261418_0021_m_000544
> 2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201101261418_0021_m_000544_0' to tip task_201101261418_0021_m_000544, for tracker 'tracker_ip-10-116-186-223.ec2.internal:localhost/127.0.0.1:37193'
> 2011-01-29 05:47:12,891 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_1'
> 2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201101261418_0021_m_000517_0' has completed task_201101261418_0021_m_000517 successfully.
> 2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201101261418_0021_m_000193
> 2011-01-29 05:47:55,312 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201101261418_0021_m_000193_2' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557'
>
> (Note that I don't have the corresponding tasktracker log from tracker_ip-10-116-186-223.ec2.internal.)
>
> ----------------------------------------------
> 2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201101261418_0021_m_000193_2: Task attempt_201101261418_0021_m_000193_2 failed to report status for 18000 seconds. Killing!
> 2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_2'
> 2011-01-29 10:48:12,869 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_2' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557'
> 2011-01-29 10:48:15,995 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201101261418_0021_m_000722
> 2011-01-29 10:48:15,996 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201101261418_0021_m_000722_0' to tip task_201101261418_0021_m_000722, for tracker 'tracker_ip-10-114-209-180.ec2.internal:localhost/127.0.0.1:52557'
> 2011-01-29 10:48:15,996 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_2'
> 2011-01-29 10:49:35,832 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201101261418_0021_m_000697_0' has completed task_201101261418_0021_m_000697 successfully.
> 2011-01-29 10:49:35,833 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a non-local task task_201101261418_0021_m_000193
> 2011-01-29 10:49:35,833 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201101261418_0021_m_000193_3' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500'
>
> ----------------------------------------------
> 2011-01-29 15:49:56,006 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201101261418_0021_m_000193_3: Task attempt_201101261418_0021_m_000193_3 failed to report status for 18002 seconds. Killing!
> 2011-01-29 15:49:56,006 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_3'
> 2011-01-29 15:49:56,007 INFO org.apache.hadoop.mapred.JobTracker: Adding task (TASK_CLEANUP) 'attempt_201101261418_0021_m_000193_3' to tip task_201101261418_0021_m_000193, for tracker 'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500'
> 2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201101261418_0021_m_000193 has failed 4 times.
> 2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201101261418_0021
> 2011-01-29 15:49:59,070 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201101261418_0021'
> 2011-01-29 15:49:59,072 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_CLEANUP) 'attempt_201101261418_0021_m_001294_0' to tip task_201101261418_0021_m_001294, for tracker 'tracker_ip-10-100-205-180.ec2.internal:localhost/127.0.0.1:45500'
> 2011-01-29 15:49:59,072 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000193_3'
> 2011-01-29 15:50:04,244 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000876_0'
> 2011-01-29 15:50:04,900 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000781_0'
> 2011-01-29 15:50:05,043 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000877_0'
> 2011-01-29 15:50:05,132 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000865_0'
> 2011-01-29 15:50:05,736 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000875_0'
> 2011-01-29 15:50:05,768 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000864_0'
> 2011-01-29 15:50:06,334 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201101261418_0021_m_000866_0'
> .......
>
> That is the end of the log details.
>
> Thank you for your attention.
>
> Tibor
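To make the retry arithmetic in the log above concrete: the job is aborted because a map task is allowed 4 attempts by default. A minimal sketch of a driver that sets this limit and the task timeout explicitly, using the 0.20-era property names that appear in these logs (the job name, paths, and the LongRunningMapper class from the earlier sketch are illustrative):

    // Sketch of a driver wiring up the settings discussed in this thread.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobDriver.class);
        conf.setJobName("biostats");  // illustrative

        // Keep the timeout near the 10-minute default (milliseconds), and
        // rely on reporter.progress() in the mapper instead of a 5-hour value.
        conf.setLong("mapred.task.timeout", 600000L);

        // A task is declared failed after this many failed attempts
        // (default 4), at which point the whole job is aborted --
        // the behavior seen in the jobtracker log above.
        conf.setInt("mapred.map.max.attempts", 4);

        // Optionally allow a percentage of map tasks to fail without
        // killing the job, if partial results are acceptable (default 0).
        conf.setInt("mapred.max.map.failures.percent", 0);

        conf.setMapperClass(LongRunningMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

With the default mapred.max.map.failures.percent of 0, the fourth failure of any single task kills the whole job, which answers the "failing 4 times is the precondition" question above.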
> On Tue, Feb 1, 2011 at 5:33 PM, Koji Noguchi <knogu...@yahoo-inc.com> wrote:
>> (Bcc CDH alias)
>>
>>> Please don't cross-post, CDH questions should go to their user lists.
>>>
>> Was this CDH specific?
>>
>> Did the job show up as failed on the jobtracker web UI?
>> If yes, can you grep the jobtracker log for something like
>>
>> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
>> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201101040441_333049
>> 2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201101040441_333049'
>>
>> which tells you which task failure caused the job to fail.
>> Then you can look at the userlogs of those task attempts to see why they failed.
>>
>> Ideally this info should show up on the web UI.
>>
>> On the other hand, if the job just hangs for hours, there's probably a bug in the framework.
>>
>> Koji
>>
>> On 1/31/11 9:36 PM, "Arun C Murthy" <a...@yahoo-inc.com> wrote:
>>
>>> Please don't cross-post, CDH questions should go to their user lists.
>>>
>>> On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:
>>>
>>>> Hi!
>>>>
>>>> I was running a Hadoop cluster on Amazon EC2 instances, and after 2 days
>>>> of work one of the worker nodes simply died (I cannot connect to the
>>>> instance either). That node also appears on the dfshealth page as a dead
>>>> node. Up to this point everything is normal.
>>>>
>>>> Unfortunately, the job it was running didn't survive. The cluster had 8
>>>> worker nodes, each with 4 mappers and 2 reducers. The job in question had
>>>> ~1200 map tasks and 10 reduce tasks.
>>>> One of the nodes died, and I see around 31 failed attempts in the
>>>> jobtracker log. The log is very similar to the one somebody posted here:
>>>> http://pastie.org/pastes/1270614
>>>>
>>>> Some of the attempts (but not all!) were retried, and I saw at least two
>>>> of them finally reach a successful state.
>>>> The following two lines appear several times in my jobtracker log:
>>>> 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
>>>> 2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.
>>>>
>>>> This pair of log lines could be a signal that the job couldn't be
>>>> finished by rescheduling the failed attempts.
>>>> I have seen nothing special in the namenode logs.
>>>>
>>>> Of course, I reran the failed job and it finished successfully. But my
>>>> problem is that I would like to understand the failover conditions. What
>>>> could be lost, and which part of Hadoop is not fault-tolerant in the
>>>> sense that those warnings appear? Is there a way to control this kind of
>>>> situation?
>>>>
>>>> I am using the CDH3b3 version, so it is a development version of Hadoop.
>>>> Does somebody know about a specific bug or fix that could solve the
>>>> problem in the near future?
>>>>
>>>> Regards,
>>>> Tibor Kiss
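The failed-attempt information Koji suggests grepping for can also be watched programmatically while the job runs; a minimal sketch using the same old mapred API (the driver wiring is assumed from the earlier sketches, and the class name is illustrative):

    // Sketch: poll a running job for failed task attempts instead of
    // grepping the jobtracker log after the fact.
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class JobWatcher {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobWatcher.class);
        // ... input/output/mapper wiring as in the driver sketch above ...
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);

        int eventOffset = 0;
        while (!job.isComplete()) {
          // Fetch the next batch of task-completion events.
          TaskCompletionEvent[] events = job.getTaskCompletionEvents(eventOffset);
          eventOffset += events.length;
          for (TaskCompletionEvent event : events) {
            if (event.getTaskStatus() == TaskCompletionEvent.Status.FAILED) {
              // This is the attempt whose userlogs are worth reading.
              System.err.println("Failed attempt: " + event.getTaskAttemptId());
            }
          }
          Thread.sleep(5000);
        }
        System.out.println("Job successful: " + job.isSuccessful());
      }
    }

This surfaces the same attempt IDs (e.g., attempt_201101261418_0021_m_000193_0 above) as the web UI, without waiting for the job to be aborted.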