Hi there,
I am running the CDH3b2 distribution of Hadoop, and it had been working for several weeks. Last night all tasks began to fail; not a single job has finished successfully since.

Here is the relevant information. The jobtracker .out file is empty.
I suspected an out-of-memory error, because "top" shows RES at 2G while I only gave it 1G on the command line, but I searched the log and found no such error message.


Can anybody give me a clue about how to find the cause and how to fix it?

Jimmy


Here is the record from the jobtracker log:

g job_201010011833_72867
2010-10-13 17:00:55,003 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201010011833_72860_r_000001_0: java.lang.Throwable: Child Error
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:472)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:459)
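
Regarding the exit status: 126 is the value a POSIX shell conventionally reports when a command is found but cannot be executed (for example, a file with no execute bit). Whether that is what happens to the child task here is only a guess. A minimal sketch that reproduces the status outside Hadoop; the function name status_of_nonexecutable is made up for illustration:

```shell
# Print the exit status the shell reports when asked to run a file
# that exists but has no execute permission.
# (status_of_nonexecutable is a hypothetical helper, not part of Hadoop.)
status_of_nonexecutable() {
    tmp=$(mktemp) || return 1
    printf '#!/bin/sh\nexit 0\n' > "$tmp"
    chmod 644 "$tmp"                  # readable, but no execute bit
    rc=0
    "$tmp" 2>/dev/null || rc=$?       # try to run it; capture the status
    rm -f "$tmp"
    echo "$rc"
}
```

Calling status_of_nonexecutable prints 126 in bash and dash.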






[had...@m0002041 logs]$ grep attempt_201010011833_63031_m_000020_3 hadoop-hadoop-jobtracker-m0002041.log
2010-10-13 00:04:56,361 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201010011833_63031_m_000020_3' to tip task_201010011833_63031_m_000020, for tracker 'tracker_m0002014.ppops.net:localhost.localdomain/127.0.0.1:40961'
2010-10-13 00:04:59,365 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201010011833_63031_m_000020_3: java.lang.Throwable: Child Error
2010-10-13 00:05:02,423 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201010011833_63031_m_000020_3' from 'tracker_m0002014.ppops.net:localhost.localdomain/127.0.0.1:40961'



Here is the result from the tasktracker:

[r...@m0002014 logs]# grep attempt_201010011833_63031_m_000020_3 hadoop-hadoop-tasktracker-m0002014.log
2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201010011833_63031_m_000020_3 task's state:UNASSIGNED
2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201010011833_63031_m_000020_3
2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201010011833_63031_m_000020_3
2010-10-13 00:04:56,449 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3Child Error
2010-10-13 00:04:59,455 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3 done; removing files.




2010-10-13 00:04:56,449 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3Child Error
java.io.IOException: Task process exit with nonzero status of 126.
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:459)
2010-10-13 00:04:56,449 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201010011833_63031_m_-772401408 exited. Number of tasks it ran: 0


Here is the relevant startup command:

hadoop 11874 1 6 Oct01 ? 18:09:24 /usr/java/latest/bin/java -Xmx1000m -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+UseCompressedOops -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -Dcom.sun.management.jmxremote -Xmx4G -Dcom.sun.management.jmxremote.port=8008 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hadoop/logs/gc-jobtracker.log -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.f
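
Since the question is whether the heap limit on the command line matches what top reports, it may help to list every -Xmx flag the running process actually received. Note that the command pasted above contains both -Xmx1000m and -Xmx4G. A small sketch, assuming a Linux /proc filesystem; list_xmx is a name made up here:

```shell
# Print every -Xmx flag found in a command line read from standard input.
# Arguments may be separated by spaces (ps output) or NUL bytes
# (/proc/PID/cmdline); both are turned into newlines before filtering.
# (list_xmx is a hypothetical helper name.)
list_xmx() {
    tr ' \000' '\n\n' | grep -- '^-Xmx'
}

# Possible usage against the jobtracker process (PID 11874 in the ps line):
#   list_xmx < /proc/11874/cmdline
#   ps -o args= -p 11874 | list_xmx
```

If more than one flag is printed, the effective heap limit may not be the one that was intended.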

top result:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1874 hadoop    23   0 4826m 2.0g  10m S 10.0 25.3   1089:29 java



top -H result:
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAN
11973 hadoop    15   0 4826m 2.0g  10m S  8.3 25.3  35:08.46 java
11974 hadoop    15   0 4826m 2.0g  10m S  0.7 25.3  35:18.06 java
12029 hadoop    16   0 4826m 2.0g  10m S  0.3 25.3   3:21.17 java
12538 hadoop    16   0 4826m 2.0g  10m S  0.3 25.3   7:18.12 java
12539 hadoop    15   0 4826m 2.0g  10m S  0.3 25.3   0:59.08 java
11874 hadoop    18   0 4826m 2.0g  10m S  0.0 25.3   0:00.00 java
11906 hadoop    23   0 4826m 2.0g  10m S  0.0 25.3   0:01.16 java
11907 hadoop    16   0 4826m 2.0g  10m S  0.0 25.3   4:55.39 java
11908 hadoop    16   0 4826m 2.0g  10m S  0.0 25.3   4:55.71 java


Here is the error count per daily log file, showing the trend:

[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log | wc
113559 1135590 17374527
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-12 | wc
111368 1113680 17039304
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-11 | wc
  5163   51638  790076
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-10 | wc
    29     316    4850
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-09 | wc
    35     412    7492
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-08 | wc
    26     346    5088
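
The same per-file counts can be collected in one pass with a small loop. This is only a sketch; count_errors is a made-up helper name, and it reports matching lines only (the first column of the wc output above):

```shell
# Print "<count> <file>" for each log file given, where <count> is the
# number of lines containing "error" in any letter case.
# (count_errors is a hypothetical helper name.)
count_errors() {
    for f in "$@"; do
        # grep -c prints the number of matching lines; -i ignores case.
        printf '%s %s\n' "$(grep -ci error "$f")" "$f"
    done
}

# Possible usage, run from the logs directory:
#   count_errors hadoop-hadoop-jobtracker-m0002041.log*
```

Using grep -c avoids the extra wc step and keeps the file name next to each count.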




