Hi there,
I am running the CDH3b2 distribution of Hadoop, and it has been working for
several weeks. Last night all tasks began to fail; not a single job
finished successfully.
Here is the relevant information: the jobtracker .out file is empty.
I suspected an out-of-memory error, because "top" shows RES at 2G
while my command line only gives the JVM 1G, but I searched the log and that
error message does not appear.
Can anybody give me a clue on how to find the cause and how to fix this?
Jimmy
Here is the record from the jobtracker:
g job_201010011833_72867
2010-10-13 17:00:55,003 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201010011833_72860_r_000001_0: java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:472)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:459)
[had...@m0002041 logs]$ grep attempt_201010011833_63031_m_000020_3 hadoop-hadoop-jobtracker-m0002041.log
2010-10-13 00:04:56,361 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201010011833_63031_m_000020_3' to tip task_201010011833_63031_m_000020, for tracker 'tracker_m0002014.ppops.net:localhost.localdomain/127.0.0.1:40961'
2010-10-13 00:04:59,365 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201010011833_63031_m_000020_3: java.lang.Throwable: Child Error
2010-10-13 00:05:02,423 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201010011833_63031_m_000020_3' from 'tracker_m0002014.ppops.net:localhost.localdomain/127.0.0.1:40961'
Here is the result from the tasktracker:
[r...@m0002014 logs]# grep attempt_201010011833_63031_m_000020_3 hadoop-hadoop-tasktracker-m0002014.log
2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201010011833_63031_m_000020_3 task's state:UNASSIGNED
2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201010011833_63031_m_000020_3
2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201010011833_63031_m_000020_3
2010-10-13 00:04:56,449 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3Child Error
2010-10-13 00:04:59,455 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3 done; removing files.
2010-10-13 00:04:56,449 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3Child Error
java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:459)
2010-10-13 00:04:56,449 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201010011833_63031_m_-772401408 exited. Number of tasks it ran: 0
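A side note on the "nonzero status of 126" in the traces above: in POSIX shells, 126 is the status conventionally reported when a command is found but cannot be executed, for example a script without execute permission. Whether that is what happened to the child task here is only an assumption; the sketch below just reproduces the status code with a throwaway file (the file name is hypothetical).

```shell
# Demonstration only: a file that exists but lacks execute permission.
tmpdir=$(mktemp -d)
script="$tmpdir/not_executable.sh"   # hypothetical name, for illustration
printf '#!/bin/sh\necho hello\n' > "$script"
chmod 644 "$script"

# Invoking it directly fails; capture the status the shell reports.
status=0
"$script" 2>/dev/null || status=$?
echo "exit status: $status"

rm -rf "$tmpdir"
```

If the status printed matches the one in the tasktracker log, permissions and mount options (such as noexec) on the directories the tasktracker launches from would be one place to look.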
Here is the relevant startup command:
hadoop 11874 1 6 Oct01 ? 18:09:24 /usr/java/latest/bin/java -Xmx1000m -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+UseCompressedOops -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -Dcom.sun.management.jmxremote -Xmx4G -Dcom.sun.management.jmxremote.port=8008 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hadoop/logs/gc-jobtracker.log -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.f
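The command line above carries two heap flags, -Xmx1000m and later -Xmx4G. Here is a minimal sketch for listing every -Xmx flag in a command line in order of appearance. The function name `list_xmx` and the shortened sample string are mine, for illustration. I am assuming, without having confirmed it on this JVM version, that HotSpot honors the last -Xmx it sees, which would make the effective limit 4G rather than 1G.

```shell
# List every -Xmx flag in a command line, in order of appearance.
list_xmx() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -- '^-Xmx'
}

# Shortened sample taken from the ps output; not the full command line.
cmdline='/usr/java/latest/bin/java -Xmx1000m -XX:+UseConcMarkSweepGC -Xmx4G -verbose:gc'

# All heap flags, then the last one (assumed to be the one that takes effect).
list_xmx "$cmdline"
list_xmx "$cmdline" | tail -n 1
```

On a live process, the same function could be fed the arguments column of ps for the jobtracker PID.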
top result:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1874 hadoop 23 0 4826m 2.0g 10m S 10.0 25.3 1089:29 java
top -H result:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAN
11973 hadoop 15 0 4826m 2.0g 10m S 8.3 25.3 35:08.46 java
11974 hadoop 15 0 4826m 2.0g 10m S 0.7 25.3 35:18.06 java
12029 hadoop 16 0 4826m 2.0g 10m S 0.3 25.3 3:21.17 java
12538 hadoop 16 0 4826m 2.0g 10m S 0.3 25.3 7:18.12 java
12539 hadoop 15 0 4826m 2.0g 10m S 0.3 25.3 0:59.08 java
11874 hadoop 18 0 4826m 2.0g 10m S 0.0 25.3 0:00.00 java
11906 hadoop 23 0 4826m 2.0g 10m S 0.0 25.3 0:01.16 java
11907 hadoop 16 0 4826m 2.0g 10m S 0.0 25.3 4:55.39 java
11908 hadoop 16 0 4826m 2.0g 10m S 0.0 25.3 4:55.71 java
Here is the error count trend:
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log | wc
113559 1135590 17374527
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-12 | wc
111368 1113680 17039304
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-11 | wc
5163 51638 790076
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-10 | wc
29 316 4850
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-09 | wc
35 412 7492
[had...@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-08 | wc
26 346 5088
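The per-day counts above can also be collected in one pass. This is a rough sketch; the function name `count_errors` is made up for illustration, and the directory and file prefix in the usage comment are assumptions based on the paths shown in this post.

```shell
# Print "<matching line count><TAB><file name>" for each log file whose name
# starts with the given prefix. grep -c reports the number of matching lines,
# which corresponds to the first column of the wc output.
count_errors() {
    log_dir=$1
    prefix=$2
    for f in "$log_dir"/"$prefix"*; do
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$(grep -ci error "$f")" "$(basename "$f")"
    done
}

# Example (paths assumed from this post):
# count_errors /home/hadoop/hadoop/logs hadoop-hadoop-jobtracker-m0002041.log
```

Listing all days side by side makes it easier to see on which day the count jumps.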