Hi,
We have had a lot of these crashes in the past: random jobs were
crashing with error code 134. Our environment is also linux-amd64. We
tried all sorts of Hadoop versions and JVM deployments, but none of it
had any positive effect.
We finally figured out it was a deep-rooted hardware problem.
Communication between different cores of the CPU could get corrupted
once in a while, due to a bad combination of the mainboard, CPU and/or
memory. In our case the problem was solved by replacing all mainboards.
We could pinpoint and reproduce the problem using the following bash
command (run as root):
while /bin/true; do taskset -c 0 echo -ne
'\02...@\0306\0256yy\0210\0304\0004\0327a\0024\0343\0034\0252\0016v\r\0232\0024\0334\0233\0333\0356\0311a\0367\0375ewgkk\0253\0373\0351\007%'
| taskset -c 2 hexdump -b; done | grep 0000020 | grep -v 351
If you see any output on the console, it means your hardware is
affected. If you see no output for several minutes (or perhaps an
hour), your machine is unlikely to be broken.
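The idea behind that one-liner can be sketched more readably. This is a
hedged rewrite, not the original repro: the byte pattern, iteration count,
and core numbers below are my own assumptions, and you should adjust the
cores to your machine's topology.

```shell
#!/bin/sh
# Sketch of a cross-core pipe test: pin a producer to one core and a
# checksummer to another with taskset, pipe a fixed pattern across, and
# report any run whose checksum deviates. On healthy hardware there
# should be no mismatches.
command -v taskset >/dev/null 2>&1 || { echo "taskset not available"; exit 0; }
# Fall back to core 0 for both ends on single-core machines.
ncpu=$(nproc 2>/dev/null || echo 1)
[ "$ncpu" -ge 2 ] && dst=1 || dst=0
pattern=$(printf 'deadbeef%.0s' $(seq 1 1000))
expected=$(printf '%s' "$pattern" | sha256sum | awk '{print $1}')
i=0
while [ "$i" -lt 50 ]; do
    got=$(printf '%s' "$pattern" | taskset -c 0 cat | taskset -c "$dst" sha256sum | awk '{print $1}')
    [ "$got" = "$expected" ] || echo "corruption detected on iteration $i"
    i=$((i + 1))
done
echo "completed $i iterations"
```

Like the original, a run that prints nothing but the final line for an
extended period suggests the machine is fine; any "corruption detected"
line is a red flag.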
Hope this helps.
Ferdy
zward3x wrote:
Thanks for all help.
Will install u17; hope that will resolve the issue.
Jean-Daniel Cryans-2 wrote:
As I feared, you use the unholy u18... please revert to u17.
See this thread for more information:
http://www.mail-archive.com/common-u...@hadoop.apache.org/msg04633.html
J-D
On Sun, Mar 7, 2010 at 1:32 PM, zward3x <pasalic.zahar...@gmail.com>
wrote:
$ java -version
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
there is nothing in stderr, but here is part from stdout
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b19ef8cc34e, pid=12633, tid=1104492864
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x2de34e]
#
# An error report file with more information is saved as:
#
/hadoop/mapred/local/taskTracker/jobcache/job_201003072002_0002/attempt_201003072002_0002_r_000019_0/work/hs_err_pid12633.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
Also, the file mentioned above (hs_err_pid12633.log) does not exist.
Jean-Daniel Cryans-2 wrote:
I'm using hadoop 0.20.1 and hbase 0.20.3.
Sorry I meant java version.
I already tried putting
-XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log
in hadoop-env.sh via HADOOP_OPTS, but after the reduce crash I did not
find any file at that path.
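For reference, that setting is typically wired up like this (a sketch using
the path from the message above; the hadoop-env.sh layout may differ across
versions):

```shell
# In conf/hadoop-env.sh -- a sketch. %p expands to the crashing JVM's pid.
# The target directory must already exist and be writable by the task's
# user, or HotSpot falls back to writing hs_err_pid<pid>.log in the
# working directory instead.
export HADOOP_OPTS="$HADOOP_OPTS -XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log"
```

One possible reason no file appeared: in Hadoop 0.20 the child task JVMs
take their flags from mapred.child.java.opts rather than HADOOP_OPTS, so an
ErrorFile set only in HADOOP_OPTS may never reach the reduce task's JVM.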
Todd doesn't talk about that, he said:
Generally along with a nonzero exit code you should see something in
the stderr for that attempt. If you look on the TaskTracker inside
logs/userlogs/attempt_<the failed attempt>/stderr do you see anything
useful?
--
View this message in context:
http://old.nabble.com/Task-process-exit-with-nonzero-status-of-134...-tp27814144p27814802.html
Sent from the HBase User mailing list archive at Nabble.com.