Hi,

We have had a lot of these crashes in the past: random jobs crashing with error code 134 (which is 128 + SIGABRT, i.e. the JVM aborted). Our environment is also linux-amd64. We tried all sorts of Hadoop versions and JVM builds, but none of it had any positive effect.

We finally figured out it was a deep-rooted hardware problem: communication between different cores of the CPU could get corrupted every once in a while, due to a bad combination of mainboard, CPU, and/or memory. In our case the problem was solved by replacing all the mainboards.

We could pinpoint and reproduce the problem using the following bash command (run as root):

while /bin/true; do taskset -c 0 echo -ne '\02...@\0306\0256yy\0210\0304\0004\0327a\0024\0343\0034\0252\0016v\r\0232\0024\0334\0233\0333\0356\0311a\0367\0375ewgkk\0253\0373\0351\007%' | taskset -c 2 hexdump -b; done | grep 0000020 | grep -v 351

If you see any output on the console, it means your hardware is affected. If you see no output for several minutes (or, to be thorough, an hour), your machine is unlikely to be broken.
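For readers who want to see the shape of the test without the full byte blob, here is a hedged sketch of what the loop above is doing, assuming util-linux `taskset` and POSIX `od` (the pattern is a stand-in, the loop is bounded so it terminates, and `pin` falls back to unpinned execution where `taskset` is unavailable — the real probe runs indefinitely as root):

```shell
# Sketch of the cross-core corruption probe: a producer is pinned to one
# core and a consumer to another; if the consumer's octal dump of the
# same bytes ever differs from the reference dump, data was corrupted in
# transit between the cores. od -b is the portable cousin of hexdump -b.
pin() {
  cpu="$1"; shift
  if command -v taskset >/dev/null 2>&1; then
    taskset -c "$cpu" "$@"
  else
    "$@"    # no taskset on this system; run unpinned
  fi
}

pattern='example-test-pattern'   # stand-in for the byte blob above
expected="$(printf '%s' "$pattern" | pin 1 od -b)" || true

i=0
while [ "$i" -lt 100 ]; do       # bounded here; loop for hours in practice
  got="$(pin 0 printf '%s' "$pattern" | pin 1 od -b)" || true
  [ "$got" = "$expected" ] || echo "MISMATCH at iteration $i"
  i=$((i + 1))
done
```

Any `MISMATCH` line corresponds to the grep hits in the original one-liner.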

Hope this is of some help to you.

Ferdy

zward3x wrote:
Thanks for all the help.

Will install u17; hope that this will resolve the issue.



Jean-Daniel Cryans-2 wrote:
As I feared, you're using the unholy u18... please revert to u17.

See this thread for more information:
http://www.mail-archive.com/common-u...@hadoop.apache.org/msg04633.html

J-D

On Sun, Mar 7, 2010 at 1:32 PM, zward3x <pasalic.zahar...@gmail.com>
wrote:
$ java -version
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

there is nothing in stderr, but here is part from stdout

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b19ef8cc34e, pid=12633, tid=1104492864
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x2de34e]
#
# An error report file with more information is saved as:
#
/hadoop/mapred/local/taskTracker/jobcache/job_201003072002_0002/attempt_201003072002_0002_r_000019_0/work/hs_err_pid12633.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

Also, the file mentioned above (hs_err_pid12633.log) does not exist.
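One plausible explanation for the missing file: HotSpot writes hs_err logs to the JVM's working directory by default, and the TaskTracker cleans up an attempt's work directory after the task exits, so the crash log may be deleted along with it. A hedged sketch of searching longer-lived locations for it (the directories are examples; substitute your own):

```shell
# Look for stray HotSpot crash logs outside the (now deleted) attempt
# work directory. search_dirs can be overridden from the environment;
# the defaults below are illustrative paths, not canonical ones.
search_dirs="${search_dirs:-/hadoop/mapred/local /opt/hadoop/hadoop/logs /tmp}"
# shellcheck disable=SC2086  # intentional word splitting of the dir list
find $search_dirs -name 'hs_err_pid*.log' 2>/dev/null || true
```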



Jean-Daniel Cryans-2 wrote:
I'm using Hadoop 0.20.1 and HBase 0.20.3
Sorry, I meant the Java version.

I already tried putting

-XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log

in hadoop-env.sh as HADOOP_OPTS, but after the reduce crash I did not
find any file at that path.
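One possible reason the setting never took effect: HADOOP_OPTS applies to the Hadoop daemons, while map/reduce task JVMs of this Hadoop vintage are launched with the options in mapred.child.java.opts, so an ErrorFile flag set only in hadoop-env.sh may never reach the crashing reduce task. A hedged sketch of a mapred-site.xml entry (the -Xmx value and path are examples; the directory must already exist and be writable by the user running the tasks, or HotSpot may fall back to the working directory):

```xml
<!-- Illustrative fragment for mapred-site.xml, not a verified fix -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log</value>
</property>
```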
Todd doesn't talk about that; he said:

Generally along with a nonzero exit code you should see something in
the stderr for that attempt. If you look on the TaskTracker inside
logs/userlogs/attempt_<the failed attempt>/stderr do you see anything
useful?
--
View this message in context:
http://old.nabble.com/Task-process-exit-with-nonzero-status-of-134...-tp27814144p27814802.html
Sent from the HBase User mailing list archive at Nabble.com.


