Hi Todd et al.,
Coming back to this again, I'd like to present the solution we found and
confirm that a JVM bug was indeed the cause of the exit code 134 reported
by the TaskRunner.
First of all, we had to configure Hadoop to start its JVMs with the
following parameter:
-XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log
This was necessary because, by default, the JVM writes this error log
(hs_err_pid<pid>.log) into its current working directory, which in this
case was the Hadoop task working directory. That directory, however, is
removed when the job fails or completes.
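As a minimal sketch of passing this flag on to the task JVMs, the helper below appends it to an existing options string and drops any earlier -XX:ErrorFile setting. It assumes the options reach the child JVMs as one whitespace-separated string, as with the mapred.child.java.opts property; the class and method names are hypothetical. The %p placeholder is expanded by the JVM to its process id, so concurrent task JVMs write to separate files.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: adds -XX:ErrorFile=... to a child-JVM options string,
 * assuming the options are a single whitespace-separated string.
 */
public class ChildJvmOpts {

    /**
     * Returns existingOpts with any previous -XX:ErrorFile=... removed and a new
     * one appended that points into logDir. %p is left for the JVM to expand.
     */
    public static String withErrorFile(String existingOpts, String logDir) {
        List<String> opts = new ArrayList<>();
        if (existingOpts != null) {
            for (String opt : existingOpts.trim().split("\\s+")) {
                // Skip empty tokens and any earlier ErrorFile setting.
                if (!opt.isEmpty() && !opt.startsWith("-XX:ErrorFile=")) {
                    opts.add(opt);
                }
            }
        }
        // Normalize a trailing slash so we do not produce "dir//file".
        String dir = logDir.endsWith("/") ? logDir.substring(0, logDir.length() - 1) : logDir;
        opts.add("-XX:ErrorFile=" + dir + "/java_error%p.log");
        return String.join(" ", opts);
    }

    public static void main(String[] args) {
        System.out.println(withErrorFile("-Xmx200m", "/opt/hadoop/hadoop/logs/java"));
    }
}
```

The directory has to exist and be writable by the user running the tasks, and it should live outside the task working directory so it survives job cleanup.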
The java error logfile pointed us to a specific class and method that
kept crashing the JVM, namely:
DefaultSDContextGenerator.previousSpaceIndex(CharSequence, int): int
We eventually googled for this specific class and method, and lo and
behold, found this:
http://sourceforge.net/tracker/?func=detail&aid=2793972&group_id=3368&atid=103368
Apparently, this specific class and method had triggered JVM crashes for
other users as well. We implemented the workaround code and the trouble
with exit code 134 was finally gone.
In the comments on that page, someone posted a code snippet that
reproduces the JVM crash. I have not yet confirmed whether it has also
been reported to Sun.
Cheers,
Chris
Todd Lipcon wrote:
Hi Christian,
Generally, along with a nonzero exit code, you should see something in
the stderr output for that attempt. If you look on the TaskTracker
inside logs/userlogs/attempt_<the failed attempt>/stderr, do you see
anything useful?
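A minimal sketch of that lookup, assuming the logs/userlogs/<attempt id>/stderr layout described above and that it runs on the TaskTracker host. The class name and the attempt id used in any example are hypothetical. It returns an empty list when the attempt left no stderr file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

/** Hypothetical tool: prints the stderr captured for one task attempt. */
public class AttemptStderr {

    /** Resolves <hadoopLogDir>/userlogs/<attemptId>/stderr. */
    public static Path stderrPath(Path hadoopLogDir, String attemptId) {
        return hadoopLogDir.resolve("userlogs").resolve(attemptId).resolve("stderr");
    }

    /** Returns the file's lines, or an empty list if the file does not exist. */
    public static List<String> readStderr(Path hadoopLogDir, String attemptId) throws IOException {
        Path p = stderrPath(hadoopLogDir, attemptId);
        if (!Files.isRegularFile(p)) {
            return Collections.emptyList();
        }
        // ISO-8859-1 decodes any byte sequence, so odd bytes never abort the read.
        return Files.readAllLines(p, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: AttemptStderr <hadoop log dir> <attempt id>");
            System.exit(2);
        }
        for (String line : readStderr(Paths.get(args[0]), args[1])) {
            System.out.println(line);
        }
    }
}
```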
If it's a segfault or a Linux OOM kill, you should also see something
in your system's kernel log. Check "dmesg" and/or /var/log/kern.log
for anything suspicious-looking.
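The same check can be scripted. The sketch below filters a kernel log for lines that commonly accompany a segfault or an OOM kill. The patterns are heuristic assumptions, since exact kernel wording varies between versions; the class name is hypothetical, and /var/log/kern.log is only the default used when no path is given.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical tool: lists kernel log lines hinting at segfaults or OOM kills. */
public class KernelLogScan {

    // Heuristic patterns only; kernel message wording differs across versions.
    private static final Pattern SUSPICIOUS = Pattern.compile(
            "segfault|Out of memory|Killed process|oom-killer", Pattern.CASE_INSENSITIVE);

    /** Returns every line from the reader that matches one of the patterns. */
    public static List<String> suspiciousLines(BufferedReader reader) throws IOException {
        List<String> hits = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (SUSPICIOUS.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "/var/log/kern.log";
        try (BufferedReader in = Files.newBufferedReader(Paths.get(file), StandardCharsets.ISO_8859_1)) {
            for (String hit : suspiciousLines(in)) {
                System.out.println(hit);
            }
        }
    }
}
```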
Hope that helps
-Todd
On Tue, Jul 21, 2009 at 2:15 AM, Christian Kirschbaum wrote:
Hi all,
We're using Hadoop 0.19.1 and have recently encountered the
following intermittent problem when running jobs involving UIMA text
annotation chains (which fail frequently because of it):
java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
As you can see, this is raised inside Hadoop code, without the
actual MapReduce job being able to react to it. Unfortunately,
the exception message says little about the actual cause, which I
have yet to track down.
All I have found out is that this status code is apparently the exit
code of a separate process started through
org.apache.hadoop.util.Shell.ShellCommandExecutor in the
runChild(JvmEnv) method of org.apache.hadoop.mapred.JvmManager.
And because it is exit code 134 (128 + 6), presumably signal 6
(SIGABRT) caused the process to terminate, which may indicate a
core dump?
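The 128 + 6 arithmetic follows the usual shell convention that a status above 128 means "terminated by signal (status - 128)". A minimal sketch of that decoding, with Linux signal numbers and only a few names filled in (class name hypothetical). It is a heuristic: a program that simply calls exit(134) produces the same status.

```java
/**
 * Hypothetical helper decoding shell-style exit statuses.
 * Signal numbers are the Linux values; only a few names are listed.
 */
public class ExitStatus {

    /** Returns the signal number encoded in the status, or -1 for a normal exit. */
    public static int signalOf(int status) {
        return (status > 128 && status < 256) ? status - 128 : -1;
    }

    public static String signalName(int signal) {
        switch (signal) {
            case 6:  return "SIGABRT";
            case 9:  return "SIGKILL";
            case 11: return "SIGSEGV";
            case 15: return "SIGTERM";
            default: return "signal " + signal;
        }
    }

    public static String describe(int status) {
        int sig = signalOf(status);
        if (sig < 0) {
            return "exited normally with status " + status;
        }
        return "terminated by " + signalName(sig) + " (" + status + " = 128 + " + sig + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(134)); // terminated by SIGABRT (134 = 128 + 6)
    }
}
```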
How do I find out more about the actual cause? Is there a hidden
logfile for the separately spawned JVM process? I've looked
through various logs and userlogs directories but could not find
any mention of this exception there.
Any help is appreciated.
Thanks,
Chris
--
Christian Kirschbaum
Software Developer
--------------------------------------------------------
vionto GmbH
Karl-Marx-Allee 90a, D-10243 Berlin
fon +49 30 40 20 329 - 27
fax +49 30 40 20 329 - 01
web http://www.vionto.com
--------------------------------------------------------
Geschäftsführer: Ralf von Grafenstein, Dr. Martin Hirsch
Sitz der Gesellschaft: Berlin
Amtsgericht Berlin Charlottenburg, HRB 108154B
--------------------------------------------------------