Hi Todd et al.,

Coming back to this: we found a solution, and a JVM bug was indeed the cause of the exit code 134 we were seeing on the TaskRunner.

First of all, we had to configure the Hadoop subsystem to start with the following parameter:

-XX:ErrorFile=/opt/hadoop/hadoop/logs/java/java_error%p.log

This was necessary because, by default, the JVM writes its fatal error log to the current working directory, which in this case was the Hadoop task working directory. That directory, however, is removed when the job fails or completes, so the log was gone before we could read it.
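
To illustrate, here is a minimal sketch of how the option string for the task JVMs could be put together. The class and method names are made up for this example. It also assumes that the resulting string ends up in the mapred.child.java.opts setting, which I have not spelled out above and which may differ in your setup. HotSpot replaces %p in the file name with the process id.

```java
public class ChildJvmOptions {

    // Appends -XX:ErrorFile to an existing JVM option string,
    // unless an ErrorFile option is already present.
    // logDir is assumed to exist and to survive task cleanup.
    static String withErrorFile(String existingOpts, String logDir) {
        String opts = existingOpts == null ? "" : existingOpts.trim();
        if (opts.contains("-XX:ErrorFile=")) {
            return opts;
        }
        // %p is expanded by the JVM to the pid of the crashing process.
        String flag = "-XX:ErrorFile=" + logDir + "/java_error%p.log";
        return opts.isEmpty() ? flag : opts + " " + flag;
    }

    public static void main(String[] args) {
        // Assumption: this value would be used for mapred.child.java.opts.
        System.out.println(withErrorFile("-Xmx200m", "/opt/hadoop/hadoop/logs/java"));
    }
}
```

The point is only that the log directory lies outside the task working directory, so the error file is still there after the task attempt has been cleaned up.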

The Java error log pointed us to a specific class and method that kept crashing the JVM, namely: DefaultSDContextGenerator.previousSpaceIndex(CharSequence, int): int

We eventually googled for this specific class and method, and lo and behold, found this:
http://sourceforge.net/tracker/?func=detail&aid=2793972&group_id=3368&atid=103368

Apparently, this class and method had triggered JVM crashes for other users as well. We implemented the workaround described there, and the trouble with exit code 134 was finally gone.

In the comments on that page, someone posted a code snippet that reproduces the JVM crash. I have not yet confirmed whether it has been reported to Sun as well.

Cheers,
Chris


Todd Lipcon wrote:
Hi Christian,

Generally along with a nonzero exit code you should see something in the stderr for that attempt. If you look on the TaskTracker inside logs/userlogs/attempt_<the failed attempt>/stderr do you see anything useful?

If it's a segfault or a Linux OOM kill, you should also see something in your system's kernel log. Check "dmesg" and/or /var/log/kern.log for anything suspicious-looking.

Hope that helps
-Todd

On Tue, Jul 21, 2009 at 2:15 AM, Christian Kirschbaum <[email protected]> wrote:

    Hi all,

    we're using Hadoop 0.19.1 and have recently encountered the
    following erratic problem when running jobs involving UIMA text
    annotation chains (which fail frequently because of this):

    java.io.IOException: Task process exit with nonzero status of 134.
           at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)


    As you can see, this is propagated in Hadoop code, without the
    actual MapReduce job being able to react to it. Unfortunately,
    this exception message isn't very descriptive as to the actual
    cause which I have yet to track down.

    All I found out is that this status code apparently is an exit
    code of a separate process initiated through
    org.apache.hadoop.util.Shell.ShellCommandExecutor in the
    runChild(JvmEnv) method of org.apache.hadoop.mapred.JvmManager.
    And because it is exit code 134 (128 + 6), presumably signal 6
    (SIGABRT) terminated the process, which may indicate a core
    dump?
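
For reference, the 128 + 6 arithmetic above can be sketched as follows. This is only an illustration: the class and method names are invented, the 128 + n rule is the usual shell convention rather than anything Hadoop-specific, and the signal names are the Linux numbering. A process can also return a value above 128 on its own, so the result is a hint, not proof.

```java
public class ExitStatus {

    // Returns the signal number implied by the 128 + n convention,
    // or 0 if the exit code does not fall into that range.
    static int signalOf(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : 0;
    }

    // Human-readable description for a few common Linux signals.
    static String describe(int exitCode) {
        int sig = signalOf(exitCode);
        if (sig == 0) {
            return "exit code " + exitCode + " (no signal)";
        }
        String name;
        switch (sig) {
            case 6:  name = "SIGABRT"; break;
            case 9:  name = "SIGKILL"; break;
            case 11: name = "SIGSEGV"; break;
            default: name = "signal " + sig;
        }
        return "exit code " + exitCode + " = 128 + " + sig + " (" + name + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(134));
    }
}
```

SIGABRT fits a JVM crash: after writing its fatal error log, HotSpot calls abort(), which is why the error file is the place to look.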

    How do I find out more about the actual cause? Is there any secret
    logfile for the separately spawned JVM process? I've looked
    through various logs and userlogs directories but could not find
    any mention of this exception there.

    Any help is appreciated.

    Thanks,
    Chris





--
Christian Kirschbaum
Software Developer
--------------------------------------------------------
vionto GmbH
Karl-Marx-Allee 90a, D-10243 Berlin

phone +49 30 40 20 329 - 27
fax   +49 30 40 20 329 - 01
web   http://www.vionto.com
--------------------------------------------------------
Managing Directors: Ralf von Grafenstein, Dr. Martin Hirsch
Registered office: Berlin
Commercial register: Amtsgericht Berlin Charlottenburg, HRB 108154B
--------------------------------------------------------
