I'm sorry these issues came up; sometimes it's the hardware that lets you down.
Somewhat related, but recommended reading for later (when the issues are
settled) over a beer or two:
http://labs.google.com/papers/disk_failures.pdf
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, April 20, 2007 8:50:00 PM
Subject: Hardware Crashes and Garbage Collection on Nutch/Hadoop
So we moved 50 machines to a data center for a beta cluster of a new
search engine based on Nutch and Hadoop. We fired up all of the
machines and started fetching, and almost immediately began seeing JVM
crashes and checksum/IO errors that would cause jobs to fail, tasks to
fail, and random data corruption. After digging through and fixing the
problems, we came up with some observations that may seem obvious but
may also help someone else avoid the same problems.
First, checksum errors are not a normal occurrence. If you are
experiencing a lot of checksum errors or data corruption, you very well
could have a hardware problem. As developers, we often tend to think
the "problem" is somewhere in our code. Sometimes that is just
not the case. With systems such as Nutch and Hadoop that can put
extreme load on machines and keep it sustained for long periods of time,
hardware problems will surface. When debugging problems, don't limit
yourself to just the software. If you think hardware might be bad, you
can use programs such as badblocks and memtest to test it out. In our
cluster we had 2 machines out of 50 that had hardware problems. When
they were removed from the cluster most of the errors disappeared.
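As a rough sketch, a non-destructive disk surface scan and a userland
memory test look something like the following (the device name and
memory size are just examples; memtester is the userland cousin of the
bootable memtest86):

  # read-only surface scan of a suspect disk (non-destructive)
  badblocks -sv /dev/sda

  # lock and repeatedly test 1024MB of RAM for 5 passes
  memtester 1024M 5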
Second, running Nutch and Hadoop in a distributed environment means
that jobs share data across machines. So if a task tracker or data
node keeps failing on a single machine (or a single set of machines),
that machine could have hardware problems. If you continually get
checksum errors across many machines and it seems random, it probably
isn't. More than likely it is caused by a single machine or set of
machines with hardware problems sharing their data (e.g. map outputs)
with other machines. The symptom of this is one or more tasks failing
with checksum or similar IO-related errors and then completing
successfully when restarted on a different machine. The point is that
in a distributed system, where you see a problem occurring might not
be where it is being caused.
Third, running on multiprocessor and multi-core machines on Linux, as
many of us do, means some JVM behavior changes that aren't very well
documented. For one, on such machines the JVM will use a parallel
garbage collector. You used to have to set -XX:+UseParallelGC to
enable this option, but in later versions of Java 5 and in Java 6 it
is enabled by default when running on multi-processor machines.
Single-processor machines will still use the serial collector. If you
start seeing random JVM crashes, and the core dump files contain lines
like this:
VM_Operation (0x69a55580): parallel gc failed allocation
This means that a parallel garbage collector bug bit the JVM. To turn
off parallel garbage collection, use the -XX:+UseSerialGC option. This
can be set in the hadoop-default.xml file in the child opts and in the
hadoop-env.sh file in the HADOOP_OPTS variable.
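As a rough sketch for Hadoop versions of that era (overrides normally
go in hadoop-site.xml rather than editing hadoop-default.xml directly;
adjust the heap size to whatever you already use):

  <!-- hadoop-site.xml: JVM opts for the child map/reduce task processes -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m -XX:+UseSerialGC</value>
  </property>

  # hadoop-env.sh: opts picked up by the Hadoop daemons themselves
  export HADOOP_OPTS="$HADOOP_OPTS -XX:+UseSerialGC"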
So part of this is just a rant, but I also hope some of this
information helps someone else avoid spending a week tracking down
hardware and weird JVM problems.
Dennis Kubes