I'm sorry these issues came up; sometimes it's the hardware that lets you down.
 
Somewhat related, but recommended reading for later (when the issues are 
settled), over a beer or two:
 
http://labs.google.com/papers/disk_failures.pdf


----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, April 20, 2007 8:50:00 PM
Subject: Hardware Crashes and Garbage Collection on Nutch/Hadoop


So we moved 50 machines to a data center for a beta cluster of a new 
search engine based on Nutch and Hadoop.  We fired all of the machines 
up, started fetching, and almost immediately began experiencing JVM 
crashes and checksum/IO errors that caused jobs to fail, tasks to fail, 
and random data corruption.  After digging through and fixing the 
problems, we came up with some observations that may seem obvious but 
may also help someone else avoid the same issues.

First, checksum errors are not a normal occurrence.  If you are 
experiencing a lot of checksum errors or data corruption, you very well 
could have a hardware problem.  As developers we many times tend to 
thing the "problem" is somewhere in our code.  Sometimes that is just 
not the case.  With systems such as Nutch and Hadoop that can put 
extreme load on machines and keep it sustained for long periods of time, 
hardware problems will surface.  When debugging problems, don't limit 
yourself to just the software.  If you think hardware might be bad you 
can use programs such as badblocks and memtest to test it out.  In our 
cluster we had 2 machines out of 50 that had hardware problems.  When 
they were removed from the cluster most of the errors disappeared.
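
If you do suspect a disk, badblocks can do a non-destructive read-only 
scan from a running system, roughly like this (the device name is just 
an example):

   # read-only surface scan; -s shows progress, -v is verbose
   badblocks -sv /dev/sda

memtest86+ normally has to be booted from a CD or USB stick rather than 
run from a shell; memtester, a userland alternative, can test a chunk 
of RAM on a live box, e.g. "memtester 1024M 3" for 1 GB over 3 passes.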

Second, running Nutch and Hadoop in a distributed environment means that 
jobs share data across machines.  So if a task tracker or data node 
keeps failing on a single machine (or a single set of machines), that 
machine could have hardware problems.  If you continually get checksum 
errors across many machines and it seems random, it probably isn't. 
More than likely it is caused by a single machine (or set of machines) 
with hardware problems sharing its data (i.e. map outputs, etc.) with 
other machines.  The symptom of this is one or more tasks failing with 
checksum or similar IO-related errors and then completing successfully 
when restarted on a different machine.  So the point is that in a 
distributed system, where you see a problem occurring might not be 
where it is being caused.
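
One rough way to narrow this down (just a sketch; the exception name, 
log path, and slaves file below are assumptions based on a stock 
install, so adjust for your setup) is to tally checksum errors per node 
and see whether they cluster on one or two hosts:

   # count checksum errors in the Hadoop logs on each slave
   # (log path and exception name are examples, not gospel)
   for h in $(cat conf/slaves); do
     echo "== $h =="
     ssh $h 'grep -c ChecksumException /usr/local/hadoop/logs/*.log'
   done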

Third, running on multiprocessor and multi-core machines on Linux, as 
many of us do, means some JVM changes that aren't very well documented. 
For one, multi-processor machines will use a parallel garbage 
collector.  You used to have to set -XX:+UseParallelGC to enable this 
option, but in later versions of Java 5 and in Java 6 it is enabled by 
default on multi-processor machines.  Single-processor machines will 
still use the serial collector.  If you start seeing random JVM crashes 
and the core dump files contain lines like this:

VM_Operation (0x69a55580): parallel gc failed allocation

This means that a parallel garbage collector bug bit the JVM.  To turn 
off parallel garbage collection, use the -XX:+UseSerialGC option.  This 
can be set in the child JVM opts in hadoop-default.xml and in the 
HADOOP_OPTS variable in hadoop-env.sh.
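
For example, roughly like this (assuming the usual 
mapred.child.java.opts property; check the exact property name and the 
default heap setting against your own hadoop-default.xml):

   <property>
     <name>mapred.child.java.opts</name>
     <!-- keep your existing heap setting; -Xmx200m is just the stock default -->
     <value>-Xmx200m -XX:+UseSerialGC</value>
   </property>

and in hadoop-env.sh:

   # extra Java runtime options for the Hadoop daemons
   export HADOOP_OPTS="-XX:+UseSerialGC"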

So part of this is just a rant, but I also hope some of this 
information helps someone else avoid spending a week tracking down 
hardware and weird JVM problems.

Dennis Kubes