It turns out that ECC memory did the trick. We replaced all of the memory on
our 50-node cluster with ECC memory, and it has just completed a 50
million page crawl and merge with 0 errors. Before, we would see 10-20
errors or more on this job.
I still find it interesting that the non-ECC memory passed all burn-in
and hardware tests yet still failed randomly under production
conditions. I guess a good rule of thumb is that for production Nutch
and Hadoop systems, ECC memory is always the way to go. Anyway, thanks
for all the help in getting this problem resolved.
Dennis Kubes
Dennis Kubes wrote:
Doug Cutting wrote:
Dennis Kubes wrote:
Do we know if this is a hardware issue? If it is possibly a software
issue, I can dedicate some resources to tracking down bugs. I would
just need a little guidance on where to start looking.
We don't know. The checksum mechanism is designed to catch hardware
problems, so one must certainly consider that a likely cause. If it
is instead a software bug, then it should be reproducible. Are you
seeing any consistent patterns? If not, then I'd lean towards hardware.
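For context, Hadoop records a CRC32 checksum for data as it is written and
re-checks it when the data is read back, which is how a bit flip from flaky
memory or disk surfaces as a checksum error. A minimal standalone sketch of
that idea (illustrative only, not Hadoop's actual code):

import java.util.zip.CRC32;

// Toy sketch of block-level checksum verification: a CRC32 recorded at
// write time is re-checked at read time, so a bit flip introduced in
// between shows up as a mismatch.
public class ChecksumSketch {

    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block of crawl data".getBytes();
        long stored = checksum(block);   // recorded when the block is written

        block[3] ^= 0x01;                // simulate a single bit flip

        if (checksum(block) != stored) { // verified when the block is read back
            System.err.println("Checksum error: block was corrupted");
        }
    }
}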
Michael Stack has some experience tracking down problems with flaky
memory. Michael, did you use a test program to validate the memory on
a node?
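A memory test program of that kind generally writes known bit patterns across
large allocations and reads them back looking for flips. A toy user-space
sketch of the approach (hypothetical, not any particular tool; dedicated
testers such as memtest86+ boot outside the OS and are far more thorough):

// Fill large arrays with known bit patterns, read them back, and count
// flips. A JVM heap test can only touch a slice of physical memory, so
// treat this as an illustration only. Run with a large heap, e.g.
// java -Xmx1200m MemoryPatternTest
public class MemoryPatternTest {
    public static void main(String[] args) {
        final int blocks = 64;                   // assumed test size: 64 x 16 MB = 1 GB
        final int wordsPerBlock = (16 * 1024 * 1024) / 8;
        long[][] data = new long[blocks][];
        long[] patterns = { 0L, ~0L, 0xAAAAAAAAAAAAAAAAL, 0x5555555555555555L };

        for (long pattern : patterns) {
            for (int b = 0; b < blocks; b++) {   // write the pattern everywhere
                data[b] = new long[wordsPerBlock];
                java.util.Arrays.fill(data[b], pattern);
            }
            long errors = 0;
            for (int b = 0; b < blocks; b++) {   // read it back, count mismatches
                for (long word : data[b]) {
                    if (word != pattern) errors++;
                }
            }
            System.out.printf("pattern %016x: %d mismatches%n", pattern, errors);
        }
    }
}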
Again, do your nodes have ECC memory?
Sorry, I was checking on that. No, the nodes don't have ECC memory. I
just priced it out, and it is only $20 more per GB to go ECC, so I think
that is what we are going to do. We are going to run some tests, and I
will keep the list updated on the progress. Thanks for your help.
Dennis Kubes
Doug