Dennis Kubes wrote:
It turns out that ECC memory did the trick. We replaced all of the memory
on our 50-node cluster with ECC memory, and it has just completed a 50
million page crawl and merge with 0 errors. Previously we would see 10-20
or more errors on this job.
I still find it interesting that the non-ECC memory passed all burn-in
and hardware tests yet still failed randomly under production conditions.
I guess a good rule of thumb is that for production Nutch and Hadoop
systems, ECC memory is always the way to go. Anyway, thanks for all the
help in getting this problem resolved.
This is good validation of how important ECC memory is. Currently the
HDFS client deletes a block when it notices a checksum error. After we
move to block-level CRCs soon, we should make the DataNode re-validate
the block before deciding to delete it.
Raghu.
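
For reference, here is a rough sketch of the per-chunk CRC idea Raghu
mentions, in plain Java. The chunk size, class name, and method names
below are illustrative assumptions, not the actual HDFS code; the point
is only that re-validating means recomputing the checksums from the
block's bytes and comparing them to the stored values before giving up
on the block.

// Illustrative only: per-chunk CRC32 over a block, loosely mirroring the
// idea of block-level checksums. Names and chunk size are assumptions.
import java.util.zip.CRC32;

public class BlockChecksumSketch {
    static final int BYTES_PER_CHECKSUM = 512;  // assumed chunk size

    // Compute one CRC32 per 512-byte chunk of the block.
    static long[] checksumChunks(byte[] block) {
        int nChunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[nChunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < nChunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, block.length - off);
            crc.reset();
            crc.update(block, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Re-validate before discarding: only call the block corrupt if the
    // recomputed checksums really disagree with the stored ones.
    static boolean isCorrupt(byte[] block, long[] storedSums) {
        long[] recomputed = checksumChunks(block);
        if (recomputed.length != storedSums.length) return true;
        for (int i = 0; i < recomputed.length; i++) {
            if (recomputed[i] != storedSums[i]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] block = new byte[4096];
        for (int i = 0; i < block.length; i++) block[i] = (byte) i;
        long[] sums = checksumChunks(block);
        System.out.println("corrupt before bit flip: " + isCorrupt(block, sums));
        block[1000] ^= 0x01;  // simulate a single-bit memory error
        System.out.println("corrupt after bit flip:  " + isCorrupt(block, sums));
    }
}

A single flipped bit shows up as a checksum mismatch, which is exactly
the kind of silent corruption this thread is about.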
Dennis Kubes
Dennis Kubes wrote:
Doug Cutting wrote:
Dennis Kubes wrote:
Do we know if this is a hardware issue? If it is possibly a software
issue, I can dedicate some resources to tracking down bugs; I would just
need a little guidance on where to start looking.
We don't know. The checksum mechanism is designed to catch hardware
problems, so one must certainly consider that a likely cause. If it is
instead a software bug, then it should be reproducible. Are you
seeing any consistent patterns? If not, then I'd lean towards hardware.
Michael Stack has some experience tracking down problems with flaky
memory. Michael, did you use a test program to validate the memory
on a node?
Again, do your nodes have ECC memory?
Sorry, I was checking on that. No, the nodes don't have ECC memory.
I just priced it out and it is only $20 more per GB to go ECC, so I
think that is what we are going to do. We are going to run some tests,
and I will keep the list updated on the progress. Thanks for your help.
Dennis Kubes
Doug
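
On the memory-validation question above: dedicated testers such as
memtest86 run outside the operating system and exercise physical memory
directly, which a userspace program cannot do, but the basic
write-then-verify loop they rely on can be sketched in a few lines of
Java. The buffer size, pass count, and seed below are arbitrary choices
for illustration, and a sketch like this is no substitute for a real
burn-in tool.

// Illustrative write-then-verify memory exercise; not a real burn-in tool.
import java.util.Random;

public class MemoryPatternSketch {
    public static void main(String[] args) {
        final int sizeBytes = 256 * 1024 * 1024;  // 256 MB buffer (arbitrary)
        final int passes = 8;                     // arbitrary pass count
        final long seed = 42L;

        byte[] buf = new byte[sizeBytes];
        for (int pass = 0; pass < passes; pass++) {
            // Fill the buffer with a deterministic pseudo-random pattern.
            Random fill = new Random(seed + pass);
            for (int i = 0; i < buf.length; i++) {
                buf[i] = (byte) fill.nextInt();
            }
            // Regenerate the same sequence and compare what was read back.
            Random check = new Random(seed + pass);
            long mismatches = 0;
            for (int i = 0; i < buf.length; i++) {
                if (buf[i] != (byte) check.nextInt()) {
                    mismatches++;
                }
            }
            System.out.println("pass " + pass + ": " + mismatches + " mismatches");
        }
    }
}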