On Sep 30, 2008, at 2:17 PM, [EMAIL PROTECTED] wrote:
A customer of ours has been having trouble with corrupted data for some
time.  Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.

That is an important statement: the goal is to know when corruption happens, not necessarily to be able to recover the block or to know where in the block it is corrupt. Is that
correct?

Oh, correcting the corruption would be AWESOME beyond belief! But at this point I'd settle for just knowing it had happened.

So we've been tasked with adding CRCs to data files.

CRC or checksum? If the objective is merely general "detection" there
should be some latitude in choosing the methodology for performance.

See above. Perhaps the best win would be a case where you could choose which method you wanted. We generally have extra CPU on the servers, so we could afford to burn some cycles with more complex algorithms.
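To make the "choose which method you wanted" idea concrete, here is a minimal sketch in Python rather than in the server's own code. The names `CHECKSUM_METHODS` and `block_checksum` are hypothetical, and the two algorithms are simply the ones available in Python's standard `zlib` module, not a proposal from this thread.

```python
import zlib

# Hypothetical registry of selectable methods; both come from Python's
# standard zlib module and are used here purely for illustration.
CHECKSUM_METHODS = {
    "crc32": zlib.crc32,      # stronger error detection, more CPU
    "adler32": zlib.adler32,  # cheaper, weaker on short inputs
}


def block_checksum(page: bytes, method: str = "crc32") -> int:
    """Return a 32-bit check value for one block using the chosen method."""
    try:
        func = CHECKSUM_METHODS[method]
    except KeyError:
        raise ValueError(f"unknown checksum method: {method}") from None
    # Mask to 32 bits so every method returns a value of the same width.
    return func(page) & 0xFFFFFFFF
```

A per-installation setting could then pick the method name, trading CPU cycles for detection strength.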

The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them.  They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.

It will actually find faults in all of it. If the CPU can't add and/or a RAM location lost a bit, this will blow up just as easily as a bad block. It may cause "false identification" of an error, but it will keep a bad
system from hiding.

Well, very likely not, since the intention is to only compute the CRC when we write the block out, at least for now. In the future I would like to be able to detect when a CPU or memory goes bonkers and poops on something, because that's actually happened to us as well.
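The point about computing the CRC only at write time can be illustrated with a small Python sketch. The helper names `flush` and `verify` and the 8192-byte page size are assumptions made for this example only. If a bit flips in memory before the CRC is calculated, the CRC covers the already-damaged page, and the later read check passes.

```python
import zlib

PAGE_SIZE = 8192  # assumed page size for this sketch


def flush(page: bytes):
    """Hypothetical write path: compute the CRC just before writing."""
    return page, zlib.crc32(page)


def verify(page: bytes, crc: int) -> bool:
    """Hypothetical read path: recompute the CRC and compare."""
    return zlib.crc32(page) == crc


# The page as the database intended it.
intended = bytes(PAGE_SIZE)

# Simulate faulty RAM flipping one bit BEFORE the CRC is computed.
in_memory = bytearray(intended)
in_memory[100] ^= 0x01

on_disk, crc = flush(bytes(in_memory))

# The read-time check passes even though the data is wrong.
undetected = verify(on_disk, crc) and on_disk != intended
```

Catching that class of fault would need the check value to be computed earlier, or rechecked while the page sits in memory.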

The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
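As a rough illustration of the flow described above, here is a toy Python model, not PostgreSQL code. The class `CrcForkStore`, its methods, and the 8192-byte block size are all assumptions for this sketch; the two dictionaries merely stand in for the main fork and the proposed CRC fork.

```python
import zlib

BLOCK_SIZE = 8192  # assumed block size for this sketch


class CrcForkStore:
    """Toy model: a main fork of blocks plus a separate per-block CRC fork."""

    def __init__(self):
        self.main_fork = {}  # block number -> page bytes
        self.crc_fork = {}   # block number -> CRC32 of that page

    def flush_block(self, blkno: int, page: bytes) -> None:
        # Analogous to the FlushBuffer step: checksum just before writing.
        if len(page) != BLOCK_SIZE:
            raise ValueError("page has the wrong size")
        self.crc_fork[blkno] = zlib.crc32(page)
        self.main_fork[blkno] = bytes(page)

    def read_block(self, blkno: int) -> bytes:
        # Analogous to the ReadBuffer_common step: read, recompute, compare.
        page = self.main_fork[blkno]
        expected = self.crc_fork[blkno]
        actual = zlib.crc32(page)
        if actual != expected:
            raise IOError(
                f"CRC mismatch in block {blkno}: "
                f"stored {expected:#010x}, computed {actual:#010x}"
            )
        return page
```

Note that in this model a mismatch only says the page and its stored CRC disagree; it does not say which of the two was damaged.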

Hell, all that is needed is a long or a short checksum value in the block.
I mean, if you just want a sanity test, it doesn't take much. Using a
second relation creates confusion. If there is a CRC discrepancy between two different blocks, who's wrong? You need a third "control" to know. If
the block knows its CRC or checksum and that is in error, the block is
bad.
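The in-block alternative might look like the following Python sketch. The layout is purely hypothetical (a 4-byte little-endian CRC32 at the start of an 8192-byte page, covering the rest of the page) and the function names are invented for illustration; it is not an actual page format.

```python
import struct
import zlib

BLOCK_SIZE = 8192  # assumed block size for this sketch
CRC_BYTES = 4      # hypothetical: CRC32 stored in the first 4 bytes


def seal_page(payload: bytes) -> bytes:
    """Build a page whose first 4 bytes hold the CRC32 of the remainder."""
    if len(payload) != BLOCK_SIZE - CRC_BYTES:
        raise ValueError("payload has the wrong size")
    return struct.pack("<I", zlib.crc32(payload)) + payload


def page_is_valid(page: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored one."""
    if len(page) != BLOCK_SIZE:
        return False
    (stored,) = struct.unpack_from("<I", page, 0)
    return stored == zlib.crc32(page[CRC_BYTES:])
```

Here a failed check condemns the block as a whole, whether the damage hit the payload or the stored CRC, so no second file has to be consulted. The cost is that the check value occupies space inside the page itself.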

I believe the idea was to make this as non-invasive as possible. And it would be really nice if this could be enabled without a dump/reload (maybe the upgrade stuff would make this possible?)
--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

