Re: [HACKERS] Checksums by default?

Tomas Vondra Sun, 12 Feb 2017 18:42:35 -0800

On 02/13/2017 02:29 AM, Jim Nasby wrote:

On 2/10/17 6:38 PM, Tomas Vondra wrote:

And no, backups may not be a suitable solution - the failure happens on
a standby, and the page (luckily) is not corrupted on the master. Which
means that perhaps the standby got corrupted by a WAL, which would
affect the backups too. I can't verify this, though, because the WAL got
removed from the archive, already. But it's a possibility.


Possibly related... I've got a customer that periodically has SR replicas
stop in their tracks due to WAL checksum failure. I don't think there's
any hardware correlation (they've seen this on multiple machines).
Studying the code, it occurred to me that if there are any bugs in the
handling of individual WAL record sizes or pointers during SR then you
could get CRC failures. So far every one of these occurrences has been
repairable by replacing the broken WAL file on the replica. I've
requested that next time this happens they save the bad WAL.

I don't follow. You're talking about WAL checksums, this thread is about
data checksums. I'm not seeing any WAL checksum failure, but when the
standby attempts to apply the WAL (in particular a Btree/DELETE WAL
record), it detects an incorrect data checksum in the underlying table.

So either there's a hardware issue, or the heap got corrupted by some
preceding WAL. Or maybe one of the tiny gnomes in the CPU got tired and
punched the bits wrong.
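
To make the distinction concrete, here is a toy Python sketch (not
PostgreSQL code). It assumes a made-up record and page layout and uses
zlib.crc32 as a stand-in for both checks; PostgreSQL actually uses CRC-32C
for WAL records and a separate 16-bit algorithm for data pages. All
function names are hypothetical. The only point is that the two checks are
independent: an intact WAL record can still run into a page whose stored
checksum no longer matches its contents.

```python
import zlib

PAGE_SIZE = 64  # toy size (assumption); real pages are 8192 bytes by default


def toy_checksum(data: bytes) -> int:
    """Stand-in checksum; NOT the algorithm PostgreSQL uses for WAL or pages."""
    return zlib.crc32(data) & 0xFFFFFFFF


def make_wal_record(payload: bytes) -> dict:
    """Build a toy WAL record that carries its own checksum."""
    return {"payload": payload, "crc": toy_checksum(payload)}


def make_page(body: bytes) -> dict:
    """Build a toy data page that stores a checksum of its body."""
    body = body.ljust(PAGE_SIZE, b"\0")
    return {"body": body, "checksum": toy_checksum(body)}


def wal_record_ok(record: dict) -> bool:
    return toy_checksum(record["payload"]) == record["crc"]


def page_ok(page: dict) -> bool:
    return toy_checksum(page["body"]) == page["checksum"]


def flip_first_bit(data: bytes) -> bytes:
    """Simulate corruption by flipping the lowest bit of the first byte."""
    damaged = bytearray(data)
    damaged[0] ^= 0x01
    return bytes(damaged)


def apply_record(page: dict, record: dict) -> str:
    """Replay one toy record, checking the record first and the page second."""
    if not wal_record_ok(record):
        return "WAL checksum failure"
    if not page_ok(page):
        return "data checksum failure"
    page["body"] = record["payload"].ljust(PAGE_SIZE, b"\0")
    page["checksum"] = toy_checksum(page["body"])
    return "applied"


if __name__ == "__main__":
    good_record = make_wal_record(b"delete index tuple")

    # Intact record, damaged page: the WAL check passes, the page check fails.
    damaged_page = make_page(b"index page")
    damaged_page["body"] = flip_first_bit(damaged_page["body"])
    print(apply_record(damaged_page, good_record))

    # Damaged record, intact page: replay stops at the WAL check.
    bad_record = dict(good_record, payload=flip_first_bit(good_record["payload"]))
    print(apply_record(make_page(b"index page"), bad_record))
```

The first case in the demo (intact record, damaged page) corresponds to
what I'm seeing on the standby; the second (damaged record, intact page)
corresponds to the WAL checksum failures Jim describes.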


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


