On Fri, Feb 10, 2017 at 7:38 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart), during
> startup, so FATAL means shutdown.
>
> I've managed to inspect the page in a different way (dd and pageinspect from
> another instance), and it looks fine - no obvious data corruption, the only
> thing that seems borked is the checksum itself, and only three consecutive
> bits are flipped in the checksum. So this doesn't seem like a "stale
> checksum" - a hardware issue is a possibility (the machine has ECC RAM
> though), but it might just as easily be a bug in PostgreSQL, when something
> scribbles over the checksum due to a buffer overflow, just before we write
> the buffer to the OS. So 'false failures' are not an entirely impossible
> thing.
>
> And no, backups may not be a suitable solution - the failure happens on a
> standby, and the page (luckily) is not corrupted on the master. Which means
> that perhaps the standby got corrupted by a WAL, which would affect the
> backups too. I can't verify this, though, because the WAL got removed from
> the archive, already. But it's a possibility.
>
> So I think we're not ready to enable checksums by default for everyone, not
> until we can provide tools to deal with failures like this (I don't think
> users will be amused if we tell them to use 'dd' and inspect the pages in a
> hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to allow
> enabling checksums on the fly. That will mostly fix the issue for people who
> actually want checksums but don't realize they need to enable them at initdb
> time (and starting from scratch is not an option for them), are running on
> good hardware and are capable of dealing with checksum errors if needed,
> even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it doesn't really help with the failures
> (particularly when you can't even start the database, because of a checksum
> failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?

FWIW, I appreciate this analysis and I think it's exactly the kind of thing
we need to set a strategy for moving forward.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)