Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Not at all; I just think that it's not clear that they are a net win
> for the average user, and so I'm unconvinced that turning them on by
> default is a good idea.  I could be convinced otherwise by suitable
> evidence.  What I'm objecting to is turning them on without making
> any effort to collect such evidence.

As it happens, rather unexpectedly, we had evidence of a bit-flip
happening on a 9.1.24 install show up on IRC today:

https://paste.fedoraproject.org/533186/85041907/

What that shows is the output from:

select * from heap_page_items(get_raw_page('theirtable', 4585));

One row in that output has a t_ctid of (134222313,18), even though the
page is block 4585.  Looking at the base-2 format of 4585 and 134222313,
the two differ in exactly one bit:

0000 0000 0000 0000 0001 0001 1110 1001
0000 1000 0000 0000 0001 0001 1110 1001
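
For anyone who wants to check that comparison, here is a minimal sketch in Python (not something we ran as part of the diagnosis; the variable names are just for illustration). It XORs the two block numbers and lists the bit positions where they differ:

```python
# Block number of the page being inspected, and the block number
# found in the suspicious t_ctid (both values are from the output above).
expected_block = 4585
observed_block = 134222313

# XOR leaves a 1 only in the bit positions where the two values differ.
diff = expected_block ^ observed_block

# Collect the differing bit positions (bit 0 = least significant).
flipped_bits = [i for i in range(32) if (diff >> i) & 1]

print(f"{expected_block:032b}")
print(f"{observed_block:032b}")
print("differing bit positions:", flipped_bits)
```

A single differing position is what you would expect from one flipped bit rather than from, say, a stray overwrite of the whole field.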

There appear to be other issues with the page as well, but this was
discovered through a pg_dump where the user was trying to get data
out to upgrade to something more recent.  Not clear if the errors on the
page all happened at once or if it was over time, of course, but it's at
least possible that this particular area of storage has been degrading
over time and that identifying an error when it was just the bit-flip in
the t_ctid (thanks to a checksum) might have allowed the user to pull
out the data.

During the discussion on IRC, someone else mentioned a similar problem
which was due to not having ECC memory in their server.  As discussed,
that might mean that we wouldn't have caught the corruption since we
only calculate the checksum on the way out of shared_buffers, but it's
also entirely possible that we would have because it could have happened
in kernel space too.
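
To illustrate that distinction, here is a toy sketch in Python. It is not PostgreSQL's checksum code: zlib.crc32 is only a stand-in for the page checksum, and the 8192-byte buffer and the flip_bit helper are made up for the example. It shows that a flip after the checksum is computed gets caught, while a flip before the checksum is computed does not:

```python
import zlib


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Return a copy of page with exactly one bit inverted."""
    damaged = bytearray(page)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


# 8192 bytes of arbitrary data standing in for a page in shared_buffers.
page_in_memory = bytes(range(256)) * 32

# Case 1: the checksum is computed on the way out, and the bit flips
# afterwards (for example in the kernel or on storage).
stored_checksum = zlib.crc32(page_in_memory)
page_read_back = flip_bit(page_in_memory, 12345)
detected_after_write = zlib.crc32(page_read_back) != stored_checksum

# Case 2: the bit flips in memory before the checksum is computed, so the
# checksum is calculated over already-damaged data and simply matches it.
damaged_in_memory = flip_bit(page_in_memory, 12345)
checksum_of_damaged = zlib.crc32(damaged_in_memory)
detected_before_write = zlib.crc32(damaged_in_memory) != checksum_of_damaged

print("flip after checksum detected: ", detected_after_write)
print("flip before checksum detected:", detected_before_write)
```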

We're still working with the user to see if we can get their data out,
but that looks like pretty good evidence that maybe we should care about
enabling checksums to catch corruption before it causes undue pain for
our users.

The raw page is here: https://paste.fedoraproject.org/533195/48504224/
if anyone is curious to look at it further (we're looking through it
too).

Thanks!

Stephen
