Re: Online verification of checksums

Julien Rouhaud Fri, 08 Mar 2019 10:00:34 -0800

On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra
<[email protected]> wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <[email protected]> wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely.  I already saw servers with all-frozen-read-only blocks
> > popular enough to never get evicted in months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?


Ah, yes sorry I should have been clearer.  Indeed, there was silent
data corruptions (no ckecksum though) that was revealed by the
restart.  So a routine minor update resulted in a massive outage.
Such a scenario can't be avoided if we always bypass checksum check
for alreay in shared_buffers pages.

Re: Online verification of checksums

Reply via email to