On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <and...@anarazel.de> wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely. I already saw servers with all-frozen-read-only blocks
> > popular enough to never get evicted in months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?
Ah, yes, sorry, I should have been clearer. Indeed, there were silent data corruptions (no checksums though) that were revealed by the restart. So a routine minor update resulted in a massive outage. Such a scenario can't be avoided if we always bypass the checksum check for pages that are already in shared_buffers.
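
To make the difference concrete, here is a toy model of the two policies discussed above. It is only a sketch: the page layout, the CRC32 checksum and every name in it (make_page, page_is_valid, check_block, skip_cached) are made up for illustration and are not PostgreSQL's page format, checksum algorithm or API. It also ignores the "IO in progress" case mentioned above.

```python
import zlib

PAGE_SIZE = 8192  # toy page size, chosen for illustration only


def make_page(payload: bytes) -> bytes:
    """Build a toy page: a 4-byte CRC32 header followed by a zero-padded body."""
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    return zlib.crc32(body).to_bytes(4, "big") + body


def page_is_valid(page: bytes) -> bool:
    """Recompute the CRC32 of the body and compare it with the stored header."""
    stored = int.from_bytes(page[:4], "big")
    return stored == zlib.crc32(page[4:])


def check_block(blkno, disk, shared_buffers, skip_cached):
    """Return True if the block is considered OK under the chosen policy.

    skip_cached=True  : trust any block that is present in shared_buffers
                        (option (a) from the quoted proposal).
    skip_cached=False : always verify the on-disk copy, cached or not.
    """
    if skip_cached and blkno in shared_buffers:
        return True
    return page_is_valid(disk[blkno])


# A hot, read-only block: the cached copy is fine, the on-disk copy is not.
good = make_page(b"frozen tuple data")
corrupt = bytearray(good)
corrupt[100] ^= 0xFF  # flip one byte of the on-disk copy

disk = {0: bytes(corrupt)}
shared_buffers = {0: good}

skip_result = check_block(0, disk, shared_buffers, skip_cached=True)
full_result = check_block(0, disk, shared_buffers, skip_cached=False)
print("skip cached pages:", skip_result)
print("always check disk:", full_result)
```

With skip_cached=True the damaged on-disk copy goes unnoticed for as long as the block stays cached, which is the situation I described: nothing complains until a restart forces the block to be read from disk again.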