On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.
Right, I missed the problem with checksums.
> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.
I'm not sure I fully understand this. If the page is flushed before the WAL
record, we might have a truncated FSM where as the relation truncation is
not replayed. But that's not the same problem, right? I mean, you might
have an FSM which is not accurate, but it won't at the least return the
blocks outside the range. Having said that, I agree your proposed changes
are more robust.
BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode. While most of
the reports so far involved standbys, and the bug can also hit a standalone
master during crash recovery, I wonder if a race can occur even on a live
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services