I investigated a bug report from one of our customers and it looked very
similar to previous bug reports here [1], [2], [3] (and probably more). In
these reports, the error looks something like this:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint(). The function
returns without setting the DIRTY bit on the standby:

    /*
     * If we're in recovery we cannot dirty a page because of a hint.
     * We can set the hint, just not dirty the page as a result so the
     * hint is lost when we evict the page or shutdown.
     *
     * See src/backend/storage/page/README for longer discussion.
     */
    if (RecoveryInProgress())
        return;

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
the FSM. I think that's usually alright because FSM changes are not
WAL-logged, and if the FSM ever returns a block with less free space than
the caller needs, the caller is prepared to update the FSM and request a
new block. But if it returns a block that lies beyond the end of the
relation, we're in trouble: the very next ReadBuffer() fails to handle such
a block and throws the error above.

When a relation is truncated, the FSM is truncated too to remove references
to the heap blocks that are being truncated. But since the FSM buffer may
not be marked DIRTY on the standby, if the buffer gets evicted from the
buffer cache, the on-disk copy of the FSM page may be left with references
to the truncated heap pages. When the standby is later promoted to be the
master, and an insert/update is attempted to the table, the FSM may return
a block that is outside the valid range of the relation. That results in
the said error.

Once this was clear, it was easy to put together a fully reproducible test
case. See the attached script; you'll need to adjust to your environment.
This affects all releases starting with 9.3, and the script can reproduce
the problem on all of them.

I believe the fix is very simple. The FSM change during truncation is
critical, and the buffer must be marked dirty with MarkBufferDirty(), i.e.
those changes must make it to disk. I think it's alright not to WAL-log
them, because XLOG_SMGR_TRUNCATE redo will recreate them if a crash occurs,
but they must not be lost across a checkpoint. And since this happens only
during relation truncation, I don't see any problem from a performance
perspective.
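
For illustration, the change I have in mind in the FSM truncation path
would look roughly like this (a simplified sketch, not the verbatim
freespace.c source; see the attached patch for the actual change):

```c
/* Sketch only; simplified from the FSM truncation path in freespace.c. */

/* Today (roughly): the hint-style dirtying is silently skipped during
 * recovery, so the truncated FSM page can be evicted on the standby
 * without the change ever reaching disk. */
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
MarkBufferDirtyHint(buf, false);

/* Proposed: force a real dirty.  No WAL record is needed, because
 * replaying XLOG_SMGR_TRUNCATE re-truncates the FSM after a crash,
 * but the page must survive eviction and checkpoints. */
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
MarkBufferDirty(buf);
```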

What bothers me is how to fix the problem for already-affected standbys. If
the FSM for some table is already corrupted on the standby, users won't
notice it until the standby is promoted to be the new master. If the
standby suddenly starts throwing errors after failover, that's a very bad
situation for users, as we saw with our customers. The repair itself is
simple (delete the FSM and VACUUM the table), but that doesn't sound nice,
and users won't know to do it until they actually hit the problem.

One idea is to always check whether the block returned by the FSM is
outside the valid range, and discard such blocks after updating the FSM
(the attached patch does that). The problem with that approach is that
RelationGetNumberOfBlocks() is not really cheap, and invoking it every time
the FSM is consulted may not be a bright idea. Can we cache that value in
RelationData or some such place (BulkInsertState?) and use it as a hint for
quickly checking whether the block is (potentially) outside the range, and
discard it if so? Any other ideas?

The other concern I have (and TBH that's what I initially thought was the
real problem, until I saw the RecoveryInProgress()-specific code) is: can
this also affect stand-alone masters? The comments at MarkBufferDirtyHint()
made me think so:

 * 3. This function does not guarantee that the buffer is always marked dirty
 *    (due to a race condition), so it cannot be used for important changes.

So I was working with the theory that updates to the FSM page are somehow
lost because the race mentioned in the comment actually kicks in. But I'm
not sure whether that race is possible only when the caller holds a SHARE
lock on the buffer. When the FSM is truncated, the caller holds an
EXCLUSIVE lock on the FSM buffer, so we're probably safe. I could not
reproduce the issue on a stand-alone master, but it's probably worth
checking.

It might also be a good idea to inspect the other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially
from a standby perspective. I did one round and couldn't spot another
problem.



 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: pg_fsm_corruption_fix.patch
Description: Binary data

Attachment: pg_trigger_fsm_error.sh
Description: Bourne shell script

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)