On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
> I investigated a bug report from one of our customers and it looked very
> similar to previous bug reports here [1], [2], [3] (and probably more). In
> these reports, the error looks something like this:
> ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
> of 8192 bytes
> I traced it to the following code in MarkBufferDirtyHint().
> freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
> the FSM. I think that's usually alright because FSM changes are not WAL
> logged and if FSM ever returns a block with less free space than the caller
> needs, the caller is usually prepared to update the FSM and request for a
> new block. But if it returns a block that is outside the size of the
> relation, then we've a trouble. The very next ReadBuffer() fails to handle
> such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds on a couple of relations involving a
promoted standby, this error happening more frequently for relations
having a more aggressive autovacuum setup. I did not take the time to
look at that seriously, and I am very glad to see this email.

> When a relation is truncated, the FSM is truncated too to remove references
> to the heap blocks that are being truncated. But since the FSM buffer may
> not be marked DIRTY on the standby, if the buffer gets evicted from the
> buffer cache, the on-disk copy of the FSM page may be left with references
> to the truncated heap pages. When the standby is later promoted to be the
> master, and an insert/update is attempted to the table, the FSM may return a
> block that is outside the valid range of the relation. That results in the
> said error.


> I believe the fix is very simple. The FSM change during truncation is
> critical and the buffer must be marked by MarkBufferDirty() i.e. those
> changes must make to the disk. I think it's alright not to WAL log them
> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
> not be lost across a checkpoint. Also, since it happens only during relation
> truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that VM is similalry careful when it comes
to truncate it (visibilitymap_truncate).

> What bothers me is how to fix the problem for already affected standbys. If
> the FSM for some table is already corrupted at the standby, users won't
> notice it until the standby is promoted to be the new master. If the standby
> starts throwing errors suddenly after failover, it will be a very bad
> situation for the users, like we noticed with our customers. The fix is
> simple and users can just delete the FSM (and VACUUM the table), but that
> doesn't sound nice and they would not know until they see the problem.

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again? I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a

I have also been thinking about the comment you added and we could
just do something like that:
+       /*
+        * Mark the buffer still holding references to truncated blocks as
+        * dirty to be sure that this makes it to disk and is kept consistent
+        * with the truncated relation.
+        */
+       MarkBufferDirty(buf)

> It might also be a good idea to inspect other callers of
> MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
> standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to