Re: [HACKERS] FSM corruption leading to errors

Michael Paquier Mon, 10 Oct 2016 16:51:11 -0700

On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
<[email protected]> wrote:
>
>
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <[email protected]>
> wrote:
>>
>>
>>
>> +   /*
>> +    * See comments in GetPageWithFreeSpace about handling outside the valid
>> +    * range blocks
>> +    */
>> +   nblocks = RelationGetNumberOfBlocks(rel);
>> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> +   {
>> +       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> +               spaceNeeded);
>> +   }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
>
> No, because the code above updates the FSM of those out-of-range blocks.
> But now that I look at it again, maybe this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.


Ah yes, that's what the call to
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.
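
To make the repair step concrete, here is a toy model of that loop. It is
only a sketch under assumptions: ToyFsm, toy_search(), toy_record_and_get()
and toy_get_valid_target() are hypothetical names, the FSM is modeled as a
flat array rather than the real tree of FSM pages, and nblocks is held fixed,
so the concurrent-extension case raised above is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX    /* stand-in for InvalidBlockNumber */

/* Toy FSM: avail[i] is the free space recorded for block i. */
typedef struct ToyFsm
{
    uint32_t   *avail;
    uint32_t    nslots;     /* number of entries; may exceed the relation size */
} ToyFsm;

/* Return the first block recording at least "needed" bytes, else invalid. */
static uint32_t
toy_search(const ToyFsm *fsm, uint32_t needed)
{
    for (uint32_t blk = 0; blk < fsm->nslots; blk++)
    {
        if (fsm->avail[blk] >= needed)
            return blk;
    }
    return TOY_INVALID_BLOCK;
}

/*
 * Hypothetical analogue of RecordAndGetPageWithFreeSpace(): record the new
 * free space for old_block, then search again.
 */
static uint32_t
toy_record_and_get(ToyFsm *fsm, uint32_t old_block, uint32_t old_avail,
                   uint32_t needed)
{
    assert(old_block < fsm->nslots);
    fsm->avail[old_block] = old_avail;
    return toy_search(fsm, needed);
}

/*
 * Same shape as the loop in the quoted patch: every out-of-range candidate
 * gets its entry reset to 0 before the next search, so the stale entry is
 * repaired rather than merely skipped.
 */
static uint32_t
toy_get_valid_target(ToyFsm *fsm, uint32_t nblocks, uint32_t needed)
{
    uint32_t    target;

    assert(needed > 0);     /* with 0, a zeroed entry would match forever */
    target = toy_search(fsm, needed);
    while (target >= nblocks && target != TOY_INVALID_BLOCK)
        target = toy_record_and_get(fsm, target, 0, needed);
    return target;
}
```

In this model the loop terminates because each iteration zeroes one stale
entry and nblocks never changes. Whether the real loop terminates while the
relation is being extended concurrently is exactly the open question.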

>> And isn't that going to break once the
>> relation is extended again?
>
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code mostly to deal with already corrupt FSMs. Maybe we can
> just document it and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overhead for
> everybody, irrespective of whether they have or will ever have a corrupt FSM.

Yep. I'd leave it to the release notes to carry a diagnostic method.
That's annoying, but it has been done in the past, for example for the
multixact issues.

>> I'd suggest instead putting in the release
>> notes a query that allows one to find which relations are broken
>> and have them fixed directly. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>>
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and it is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(oid, pg_relation_size(oid) / block_size) != 0)
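
The arithmetic behind that check, as a toy sketch: dividing the relation
size in bytes by the block size gives the number of blocks, which is also the
first block number past the end of the relation. The names
first_out_of_range_block() and toy_fsm_is_suspect() are hypothetical, the FSM
is again modeled as a flat array, and scanning every entry past the end
(rather than probing only the first one) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Blocks are numbered from 0, so size / block_size is both the block count
 * and the first block number that lies outside the relation.
 */
static uint64_t
first_out_of_range_block(uint64_t relation_size_bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return relation_size_bytes / block_size;
}

/*
 * Return true if any toy FSM entry at or beyond the end of the relation
 * still records non-zero free space.
 */
static bool
toy_fsm_is_suspect(const uint16_t *avail, uint64_t nslots,
                   uint64_t relation_size_bytes, uint32_t block_size)
{
    uint64_t    first_bad;

    first_bad = first_out_of_range_block(relation_size_bytes, block_size);
    for (uint64_t blk = first_bad; blk < nslots; blk++)
    {
        if (avail[blk] != 0)
            return true;
    }
    return false;
}
```

For a three-block relation with 8192-byte blocks, the first out-of-range
block is 3, so only entries from index 3 onward are inspected.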
-- 
Michael

