Hi,

On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
> contrib/test_decoding with
>
> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: 
> "relcache.c", Line: 1981)
>
> for example here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09
> and here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05

Yes, I've been trying to debug that. I've finally managed to reproduce
locally. Unfortunately it's not directly reproducible on my laptop...

A fair amount of tinkering later I've found out that it's not actually
CLOBBER_CACHE_ALWAYS itself that triggers the problem but catchup
interrupts. It triggers with CLOBBER_CACHE_ALWAYS because that happens
to make parts of the code slow enough to not reach a ordinary
AcceptInvalidationMessages() in time.

The problem is that in the added regression test there can be a
situation where there's a relcache entry that's *not* currently visible
but still referenced. Consider:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 
'test_decoding');
CREATE TABLE somechange(id serial primary key);
INSERT INTO changeresult
    SELECT data FROM pg_logical_slot_get_changes(...);

While get_changes stores it data into a tuplestore there can be a moment
where 'changeresult' has a refcnt > 0 due to the INSERT, but is
invisible because we're using a historic snapshot from when changeresult
doesn't exist.

Without catchup invalidations and/or an outside reference to a relation
that's normally not a problem because it won't get reloaded from the
caches at an inappropriate time while invisible. Until a few weeks ago
there was no regression test covering that case which is why these
crashes are only there now.

It triggers via:
RelationCacheInvalidate() ->
  RelationClearRelation(relation, true) ->
        /* Build temporary entry, but don't link it into hashtable */
        newrel = RelationBuildDesc(save_relid, false);
        if (newrel == NULL)
        {
            /* Should only get here if relation was deleted */
            RelationCacheDelete(relation);
            RelationDestroyRelation(relation, false);
            elog(ERROR, "relation %u deleted while still in use", save_relid);
        }

That path is only hit while refcnt > 0

RelationDestroyRelation() has
    Assert(RelationHasReferenceCountZero(relation));

So that path doesn't really work... Without assertions we'd "just" get a
transient ERROR. I think that path should generally loose the
RelationDestroyRelation() - if it's referenced that's surely not the
right thing to do.  I'm not actually convinced logical decoding is the
only way that assert can be reached.

Since logical decoding happens inside a subtransaction (when called via
SQL) and invalidate caches one relatively straightforward way to fix
this would be to add something roughly akin to:
     Assert(!relation->rd_isvalid)
     /*
      * This should only happen when we're doing logical decoding where
      * it can happen when [above].
      */
     if (HistoricSnapshotActive())
         return;

There's no chance that such a entry could survive too long in an invalid
state (minus preexisting bugs independent of decoding) since we hit the
same paths that subtransaction abort hits.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to