On 2018-08-28 19:56:58 -0700, Andres Freund wrote:
> Hi Everyone,
> 
> 
> Tom, I think this could use your eyes.
> 
> 
> On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> > I've a few leads that I'm currently testing out. One observation I think
> > might be crucial is that the problem, in Tomas' testcase with just
> > VACUUM FULL of pg_class and INSERTs into another rel, appears to only
> > happen if autovacuum is enabled. The crashes are always immediately
> > after autovacuum did an analyze of the relation.
> 
> I've run this with autovacuum disabled for quite a while, without the
> problem re-occuring. It does however re-occur with pgbench running an
> ANALYZE on the user defined table.
> 
> 
> > What I see on the apply side, I think, that invalidation records for
> > contents of pg_class and pg_class itself are intermixed.
> 
> I've not 100% figured this out yet, and I need to switch to another task
> for a bit, but what I think is happening here is that we have a
> relatively fundamental issue with how rewrites of catalog tables are
> handled.  A slightly instrumented version of postgres shows the
> following:
> 
> 2018-08-29 04:30:17.515 CEST [19551] LOG:  statement: INSERT INTO t(b,c) 
> SELECT i,i FROM generate_series(1,59) s(i);
> ...
> 2018-08-29 04:30:17.559 CEST [19554] LOG:  statement: vacuum full pg_class;
> 2018-08-29 04:30:17.571 CEST [19557] LOG:  swapping mapped 1259/1281520 w 
> 1281526/1281526
> 2018-08-29 04:30:17.571 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.572 CEST [19557] LOG:  reindexing 2662 to 1281529 on 
> 1259/1281526
> 2018-08-29 04:30:17.572 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.576 CEST [19557] LOG:  reindexing 2663 to 1281530 on 
> 1259/1281526
> 2018-08-29 04:30:17.576 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.578 CEST [19557] LOG:  reindexing 3455 to 1281531 on 
> 1259/1281526
> 2018-08-29 04:30:17.578 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  releasing lock on 1259
> 2018-08-29 04:30:17.582 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  duration: 87.389 ms
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  statement: vacuum full pg_class;
> 2018-08-29 04:30:17.583 CEST [19563] LOG:  duration: 630.481 ms
> 
> 2018-08-29 04:30:17.584 CEST [19563] LOG:  statement: analyze t;
> 2018-08-29 04:30:17.597 CEST [19555] LOG:  swapping mapped 1259/1281526 w 
> 1281532/1281532
> 2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.597 CEST [19555] LOG:  reindexing 2662 to 1281535 on 
> 1259/1281532
> 2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.599 CEST [19555] LOG:  reindexing 2663 to 1281536 on 
> 1259/1281532
> 2018-08-29 04:30:17.599 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.601 CEST [19555] LOG:  reindexing 3455 to 1281537 on 
> 1259/1281532
> 2018-08-29 04:30:17.601 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.605 CEST [19555] LOG:  releasing lock on 1259
> 2018-08-29 04:30:17.605 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.605 CEST [19551] WARNING:  could not read block 3 in file 
> "base/16384/1281529": read only 0 of 8192 bytes
> (it's a warning because the process SIGSTOPs itself afterwards, for
> easier debugging)
> 
> So what we can see is that a first VACUUM FULL of pg_class builds a new
> pg_class_oid_index/2662 index under filefilenode 1281526.  Then a
> *second* VACUUM FULL again reindexes 2663, this time to 1281532.
> 
> But *after* the VACUUM FULL has finished and released the lock (that's
> printed just before the UnlockRelationIdForSession in vacuum_rel()), we
> can see that the INSERT from pid 19551 still accesses the OLD index,
> from *before* the second VACUUM FULL.  Which at that point doesn't exist
> anymore.
> 
> I think the reason for that is that an invalidation for the relmapper
> changes due to the second VACUUM has been queued, but currently the
> invalidation processing hasn't yet gotten to that entry yet. Instead
> currently the entry for the user defined table is being rebuilt, which
> then tries to access pg_class, but under its "old" relfilenode. Which
> then fails.  The fact that inex scans trigger this problem isn't
> relevant - the heap entry for pg_class also refers to the old entry, as
> I've verified using gdb.
> 
> 
> I'm unclear on how can fix this, as long as we rebuild cache entries
> during invalidation, rather than during access. For most cache entries
> it's not a problem to rebuild them relying on other outdated content,
> we'll just rebuild them again later.  But for things like the relation
> mapper that's not true, unfortunately.

Fwiw, if I use gdb to issue InvalidateSystemCaches(),
RelationInitPhysicalAddr(sysscan->irel) makes the relcache entry have
the correct relfilenode, and restarting the relcache invalidation (by
judicious use of 'ret') "fixes" the issue.

Greetings,

Andres Freund

Reply via email to