On 2018-08-28 19:56:58 -0700, Andres Freund wrote: > Hi Everyone, > > > Tom, I think this could use your eyes. > > > On 2018-08-28 00:52:13 -0700, Andres Freund wrote: > > I've a few leads that I'm currently testing out. One observation I think > > might be crucial is that the problem, in Tomas' testcase with just > > VACUUM FULL of pg_class and INSERTs into another rel, appears to only > > happen if autovacuum is enabled. The crashes are always immediately > > after autovacuum did an analyze of the relation. > > I've run this with autovacuum disabled for quite a while, without the > problem re-occuring. It does however re-occur with pgbench running an > ANALYZE on the user defined table. > > > > What I see on the apply side, I think, that invalidation records for > > contents of pg_class and pg_class itself are intermixed. > > I've not 100% figured this out yet, and I need to switch to another task > for a bit, but what I think is happening here is that we have a > relatively fundamental issue with how rewrites of catalog tables are > handled. A slightly instrumented version of postgres shows the > following: > > 2018-08-29 04:30:17.515 CEST [19551] LOG: statement: INSERT INTO t(b,c) > SELECT i,i FROM generate_series(1,59) s(i); > ... > 2018-08-29 04:30:17.559 CEST [19554] LOG: statement: vacuum full pg_class; > 2018-08-29 04:30:17.571 CEST [19557] LOG: swapping mapped 1259/1281520 w > 1281526/1281526 > 2018-08-29 04:30:17.571 CEST [19557] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.572 CEST [19557] LOG: reindexing 2662 to 1281529 on > 1259/1281526 > 2018-08-29 04:30:17.572 CEST [19557] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.576 CEST [19557] LOG: reindexing 2663 to 1281530 on > 1259/1281526 > 2018-08-29 04:30:17.576 CEST [19557] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.578 CEST [19557] LOG: reindexing 3455 to 1281531 on > 1259/1281526 > 2018-08-29 04:30:17.578 CEST [19557] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.582 CEST [19557] LOG: releasing lock on 1259 > 2018-08-29 04:30:17.582 CEST [19557] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.582 CEST [19557] LOG: duration: 87.389 ms > 2018-08-29 04:30:17.582 CEST [19557] LOG: statement: vacuum full pg_class; > 2018-08-29 04:30:17.583 CEST [19563] LOG: duration: 630.481 ms > > 2018-08-29 04:30:17.584 CEST [19563] LOG: statement: analyze t; > 2018-08-29 04:30:17.597 CEST [19555] LOG: swapping mapped 1259/1281526 w > 1281532/1281532 > 2018-08-29 04:30:17.597 CEST [19555] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.597 CEST [19555] LOG: reindexing 2662 to 1281535 on > 1259/1281532 > 2018-08-29 04:30:17.597 CEST [19555] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.599 CEST [19555] LOG: reindexing 2663 to 1281536 on > 1259/1281532 > 2018-08-29 04:30:17.599 CEST [19555] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.601 CEST [19555] LOG: reindexing 3455 to 1281537 on > 1259/1281532 > 2018-08-29 04:30:17.601 CEST [19555] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.605 CEST [19555] LOG: releasing lock on 1259 > 2018-08-29 04:30:17.605 CEST [19555] STATEMENT: vacuum full pg_class; > 2018-08-29 04:30:17.605 CEST [19551] WARNING: could not read block 3 in file > "base/16384/1281529": read only 0 of 8192 bytes > (it's a warning because the process SIGSTOPs itself afterwards, for > easier debugging) > > So what we can see is that a first VACUUM FULL of pg_class builds a new > pg_class_oid_index/2662 index under filefilenode 1281526. Then a > *second* VACUUM FULL again reindexes 2663, this time to 1281532. > > But *after* the VACUUM FULL has finished and released the lock (that's > printed just before the UnlockRelationIdForSession in vacuum_rel()), we > can see that the INSERT from pid 19551 still accesses the OLD index, > from *before* the second VACUUM FULL. Which at that point doesn't exist > anymore. > > I think the reason for that is that an invalidation for the relmapper > changes due to the second VACUUM has been queued, but currently the > invalidation processing hasn't yet gotten to that entry yet. Instead > currently the entry for the user defined table is being rebuilt, which > then tries to access pg_class, but under its "old" relfilenode. Which > then fails. The fact that inex scans trigger this problem isn't > relevant - the heap entry for pg_class also refers to the old entry, as > I've verified using gdb. > > > I'm unclear on how can fix this, as long as we rebuild cache entries > during invalidation, rather than during access. For most cache entries > it's not a problem to rebuild them relying on other outdated content, > we'll just rebuild them again later. But for things like the relation > mapper that's not true, unfortunately.
Fwiw, if I use gdb to issue InvalidateSystemCaches(), RelationInitPhysicalAddr(sysscan->irel) makes the relcache entry have the correct relfilenode, and restarting the relcache invalidation (by judicious use of 'ret') "fixes" the issue. Greetings, Andres Freund