Hi,
On 2022-03-30 21:04:07 -0700, Andres Freund wrote:
> On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
> > On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <[email protected]> wrote:
> > > I triggered twice now, but it took a while longer the second time.
> >
> > Great.
> >
> > I wonder if you can get an RR recording...
>
> Started it, but looks like it's too slow.
>
> (gdb) p MyProcPid
> $1 = 2172500
>
> (gdb) p vacrel->NewRelfrozenXid
> $3 = 717
> (gdb) p vacrel->relfrozenxid
> $4 = 717
> (gdb) p OldestXmin
> $5 = 5112
> (gdb) p aggressive
> $6 = false
I added a bunch of debug elogs to see what sets *frozenxid_updated to true.
(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2,
  aggressive = false, skipwithvm = true, failsafe_active = false,
  consider_bypass_optimization = true, do_index_vacuuming = true,
  do_index_cleanup = true, do_rel_truncate = true, bstrategy = 0x7fe255bb0e28,
  pvs = 0x0, relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42,
  OldestXmin = 20751, vistest = 0x7fe255058970 <GlobalVisSharedRels>,
  FreezeLimit = 4244988047, MultiXactCutoff = 4289967302,
  NewRelfrozenXid = 717, NewRelminMxid = 6, skippedallvis = false,
  relnamespace = 0x7fe255c17bf8 "pg_catalog",
  relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295,
  offnum = 0, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false,
  dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8,
  removed_pages = 0, lpdead_item_pages = 0, missed_dead_pages = 0,
  nonempty_pages = 8, new_rel_tuples = 124, new_live_tuples = 42,
  indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0,
  lpdead_items = 0, live_tuples = 42, recently_dead_tuples = 82,
  missed_dead_tuples = 0}
But the debug elogs report:
relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6
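
(For context, those messages came from instrumentation along the lines of
the sketch below, dropped into the relfrozenxid branch of
vac_update_relstats() in src/backend/commands/vacuum.c. The surrounding
condition is just the existing code; only the elog is the added debugging,
and the exact messages I used aren't reproduced verbatim here.)

    if (TransactionIdIsNormal(frozenxid) &&
        pgcform->relfrozenxid != frozenxid &&
        (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
         TransactionIdPrecedes(ReadNextTransactionId(),
                               pgcform->relfrozenxid)))
    {
        /* debugging: report which values were compared when the horizon moves */
        elog(LOG, "relfrozenxid updated %u -> %u",
             pgcform->relfrozenxid, frozenxid);

        pgcform->relfrozenxid = frozenxid;
        dirty = true;
        if (frozenxid_updated)
            *frozenxid_updated = true;
    }
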
The problem is that the crashing backend reads the relfrozenxid/relminmxid
from the shared relcache init file written by another backend:
2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG:  automatic vacuum of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
    pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
    tuples: 4 removed, 114 remain, 72 are dead but not yet removable
    removable cutoff: 20751, older by 596 xids when operation ended
    new relfrozenxid: 717, which is 3 xids ahead of previous value
    new relminmxid: 6, which is 5 mxids ahead of previous value
    index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
    index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
    index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
    I/O timings: read: 0.050 ms, write: 0.102 ms
    avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
    buffer usage: 42 hits, 22 misses, 8 dirtied
    WAL usage: 13 records, 5 full page images, 33950 bytes
    system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG:  InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG:  my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG:  reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/sub>
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG:  reading non-shared init file
This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
normally fairly harmless - I think.
Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?
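
Very roughly, something like the following untested sketch (the helper name
is invented), which heap_vacuum_rel() could call instead of copying
rel->rd_rel->relfrozenxid / rd_rel->relminmxid into vacrel:

/* sketch only; needs utils/syscache.h and access/htup_details.h */
static void
vac_fetch_horizons(Oid relid, TransactionId *relfrozenxid,
                   MultiXactId *relminmxid)
{
    HeapTuple   tuple;
    Form_pg_class classForm;

    /*
     * Read pg_class via the syscache rather than trusting rd_rel, which
     * may have come from a relcache init file written by another backend.
     */
    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for relation %u", relid);
    classForm = (Form_pg_class) GETSTRUCT(tuple);

    *relfrozenxid = classForm->relfrozenxid;
    *relminmxid = classForm->relminmxid;

    ReleaseSysCache(tuple);
}

That way the values vacuum starts from would match what vac_update_relstats()
later compares against, rather than whatever the shared init file happened to
contain.
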
Greetings,
Andres Freund