On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> Thanks for noticing this. I have pushed all your debug patches. Let's
> hope if there is a BF failure next time, we can gather enough
> information to know the reason of the same.
>
There is a new BF failure [1] after adding these LOGs, and I think I know what is going wrong. First, let's look at the standby's LOGs:

2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG:  segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4 available
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG:  xmin required by slots: data 0, catalog 741
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:  could not sync slot information as remote slot precedes local slot: remote slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN (0/4000168), catalog xmin (741)

So, from the above LOG, it is clear that the remote slot's catalog xmin (739) precedes the local catalog xmin (741), which prevents the sync on the standby from completing. Next, let's look at the LOG from the primary around the same time:

2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG:  analyzing "pg_catalog.pg_depend"
2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG:  "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0 dead rows; 1709 rows in sample, 1709 estimated total rows
...
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG:  Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes, cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG:  analyzing "information_schema.sql_features"
2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG:  "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0 dead rows; 756 rows in sample, 756 estimated total rows

This shows that an autovacuum worker has analyzed catalog tables, and to update their statistics in the pg_statistic table, it would have acquired a new transaction id. So, after the slot creation, a new transaction id that updated the catalog was generated on the primary and would have been replicated to the standby.
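As an aside, the divergence described above can be observed directly from the pg_replication_slots view on each node. This is a hypothetical manual check for anyone reproducing the issue, not something the test itself runs:

```sql
-- Run on both the primary and the standby. If autovacuum assigned a new
-- xid on the primary after slot creation, the primary slot's catalog_xmin
-- will lag behind the catalog_xmin of the standby's synced copy.
SELECT slot_name, catalog_xmin, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';
```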
Due to this, the catalog_xmin of the primary's slot would precede the standby's catalog_xmin, and we see this failure. As per this theory, we should disable autovacuum on the primary to avoid such updates to the catalog_xmin values.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-02-16%2006%3A12%3A59

--
With Regards,
Amit Kapila.