Hi,

On 2025-01-24 14:56:45 -0500, Tom Lane wrote:
> Alexander Lakhin <exclus...@gmail.com> writes:
> > There is also a reference to a discussion of the failure there:
> > https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com
> > (In short, I observed that that test suffers from bgwriter's activity.)
>
> Hmm, that's an interesting theory, but it fails to explain why
> only drongo is showing this failure.  It's not particularly slow
> compared to some other BF animals.
One interesting bit is that both on successful and failing runs the preceding
test is quite slow:

[14:18:13.386](3.192s) ok 25 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 760, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 11 pages set all-visible, 10 pages set all-frozen (0 were all-visible)
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 11.684 MB/s, avg write rate: 23.369 MB/s
buffer usage: 90 hits, 8 reads, 16 dirtied
WAL usage: 28 records, 15 full page images, 123107 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403FA80 on primary
done
[14:18:48.597](35.210s) ok 26 - inactiveslot slot invalidation is logged with vacuum on pg_class

That's really quite slow. Looking at a few other instances, it always seems to
take just a few seconds. Obviously running for longer does make it more likely
for checkpoint/bgwriter to log running xacts.

Unfortunately there's no timing visible between the vacuum and waiting for
replication... And the passing runs don't seem to have the server logs.
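For runs where the WAL is still around, pg_waldump could at least show whether
checkpoint/bgwriter logged a running-xacts record in the suspect window;
something along these lines (the data directory and LSN range below are just
placeholders, not taken from this run):

# XLOG_RUNNING_XACTS records are attributed to the Standby rmgr; adjust the
# path and -s/-e so they bracket the vacuum and the replay wait
pg_waldump -p <pgdata>/pg_wal -r Standby -s <start_lsn> -e <end_lsn>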
Because of that I'm looking at a failure:

[05:23:26.630](3.831s) ok 21 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\REL_16_STABLE\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 417 remain, 0 are dead but not yet removable
removable cutoff: 748, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.264 MB/s, avg write rate: 0.495 MB/s
buffer usage: 77 hits, 8 misses, 15 dirtied
WAL usage: 28 records, 15 full page images, 122841 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.23 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
[05:23:54.630](28.000s) ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class

The vacuum only starts at 05:23:49.294, i.e. after ~23s:

2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838
2024-09-06 05:23:49.301 UTC [552:2] [unknown] LOG: connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/REL_16_STABLE/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_primary_data/pgdata/pg_hba.conf:2)
2024-09-06 05:23:49.301 UTC [552:3] [unknown] LOG: connection authorized: user=pgrunner database=testdb application_name=035_standby_logical_decoding.pl
2024-09-06 05:23:49.355 UTC [552:4] 035_standby_logical_decoding.pl LOG: statement: VACUUM verbose pg_class;
2024-09-06 05:23:49.607 UTC [552:5] 035_standby_logical_decoding.pl LOG: statement: INSERT INTO flush_wal DEFAULT VALUES;
2024-09-06 05:23:49.609 UTC [552:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.322 user=pgrunner database=testdb host=127.0.0.1 port=51838

Before that we see:

2024-09-06 05:23:27.683 UTC [168:4] 035_standby_logical_decoding.pl LOG: statement: SELECT xmin IS NULL AND catalog_xmin IS NULL FROM pg_catalog.pg_replication_slots WHERE slot_name = 'primary_physical';
2024-09-06 05:23:31.447 UTC [5836:4] 035_standby_logical_decoding.pl LOG: statement: select pg_snapshot_xmin(pg_current_snapshot());
2024-09-06 05:23:36.963 UTC [4908:4] 035_standby_logical_decoding.pl LOG: statement: CREATE TABLE conflict_test(x integer, y text);
2024-09-06 05:23:36.972 UTC [4908:5] 035_standby_logical_decoding.pl LOG: statement: DROP TABLE conflict_test;
2024-09-06 05:23:45.192 UTC [4240:4] 035_standby_logical_decoding.pl LOG: statement: SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - 746) > 0

Between 05:23:30.797 and 05:23:49.601 nothing is logged on the standby.
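As an aside: in 16, pg_replication_slots on the standby exposes a
"conflicting" column, so if somebody can catch this live or reproduce it
locally, something like the following would show whether the active slot
actually got invalidated (host/port are placeholders, not the buildfarm's
configuration):

psql -h <standby_host> -p <standby_port> -d testdb \
    -c "SELECT slot_name, active, conflicting FROM pg_replication_slots WHERE slot_name LIKE 'row_removal%';"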
There's a lot of time between disconnects and connects in the test during
that window:

2024-09-06 05:23:31.591 UTC [5836:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.240 user=pgrunner database=testdb host=127.0.0.1 port=51817
2024-09-06 05:23:36.897 UTC [4908:1] [unknown] LOG: connection received: host=127.0.0.1 port=51829
2024-09-06 05:23:37.032 UTC [4908:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.142 user=pgrunner database=testdb host=127.0.0.1 port=51829
2024-09-06 05:23:45.135 UTC [4240:1] [unknown] LOG: connection received: host=127.0.0.1 port=51836
2024-09-06 05:23:45.272 UTC [4240:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.152 user=pgrunner database=testdb host=127.0.0.1 port=51836
2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838

That's even though, afaict, the perl script doesn't do anything else between
those connections.

> BTW, looking more closely at the failures, I notice that some of them
> are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
> So those are two different steps in the test script, and there are
> other steps that are very similar but have not failed.  Curiouser
> and curiouser.

I think that makes some sense:

"Scenario 1": Doesn't have the race to the same degree, because a VACUUM FULL
on pg_class is used, which prevents logical decoding from progressing after
it started (due to the logged AEL at the start of the VACUUM FULL).

"Scenario 2" and "Scenario 3": Nearly the same test, except once with a
database-local relation and once with a shared relation.

"Scenario 4": Doesn't expect to trigger a conflict, so the race doesn't
matter.

"Scenario 5": Theoretically racy, but the window seems much smaller; it looks
like the conflict has multiple chances to be triggered, because each of the
updates happens in a separate session (and thus transaction).

Greetings,

Andres Freund