Hello hackers,

I'd like to present a recent failure of 035_standby_logical_decoding on
buildfarm [1]:
### Promoting node "standby"
# Running: pg_ctl --pgdata /home/bf/bf-build/scorpion/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata --log /home/bf/bf-build/scorpion/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log promote
waiting for server to promote.... done
server promoted
Waiting for replication conn cascading_standby's replay_lsn to pass 0/0648A290 
on standby
done
[10:40:16.996](0.446s) ok 84 - got expected output from SQL decoding session on 
promoted standby
[10:40:16.997](0.001s) # pump_until: process terminated unexpectedly when searching for "(?^s:^.*COMMIT.*COMMIT$)" with stream: ""
[10:40:16.997](0.000s) not ok 85 - got 2 COMMIT from pg_recvlogical output
[10:40:16.997](0.000s) #   Failed test 'got 2 COMMIT from pg_recvlogical output'
#   at 
/home/bf/bf-build/scorpion/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl
 line 1036.
[10:40:16.997](0.000s) not ok 86 - got same expected output from pg_recvlogical 
decoding session
[10:40:16.997](0.000s) #   Failed test 'got same expected output from 
pg_recvlogical decoding session'
#   at 
/home/bf/bf-build/scorpion/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl
 line 1040.
[10:40:16.998](0.000s) #          got: ''
#     expected: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
# table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
# table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
# table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
# COMMIT
# BEGIN
# table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
# table public.decoding_test: INSERT: x[integer]:6 y[text]:'6'
# table public.decoding_test: INSERT: x[integer]:7 y[text]:'7'
# COMMIT'
[10:40:17.009](0.011s) ok 87 - got expected output from SQL decoding session on 
cascading standby
[10:40:17.009](0.000s) ok 88 - got 2 COMMIT from pg_recvlogical output

035_standby_logical_decoding_standby.log contains:
2026-06-03 10:40:16.745 CEST [3955206][startup][29/0:0] LOG: received promote 
request
...
2026-06-03 10:40:16.784 CEST [3955206][startup][29/0:0] LOG: archive recovery 
complete
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] LOG: starting logical decoding 
for slot "promotion_activeslot"
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] DETAIL: Streaming transactions committing after 0/06487A20, reading WAL from 0/064879E8. 2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1') 2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] ERROR: requested WAL segment 000000010000000000000006 has already been removed 2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1')
2026-06-03 10:40:16.862 CEST [3963805][walsender][25/0:0] LOG: released logical 
replication slot "promotion_activeslot"
2026-06-03 10:40:16.862 CEST [3963805][walsender][:0] LOG: disconnection: session time: 0:00:00.212 user=bf database=testdb host=[local]
...

That is, walsender requested WAL segment for timeline 1, while in a
successful run, it reads WAL for timeline 2.

I've managed to reproduce this failure with:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1502,2 +1502,3 @@ StartLogicalReplication(StartReplicationCmd *cmd)
     ReplicationSlotAcquire(cmd->slotname, true, true);
+pg_usleep(200000);

Being executed in a loop, the test fails for me within 100 iterations
(within 10 on a slower VM):
ITERATION 23
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 84/? # pump_until: process terminated unexpectedly when searching for "(?^s:^.*COMMIT.*COMMIT$)" with stream: ""

#   Failed test 'got 2 COMMIT from pg_recvlogical output'
#   at t/035_standby_logical_decoding.pl line 1038.

#   Failed test 'got same expected output from pg_recvlogical decoding session'
#   at t/035_standby_logical_decoding.pl line 1042.
#          got: ''
#     expected: 'BEGIN
# table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
...

As far as I can see, the timeline is chosen in logical_read_xlog_page()
depending on the recovery state:
        am_cascading_walsender = RecoveryInProgress();

        if (am_cascading_walsender)
                GetXLogReplayRecPtr(&currTLI);
        else
                currTLI = GetWALInsertionTimeLine();

With diagnostic logging added here, a failed run shows:
2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl LOG:  starting logical decoding for slot "promotion_activeslot" 2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl DETAIL:  Streaming transactions committing after 0/0647C7E8, reading WAL from 0/0647C7B0. 2026-06-06 07:23:01.558 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1') 2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl LOG:  !!!logical_read_xlog_page| am_cascading_walsender: 1, currTLI: 1 2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1') 2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl LOG:  !!!WalSndSegmentOpen| nextSegNo: 6, path: pg_wal/000000010000000000000006 2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1') 2026-06-06 07:23:01.559 UTC walsender[165403] 035_standby_logical_decoding.pl ERROR:  requested WAL segment 000000010000000000000006 has already been removed

In a successful run, am_cascading_walsender/RecoveryInProgress() == false and 
currTLI == 2:
2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG:  starting logical decoding for slot "promotion_activeslot" 2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl DETAIL:  Streaming transactions committing after 0/0647C7E8, reading WAL from 0/0647C7B0. 2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1') 2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG:  !!!logical_read_xlog_page| am_cascading_walsender: 0, currTLI: 2 2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "promotion_activeslot" LOGICAL 0/00000000 ("include-xids" '0', "skip-empty-xacts" '1') 2026-06-06 07:28:38.498 UTC walsender[169751] 035_standby_logical_decoding.pl LOG:  !!!WalSndSegmentOpen| nextSegNo: 6, path: pg_wal/000000020000000000000006

Reproduced at the commit fcd77d532, which added the test.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2026-06-03%2008%3A31%3A43

Best regards,
Alexander


Reply via email to