On 3/13/26 11:41, Ishan joshi wrote:
> Hi Team,
>
> I found an issue with PG v16.9 patroni setup where our standby node
> replication and disaster replication site replication broken with below
> error. It looks like WAL corruption which later part of archive file.
>
>
> CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> PANIC: WAL contains references to invalid pages"
> CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0:
> rel1663/33195/410203483, blk 25329"
> WARNING: page 25329 of relation base/33195/410203483 does not exist"
> INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a
> leader (pg-patroni-node2-0)"
> [61]LOG: terminating any other active server processes"
> [61]LOG: startup process (PID 72) was terminated by signal 6: Aborted"
> [61]LOG: shutting down due to startup process failure"
> [61]LOG: database system is shut down"
> INFO: establishing a new patroni heartbeat connection to postgres"
> INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0"
> WARNING: Retry got exception: connection problems"
> WARNING: Failed to determine PostgreSQL state from the connection,
> fallingback to cached role"
> INFO: Error communicating with PostgreSQL. Will try again later"
> WARNING: Postgresql is not running."
>
>
> Primary db was not impacted, however standby node and DR site
> replication broken, I tried to reinit with latest backup + archive
> loading from pgbackrest backup but it fails with same error once the
> corrupt wal/archive file applying the changes. I had to reinit with
> pgbasebackup with 40TB database which took about 45 hrs of time.
>
> As I understand the transcation create table ->performed DML and then
> drop the table or transaction could be rollback that makes RACE
> condition in WAL file creation and got failed while applying the same in
> standby/DR site.
>

It's hard to say what caused this, but it might be interesting to look
at the WAL using pg_waldump. Start with the WAL segment containing the
record that triggers the failure, and then also look at the WAL segments
before it for references to relation 1663/33195/410203483 (and
especially page 25329).

It is interesting that this succeeded on the primary but failed on the
standby. Is there anything special about the relation
1663/33195/410203483? Do you know if it's a regular / temporary table,
etc?


regards

-- 
Tomas Vondra
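
As a rough sketch of the pg_waldump inspection suggested above, the
snippet below maps the failing LSN to the name of the WAL segment that
should contain it and builds a pg_waldump command line filtered to the
relation and block from the error. It assumes timeline 1 and the default
16MB wal_segment_size, neither of which is stated in the report, and
/path/to/wal is only a placeholder for wherever the archived segments
live. The helper names are made up for illustration. The --relation and
--block filters need pg_waldump from PostgreSQL 15 or later.

```python
import shlex

# Assumption: default segment size; verify with SHOW wal_segment_size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn, timeline=1, segment_size=WAL_SEGMENT_SIZE):
    """Return the WAL file name that holds the given LSN (e.g. '184F3/F248B6F0')."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    segment_number = ((high << 32) | low) // segment_size
    segments_per_id = 0x100000000 // segment_size
    return "{:08X}{:08X}{:08X}".format(
        timeline,
        segment_number // segments_per_id,
        segment_number % segments_per_id,
    )


def waldump_command(lsn, relation, block, wal_dir, timeline=1):
    """Build a pg_waldump argv limited to one relation and block.

    relation is 'tablespace/database/relfilenode', as printed in the
    'blkref #0: rel ...' part of the error context.
    """
    return [
        "pg_waldump",
        "--path", wal_dir,
        "--relation", relation,
        "--block", str(block),
        wal_segment_name(lsn, timeline),
    ]


if __name__ == "__main__":
    # Values taken from the error message; wal_dir is a placeholder.
    cmd = waldump_command(
        "184F3/F248B6F0", "1663/33195/410203483", 25329, "/path/to/wal"
    )
    print(shlex.join(cmd))
```

To look at earlier segments, pg_waldump also accepts a start and an end
segment as positional arguments, so the same filters can be applied
across a range of files.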
