Re: [BUG] Take a long time to reach consistent after pg_rewind

surya poondla Tue, 19 May 2026 14:21:37 -0700

Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind

Hi ChangAo,


Thanks for the careful diagnosis, I reproduced the hang on macOS on the
latest postgres code (It took a lot of iterations to reproduce it)
The LSN trace matches your description and I saw the below:
    minRecoveryPoint = 0/08000028
    consistent recovery state reached at = 0/08000060

In my run the standby was stuck for ~9 s; consistency was eventually
declared at 0/08000060 because a small upstream record (most likely
a RUNNING_XACTS snapshot from bgwriter) landed at 0/08000028 and let
lastReplayedEndRecPtr leap past the bad finish line.
With the new primary stopped after pg_rewind, the wait was unbounded as
expected.

Regarding the fix: the underlying issue is that minRecoveryPoint is
implicitly expected to be the end-LSN of a real WAL record, because
lastReplayedEndRecPtr (the value it gets compared against)
can only ever take such values.  All current writers respect this
expectation except pg_rewind: pg_basebackup uses the backup-end record's
EndRecPtr, and the in-running UpdateMinRecoveryPoint path
uses buffer LSNs, both of which are record-end LSNs by construction.
pg_rewind alone uses pg_current_wal_insert_lsn(), which can return a
position just past a page header when the source is idle.
That's why I'd lean toward fixing the producer (pg_rewind).

Concretely, your original suggestion having pg_rewind use
GetXLogInsertEndRecPtr() instead of GetXLogInsertRecPtr(), restores
the invariant globally, and doesn't require future call sites that compare
against minRecoveryPoint to know about page-header adjustments.

If we still want a defense-in-depth guard in CheckRecoveryConsistency() to
handle older pg_rewind binaries running against a newer server,
the v1 patch is on the right track, but I'd suggest:
  - documenting in the helper comment why exactly SizeOfXLogShortPHD /
    SizeOfXLogLongPHD past a page boundary are the only legal
    "non-record-end" minRecoveryPoint values (i.e. who can produce
    them and under what conditions);

  - auditing the other call sites that compare against
    minRecoveryPoint to confirm none of them needs the same
    adjustment, with a comment recording the conclusion.

I can put together a TAP test under src/bin/pg_rewind/t/ that forces a WAL
switch on the source, runs pg_rewind against an
otherwise-idle primary, and asserts that the rewound node reaches
consistency without further upstream activity.
Happy to send a v2 with that test if useful.

This is a liveness bug with potentially unbounded wait on idle promoted
primaries, so it does seem worth back-patching.

Regards,
Surya Poondla

Re: [BUG] Take a long time to reach consistent after pg_rewind

Reply via email to