Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind
Hi ChangAo,
Thanks for the careful diagnosis, I reproduced the hang on macOS on the
latest postgres code (It took a lot of iterations to reproduce it)
The LSN trace matches your description and I saw the below:
minRecoveryPoint = 0/08000028
consistent recovery state reached at = 0/08000060
In my run the standby was stuck for ~9 s; consistency was eventually
declared at 0/08000060 because a small upstream record (most likely
a RUNNING_XACTS snapshot from bgwriter) landed at 0/08000028 and let
lastReplayedEndRecPtr leap past the bad finish line.
With the new primary stopped after pg_rewind, the wait was unbounded as
expected.
Regarding the fix: the underlying issue is that minRecoveryPoint is
implicitly expected to be the end-LSN of a real WAL record, because
lastReplayedEndRecPtr (the value it gets compared against)
can only ever take such values. All current writers respect this
expectation except pg_rewind: pg_basebackup uses the backup-end record's
EndRecPtr, and the in-running UpdateMinRecoveryPoint path
uses buffer LSNs, both of which are record-end LSNs by construction.
pg_rewind alone uses pg_current_wal_insert_lsn(), which can return a
position just past a page header when the source is idle.
That's why I'd lean toward fixing the producer (pg_rewind).
Concretely, your original suggestion having pg_rewind use
GetXLogInsertEndRecPtr() instead of GetXLogInsertRecPtr(), restores
the invariant globally, and doesn't require future call sites that compare
against minRecoveryPoint to know about page-header adjustments.
If we still want a defense-in-depth guard in CheckRecoveryConsistency() to
handle older pg_rewind binaries running against a newer server,
the v1 patch is on the right track, but I'd suggest:
- documenting in the helper comment why exactly SizeOfXLogShortPHD /
SizeOfXLogLongPHD past a page boundary are the only legal
"non-record-end" minRecoveryPoint values (i.e. who can produce
them and under what conditions);
- auditing the other call sites that compare against
minRecoveryPoint to confirm none of them needs the same
adjustment, with a comment recording the conclusion.
I can put together a TAP test under src/bin/pg_rewind/t/ that forces a WAL
switch on the source, runs pg_rewind against an
otherwise-idle primary, and asserts that the rewound node reaches
consistency without further upstream activity.
Happy to send a v2 with that test if useful.
This is a liveness bug with potentially unbounded wait on idle promoted
primaries, so it does seem worth back-patching.
Regards,
Surya Poondla