Fix race with timeline selection in logical decoding during promotion

During promotion, there is a window where RecoveryInProgress() returns
true but the WAL segments of the old timeline have already been removed.
A logical decoding session could pick up the old timeline in this window
when reading a page, failing with the following error:
ERROR: requested WAL segment ... has already been removed
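
The window described above can be illustrated with a small toy model. This
is only a sketch under stated assumptions, not the code changed by this
commit: the ToyPhase and ToyServer types and all toy_* functions are
hypothetical names invented for illustration, and the three phases are a
simplification of what promotion actually does. The model only shows why
a timeline chosen from the recovery flag alone can point at segments that
no longer exist, and why a later attempt succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical promotion phases; not PostgreSQL's actual state machine. */
typedef enum
{
	PHASE_IN_RECOVERY,		/* standby replaying, old-timeline segments present */
	PHASE_SEGMENTS_REMOVED, /* old-timeline segments gone, recovery flag still set */
	PHASE_PROMOTED			/* promotion complete, recovery flag cleared */
} ToyPhase;

typedef struct
{
	ToyPhase	phase;
	uint32_t	old_tli;	/* timeline followed while in recovery */
	uint32_t	new_tli;	/* timeline created by the promotion */
} ToyServer;

/* Toy stand-in for the recovery flag: true until promotion completes. */
static bool
toy_recovery_in_progress(const ToyServer *s)
{
	return s->phase != PHASE_PROMOTED;
}

/* Timeline selection that relies only on the recovery flag. */
static uint32_t
toy_select_timeline(const ToyServer *s)
{
	return toy_recovery_in_progress(s) ? s->old_tli : s->new_tli;
}

/* Old-timeline segments exist only before removal; new ones only after. */
static bool
toy_segment_available(const ToyServer *s, uint32_t tli)
{
	if (tli == s->old_tli)
		return s->phase == PHASE_IN_RECOVERY;
	return tli == s->new_tli && s->phase != PHASE_IN_RECOVERY;
}

/* Returns false where the real server would report a removed segment. */
static bool
toy_read_page(const ToyServer *s)
{
	assert(s != NULL);
	return toy_segment_available(s, toy_select_timeline(s));
}
```

In this model, toy_read_page() fails only in PHASE_SEGMENTS_REMOVED, where
the recovery flag is still set but the old timeline's segments are gone.
Calling it again once the phase reaches PHASE_PROMOTED succeeds, because
the new timeline is then selected.
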

This issue does not affect data correctness, as retrying the decoding
succeeds in follow-up attempts. It impacts availability, though.

Other WAL page read callbacks have a similar issue; this commit takes
care of what should be the noisiest code path: logical decoding with
START_REPLICATION in a WAL sender.

A TAP test, based on an injection point waiting in the startup process
after the segments have been removed/recycled, is added. This part is
backpatched down to v17.

This issue has been causing sporadic failures in the buildfarm, and was
reproducible manually. It has existed since logical decoding on standbys
was introduced, in v16.

Reported-by: Alexander Lakhin <[email protected]>
Author: Bertrand Drouvot <[email protected]>
Reviewed-by: Hayato Kuroda <[email protected]>
Reviewed-by: Xuneng Zhou <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 16

Branch
------
REL_16_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/8cd687c44b62cda3cc6dec88d414e8e2837dc56b

Modified Files
--------------
src/backend/replication/walsender.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
