Hi PostgreSQL community,

I debugged an instance where a PostgreSQL standby would not switch to streaming replication when the `restore_command` fails. I first posted this to the pgsql-admin mailing list, but I am trying here since I got no response.

*Expectation*
I expect PostgreSQL to try switching to streaming replication if the `restore_command` fails.

*What happens*
PostgreSQL attempts to restore the previously restored WAL segment again and then retries the failed segment. However, because the primary produces WAL at a high rate, the failed segment exists in the archive by the time of the retry, so the restore succeeds and PostgreSQL does not try to switch to streaming replication.

*Context*
Running PostgreSQL 15.7 in Kubernetes using the CloudNative PostgreSQL Operator.

*Logs*
I configured PostgreSQL to emit DEBUG3 level logs. Newest logs first, oldest last.

got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json *000000410000A7BA00000058* pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json *000000410000A7BA00000057* pg_wal/RECOVERYXLOG"
could not open file "pg_wal/*000000410000A7BA00000058*": No such file or directory
could not restore file "*000000410000A7BA00000058*" from archive: child process exited with exit code 1
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json *000000410000A7BA00000058* pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json *000000410000A7BA00000057* pg_wal/RECOVERYXLOG"

Notice that when *000000410000A7BA00000058* failed, PostgreSQL asked for *000000410000A7BA00000057*, which it had already restored. Afterwards, it asks for *000000410000A7BA00000058* once again.

*Problem*
This is problematic because, as long as the archive keeps up with the primary, the standby will never switch to streaming replication.

*Workaround*
We can get the PostgreSQL replica to become in-sync if we change the `restore_command` to `/bin/false` once we are within `wal_keep_size`.

*Question*
Is this the expected behaviour?
I expect the function `WaitForWALToBecomeAvailable` to switch to streaming replication once a single `restore_command` fails. This also happens when `/bin/false` is used instead.

Any help would be greatly appreciated.

/Kasper Føns
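
P.S. To make the sequence in the logs concrete, here is a minimal Python sketch of the retry pattern as it appears from the outside. It is a toy model, not PostgreSQL's actual code: the `Archive` class, the `observed_standby` and `expected_standby` functions, and the assumption that one new segment is archived every two restore attempts are all hypothetical and exist only for illustration. Segments are reduced to the last byte of the file name (0x57, 0x58).

```python
from dataclasses import dataclass


@dataclass
class Archive:
    """Toy WAL archive: every segment number <= `newest` can be restored."""
    newest: int
    calls_per_segment: int = 2  # assumption: a new segment lands every 2 restore calls
    calls: int = 0

    def restore(self, segment: int) -> bool:
        """Stand-in for restore_command; True means exit code 0."""
        ok = segment <= self.newest
        self.calls += 1
        # The primary keeps archiving while the standby is busy restoring.
        if self.calls % self.calls_per_segment == 0:
            self.newest += 1
        return ok


def observed_standby(archive: Archive, start: int, max_attempts: int):
    """Pattern seen in the logs: after a failed restore, request the
    previous segment again, then retry the failed one."""
    log = []
    pending = [start]  # segments to request next, in order
    while len(log) < max_attempts:
        segment = pending.pop(0)
        ok = archive.restore(segment)
        log.append((segment, ok))
        if not ok:
            pending = [segment - 1, segment]
        elif not pending:
            pending.append(segment + 1)
    return log


def expected_standby(archive: Archive, start: int, max_attempts: int):
    """What I expected: the first failed restore triggers streaming."""
    log = []
    segment = start
    for _ in range(max_attempts):
        ok = archive.restore(segment)
        log.append((segment, ok))
        if not ok:
            return log, "streaming"
        segment += 1
    return log, "archive"


if __name__ == "__main__":
    for seg, ok in observed_standby(Archive(newest=0x57), 0x57, 4):
        print(f"segment {seg:02X}: {'restored' if ok else 'FAILED'}")
    print(expected_standby(Archive(newest=0x57), 0x57, 4)[1])
```

With these assumptions, the observed pattern requests 57, 58 (fails), 57, 58 (succeeds), which is the same order as in the logs read oldest to newest, and it never leaves the archive. The expected variant stops at the first failure and switches to streaming.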
