Currently, the startup process tries the "restore_command" before using the WAL files locally available under pg_wal/ [1]. I believe we should change this behavior.
== The Problem ==

This issue came to our attention after we migrated an application from an object storage backend, and noticed that restarting a standby node takes hours or sometimes days. We use a shared WAL archive and find it practical to have "restore_command" configured in case we leave a standby offline for a long time. However, during the short window in which the standby is restarted, the master manages to archive a segment. The standby then executes "restore_command" successfully and continues downloading WAL from the archive, causing the recovery to take orders of magnitude longer.

== The Workarounds ==

We could possibly work around this inside the "restore_command", or by delaying the archiving. Working around it inside the "restore_command" would involve checking whether the file already exists under pg_wal/. This is not easy, because the WAL file may be only partially written. It would be easier for Postgres to do this check, as it knows where to stop processing the local WAL.

It should also be possible to work around the problem by delaying archiving using "wal_keep_segments", or replication slots, or simply with sleep(). Though, none of those is the correct solution to the problem. We don't need the master to keep more segments for the standbys; we already have more than enough.

== The Change ==

This "restore_command" behavior comes from the initial archiving and point-in-time-recovery implementation [2]. The code says "the reason is that the file in XLOGDIR could be an old, un-filled or partly-filled version that was copied and restored as part of backing up $PGDATA." This was probably a good reason in 2004, but I don't think it still is. AFAIK "pg_basebackup" eliminates this problem. Also, by this reasoning we should also try streaming from the master before trying the local WAL, but AFAIU we don't.

If there is consensus on fixing this, I can try to prepare a patch.
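For illustration, the "restore_command"-side workaround described above could be sketched roughly as below. Everything here is my own assumption, not an established recipe: the function name, the plain-directory archive standing in for the real archive fetch, and the naive full-segment size check, which is exactly the part that is unreliable for partially written files and why Postgres itself is better placed to decide.

```shell
# Hypothetical restore_command wrapper logic, written as a function so
# the guard can be exercised directly. All names are illustrative.
WAL_SEG_SIZE=16777216   # default 16MB segment size; differs if initdb used --wal-segsize

restore_wal() {
    walfile="$1"; dest="$2"; pg_wal_dir="$3"; archive_dir="$4"
    local_copy="$pg_wal_dir/$walfile"
    # Skip the archive fetch only when a complete local segment exists.
    # The size check is a crude guard against partially written files
    # (GNU stat syntax; BSD would need stat -f %z).
    size=$(stat -c %s "$local_copy" 2>/dev/null || echo 0)
    if [ "$size" -eq "$WAL_SEG_SIZE" ]; then
        return 1   # a non-zero exit makes recovery fall back to the copy in pg_wal/
    fi
    # Otherwise fetch from the archive (a plain directory here; in
    # practice this would be the real archive fetch command).
    cp "$archive_dir/$walfile" "$dest"
}
```

Such a wrapper would be wired in as e.g. restore_command = '/path/to/wrapper.sh %f %p'. Note that the size check cannot distinguish a full segment that was only partly flushed from a genuinely complete one, which is the limitation described above.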
The company I am currently working for is also interested in sponsoring a support company to fix this problem.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/transam/xlogarchive.c;h=5c6de4989c9a#l72
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66ec2db7284