On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote:
> Now, I am surprised by the third code path of durable_rename_excl(),
> as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause
> any issues, as link() should exit with EEXIST when the startup process
> grabs the same history file concurrently.  It seems to me that in this
> last case using durable_rename() could be an improvement and prevent
> extra WAL receiver restarts as a TLI history fetched from the primary
> via streaming or from some archives should be the same, but we could
> be more careful, like the WAL init logic, by skipping the
> durable_rename() and issuing an elog(LOG).  That would not be perfect,
> still a bit better than the current state of HEAD.
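
For reference, the difference the quoted paragraph relies on can be reproduced outside PostgreSQL with plain POSIX calls. The following is only a minimal sketch: write_file(), install_excl() and install_replace() are hypothetical helpers made up for this illustration, not PostgreSQL functions, and the fsync() calls that the durable_* routines perform are left out. It shows that link() refuses to touch an existing target and reports EEXIST, while rename() replaces it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) a file holding the given text. */
static int
write_file(const char *path, const char *text)
{
	size_t		len = strlen(text);
	ssize_t		written;
	int			fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	written = write(fd, text, len);
	if (close(fd) != 0 || written != (ssize_t) len)
		return -1;
	return 0;
}

/*
 * Hypothetical stand-in for the link()-based approach: hard-link the
 * temporary file to its final name, then drop the temporary name.
 * Returns 0 on success, or the errno value of the failing call.
 * If the target already exists, link() fails with EEXIST.
 */
static int
install_excl(const char *tmp, const char *target)
{
	if (link(tmp, target) < 0)
		return errno;
	if (unlink(tmp) < 0)
		return errno;
	return 0;
}

/*
 * Hypothetical stand-in for the rename()-based approach: an existing
 * target is replaced, so a concurrent writer does not cause a failure.
 * Returns 0 on success, or the errno value of the failing call.
 */
static int
install_replace(const char *tmp, const char *target)
{
	if (rename(tmp, target) < 0)
		return errno;
	return 0;
}
```

If one process has already put "00000002.history" in place, a second process calling install_excl() on its own temporary file gets EEXIST back, which matches the failure mode discussed above. With install_replace() the second copy simply takes over the name, which is harmless as long as both copies have the same contents.
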

Skimming through the buildfarm logs, it happens that the tests are
able to see this race from time to time.  Here is one such example on
rorqual:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check

And here are the relevant logs:
2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG:  restored log file "00000002.history" from archive
2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG:  fetching timeline history file for timeline 2 from primary server
2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL:  could not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": File exists
[...]
2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG:  started streaming WAL from primary at 0/3000000 on timeline 2

The WAL receiver upgrades the ERROR to a FATAL, and restarts
streaming shortly after.  Using durable_rename() would not be an issue
here.
--
Michael