On 2020/10/28 21:02, Sergei Kornilov wrote:
Hello
Sorry for the late response.
> ... but what's the corresponding hazard here, exactly? It doesn't seem
> that there's any way in which the decision one process makes affects
> the decision the other process makes. There's still a race condition:
> it's possible for a walsender
Did you mean walreceiver here?
It's logical walsender. restore_command is used within
logical_read_xlog_page() via XLogReadDetermineTimeline().
Still have no idea what's the corresponding hazard here.
> to use the old restore_command after the
> startup process had already used the new one, or the other way around.
> However, it doesn't seem like that should confuse anything inside the
> server, and therefore I'm not sure we need to code around it.
I came up with the following scenario. Let's say we have xlog files 1,2,3
in dir1 and files 4,5 in dir2. If the startup process had only handled
files 1 and 2 before we switched restore_command from reading dir1 to
reading dir2, it will fail to find the next file. IIUC, it will assume
that recovery is done, and start the server and walreceiver. The walreceiver
will fail as well. I don't know how realistic this case is, though.
That operation is somewhat bogus, if the server is not in standby
mode. In standby mode, startup waits for the next segment safely.
I think it's pilot error. It is already possible to change anything in
restore_command by wrapping the real command in a script:
restore_command = '/bin/restore_wal.sh "%f" "%p"'
And one can simply replace this file with something else with different logic.
Or even by using some command with its own separate settings. Real-world example
(https://github.com/wal-g/wal-g):
restore_command = '. /etc/wal-g/WALG_AWS_ENV; wal-g wal-fetch "%f" "%p"'
And it is possible to change the real WAL source in the ENV script without changing
the restore_command. We can't track this, so I do not see new issues here.
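
To make the wrapper-script point concrete, here is a minimal sketch of what such
a /bin/restore_wal.sh might look like. It is only an illustration: the
WAL_ARCHIVE_DIR variable, the default path, and the restore_wal function name
are assumptions made up for this example, not anything PostgreSQL or wal-g
defines. Changing the variable (or editing the script) redirects recovery to a
different WAL source while restore_command itself stays untouched, which is
the dir1/dir2 situation described above.

```shell
#!/bin/sh
# Hypothetical wrapper invoked as: restore_command = '/bin/restore_wal.sh "%f" "%p"'
# WAL_ARCHIVE_DIR is an assumed, site-specific variable (not a PostgreSQL setting).

restore_wal() {
    wal_file="$1"      # %f: file name of the requested WAL segment
    target_path="$2"   # %p: path the server wants the segment copied to
    archive_dir="${WAL_ARCHIVE_DIR:-/mnt/archive/dir1}"

    # A non-zero status reports that the segment is not available here.
    [ -f "$archive_dir/$wal_file" ] || return 1
    cp "$archive_dir/$wal_file" "$target_path"
}

# Run only when arguments were passed, i.e. when called as restore_command.
[ "$#" -eq 0 ] || restore_wal "$@"
```

If the script serves files from dir1 and is later pointed at dir2, a request for
a segment that only exists in the other directory simply returns non-zero, and
the server has no way to know that the source changed.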
Sergey, could you please attach this thread to the upcoming CF, if
you're going to continue working on it.
Sure, I created one: https://commitfest.postgresql.org/30/2802/
+1 to mark restore_command as PGC_SIGHUP.
Currently, when restore_command is not set, archive recovery fails
at the beginning. With the patch, how should we treat the case where
restore_command is reset to empty during archive recovery? Should we
reject that change of restore_command?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION