Dear team,

We recently observed a few cases where Postgres running on Linux encountered an issue with WAL segment files. Specifically, two WAL segments were linked to the same physical file after Postgres ran out of memory and the OOM killer terminated one of its processes. This resulted in the WAL segments overwriting each other and Postgres failing a later recovery.

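For anyone who wants to check a cluster for the condition described above (two WAL segment names resolving to one physical file), comparing inode numbers is one possible approach. The following is only a minimal sketch, not part of Postgres: the function name find_shared_wal_files is hypothetical, the WAL directory path is assumed to be supplied by the caller, and it relies on WAL segment file names being 24 hexadecimal characters.

```python
import os
import re
import stat
from collections import defaultdict

# WAL segment file names consist of 24 uppercase hexadecimal characters.
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def find_shared_wal_files(wal_dir):
    """Return groups of WAL segment names that share one physical file.

    Hypothetical helper for illustration: two directory entries are
    treated as the same physical file when they have the same
    (device, inode) pair, which is what a hard link produces.
    """
    by_inode = defaultdict(list)
    for name in os.listdir(wal_dir):
        if not WAL_SEGMENT_RE.match(name):
            continue  # skip history files, archive_status, etc.
        st = os.lstat(os.path.join(wal_dir, name))
        if not stat.S_ISREG(st.st_mode):
            continue  # only regular files are of interest here
        by_inode[(st.st_dev, st.st_ino)].append(name)
    # Keep only inodes referenced by more than one segment name.
    return sorted(sorted(names) for names in by_inode.values() if len(names) > 1)
```

Calling find_shared_wal_files() on a pg_wal directory returns an empty list when every segment name maps to its own file. It only reports the state of the directory at the time of the call and does not say how the links came to exist.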
We found this fix [1] that has been applied to Postgres 16, but the cases we observed were running Postgres 15. Given that older major versions will be supported for a good number of years, and the potential for irrecoverability exists (even if rare), we would like to discuss the possibility of back-patching this fix.

Are there any technical reasons not to back-patch this fix to older major versions?

Thank you for your consideration.

Sincerely,
Robert Pang

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3

On Sat, May 7, 2022 at 1:19 AM Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> > I'd agree with removing all the callers at the end. pgrename() is
> > quite robust on Windows, but I'd keep the two checks in
> > writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> > consider a past TLI history file as in-use even if we have a crash
> > just after the file got created in the same path by the same standby,
> > and the WAL segment init part. Your patch does that.
>
> As v16 is now open for business, I have revisited this change and
> applied 0001 to change all the callers (aka removal of the assertion
> for the WAL receiver when it overwrites a TLI history file). The
> commit log includes details about the reasoning of all the areas
> changed, for clarity, as of the WAL recycling part, the TLI history
> file part and basic_archive.
> --
> Michael